Abstract
Background
Several studies have demonstrated that protein fold space is structured hierarchically and that power-law statistics are satisfied in the relation between the numbers of protein families and protein folds (or superfamilies). We examined the internal structure and statistics in the fold space of 50-amino-acid-residue segments taken from various protein folds. We used inter-residue contact patterns to measure the tertiary structural similarity among segments. Using this similarity measure, the segments were classified into a number (K_{c}) of clusters. We examined various K_{c} values for the clustering. The spatial resolution to differentiate the segment tertiary structures increases with increasing K_{c}. Furthermore, we constructed networks by linking structurally similar clusters.
Results
The network was partitioned persistently into four regions for K_{c} ≥ 1000. This main partitioning is consistent with results of earlier studies, where similar partitioning was reported in classifying protein domain structures. Furthermore, the network was partitioned naturally into several dozen sub-networks (i.e., communities): clusters within a sub-network were mutually connected by numerous links, whereas clusters in different sub-networks were connected by only a few links. For K_{c} ≥ 1000, there were about 40 major sub-networks, and the contents of the major sub-networks were conserved. This sub-partitioning is a novel finding, suggesting that the network is structured hierarchically: segments construct a cluster, clusters form a sub-network, and sub-networks constitute a region. Additionally, the network was characterized by non-power-law statistics, which is also a novel finding.
Conclusion
The main findings are: (1) The universe of 50-residue segments found here was characterized by non-power-law statistics. Therefore, the universe differs from those reported so far for protein domains. (2) The 50-residue segments were partitioned persistently and universally into some dozens (ca. 40) of major sub-networks, irrespective of the number of clusters. (3) These major sub-networks encompassed 90% of all segments. Consequently, the protein tertiary structure is constructed using these dozens of elements (sub-networks).
Background
Despite the vast number of amino-acid sequences, protein folds (or superfamilies) are quantitatively limited [1-4]. Consequently, protein fold classification is an important subject for elucidating the construction of protein tertiary structures. A key word to characterize protein folds is "hierarchy". Well-known databases, SCOP [5] and CATH [6], have classified the tertiary structures of protein domains hierarchically. Similarly, a tree diagram was produced to classify the folds [7].
Mapping the tertiary structures of full-length protein domains to a conformational space generates a structure distribution: a so-called protein fold universe [8-11]. A key word to characterize the fold universe is "space partitioning". A two-dimensional (2D) representation of the fold universe was proposed in earlier reports [12,13], where the universe was partitioned into three fold (α, β, and α/β) regions. A three-dimensional (3D) fold universe was partitioned into four fold regions: all-α, all-β, α/β, and α+β [10]. Software accessible on the PDBj web site (http://eprots.protein.osaka-u.ac.jp/globe.cgi) presents the distribution on the surface of a globe [14].
The structures of short protein segments have also been studied. Segments a few (2–3) amino-acid residues long were projected in a 2D space, where some typical combinations frequently appeared [15]. Fold universes of segments 4–9 residues long [16] and 10–20 residues long [17-19] showed several clearly distinguishable structural clusters. A systematic survey of 10–50 residue segments has shown that the fold universe is classifiable into segment universes of three types: short (10–22 residues), medium (23–26 residues), and long (27–50 residues) [20]. In that work, the 3D shape of the universe varied abruptly at lengths of 23 and 27 residues. A sequence-structure correlation found in short segments supports the tertiary structure prediction of full-length proteins [21-23].
These studies of protein segments and domains exemplify some structural clusters existing in the low-dimensional (2D or 3D) conformational space. The benefit of the low-dimensional expression is that one can readily imagine the shape of the universe. With increasing segment length, however, the lowering of the space dimensionality hides the internal architecture of the structure distribution. Consequently, the internal architecture of the distribution for 50-residue segments (or longer segments) is unclear [20]. To supplement the low-dimensional expression with full-dimensional information, a network is helpful in which two structures close to each other in the full-dimensional conformational space are connected.
Presume an ensemble of points (or nodes). Inter-node linkages form the networks. The network concept has been applied recently to biological systems [24-27]. Structurally similar segments can be linked for the segment fold universe. The structural similarity is computed for the overall structures of two segments (i.e., all coordinates of the segments). Therefore, the similarity is a quantity defined in full-dimensional space. Consequently, a 2D or 3D universe consisting of linked nodes involves full-dimensional information. To assign inter-node linkage in the ensemble, a score is needed to quantify the structural similarity between two tertiary structures. Inter-residue contact (native contact) patterns have been used as reaction coordinates in protein folding studies [28-30]. When two structures have similar native contact patterns, they exhibit similar inter-residue packing. Results of several studies indicate that the native contacts are useful indicators to assess the protein folding process [31-43] and folding time scale [41-43].
Herein, we constructed a fold network of 50-residue segments taken from four major structural classes of protein domains. We used the inter-residue contact pattern for the similarity score. The resultant networks showed the main partitioning, as expected. Furthermore, as a new finding, the network of the segment structures was partitioned into dozens of universal communities (sub-networks). From these observations, we propose a novel protein structure hierarchy with communities as one hierarchy level. The novelty of the currently identified hierarchy is supported by non-power-law statistics in the hierarchy, which differ from the power-law statistics characterizing other hierarchies found so far for protein tertiary structures.
Results
As described in Methods, 50-residue segments were taken from representative proteins and classified into K_{c} clusters, each of which consists of structurally similar segments. We calculated the native contact patterns that are common in each cluster, and constructed networks by connecting the clusters according to their contact pattern similarity. In Results, we first examine the general aspects of the obtained clusters. Second, we check the conformational distribution using a 3D map. Finally, we characterize the 50-residue segment universe using a network analysis.
Throughout this paper, indices i and j specify residue positions in a 50-residue segment, s and t segment ordinal numbers, u and v cluster ordinal numbers, and w a community ordinal number.
General aspects for clusters
Figure 1A portrays the dependence of the average cluster size <S> (Eq. 3) on the number K_{c} of clusters. K_{c} determines the spatial resolution at which the universe of the 50-residue segments is viewed: with decreasing K_{c}, <S> increases because structurally different segments are fused into a cluster. The change of <S> was rapid for small K_{c} and slow for larger K_{c}.
Figure 1. <S > and <O > as a function of K_{c}. (A) <S > is the average cluster size (Eq. 3). The error bar shows the standard deviation over clusters. (B) <O > is the average number of segments supplied by a protein to a cluster (see the text for a detailed definition of <O >).
The segments were generated by sliding a 50-residue window one residue at a time along the domain sequences (see Methods). Consequently, two segments taken from the same protein domain and adjacent in the sequence might have similar structures and might therefore be included in the same cluster. We did the following analysis to assess this possibility quantitatively. Presume that a cluster u involves n_{m} segments originating from a protein m. We then introduced a quantity O_{u} = (Σ_{m} n_{m}) / N_{p}, where the summation is taken over proteins that supply segment(s) to the cluster u, and N_{p} is the number of those proteins. Figure 1B presents a plot of the average of O_{u} as a function of K_{c}: <O> = (Σ_{u} O_{u}) / K_{c}. For K_{c} = 1000, <O> converged to 2.2. Consequently, a protein supplies only two or three segments to a cluster on average: i.e., a cluster does not contain excessive segments derived from a single protein for K_{c} ≥ 1000.
Figure 2 depicts the number (n_{u}) of segments involved in a cluster as a function of the cluster ordinal number for K_{c} = 1000. The decay of n_{u} is non-exponential. It is particularly interesting that even cluster #950 involves more than 100 segments, which means that the cluster comprises more than 40 (= 100/2.5) different proteins (<O> ≈ 2.5 for K_{c} = 1000). In the last 50 clusters, n_{u} decreased quickly. These clusters consist of randomly structured segments. Although segments were taken from the all-α, all-β, α/β, and α+β SCOP classes, the structures can be random.
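The quantity O_{u} described above can be illustrated with a minimal Python sketch. It assumes that each segment in a cluster is tagged with a label for the protein it was cut from; the function names and the protein labels are hypothetical and serve only as illustration.

```python
from collections import Counter

def mean_segments_per_protein(protein_ids):
    """O_u: average number of segments that one protein supplies to a cluster.

    `protein_ids` holds one entry per segment in the cluster, naming the
    protein the segment was cut from.
    """
    n_m = Counter(protein_ids)            # n_m for every contributing protein m
    return sum(n_m.values()) / len(n_m)   # (sum of n_m) / N_p

def average_over_clusters(clusters):
    """<O>: mean of O_u over all clusters."""
    return sum(mean_segments_per_protein(c) for c in clusters) / len(clusters)

# Hypothetical cluster contents (protein labels are made up).
cluster_a = ["protA", "protA", "protA", "protB", "protC", "protC"]
cluster_b = ["protD"]
o_a = mean_segments_per_protein(cluster_a)   # 6 segments from 3 proteins
o_mean = average_over_clusters([cluster_a, cluster_b])
```

A value of O_{u} close to 1 means that nearly every segment in the cluster comes from a different protein, whereas a large value indicates that overlapping windows from a few proteins dominate the cluster.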
Figure 2. Number n_{u }of segments in a cluster as a function of the ordinal number of the cluster.
Figure 3 depicts <f>_{Kc} (Eq. 9) as a function of K_{c}. The value of <f>_{Kc} was 0.60–0.65 for K_{c} ≥ 1000. The similarity threshold f_{0} for assigning the inter-cluster linkage (Eq. 7) was 0.7. Figure 3 thus shows that the similarity required for an inter-cluster link is comparable to the intra-cluster similarity.
Figure 3. Averaged correlation coefficient <f>_{Kc} (Eq. 9) for intra-cluster segments as a function of K_{c}.
Fold universe and network of clusters
The inter-cluster (inter-node) links were assigned to the K_{c} clusters according to the adjacency matrix a_{uv}. Directly connected clusters have mutually similar inter-residue contact patterns. Internal architectures of the networks were investigated by dividing the networks into communities (sub-networks) using Newman's method [44]. In parallel, we projected the networks into a 3D space to obtain positions in the conformational space (see Additional file 1 for details). Although the clusters were embedded in the 3D space, the inter-cluster links were given to clusters that are mutually close in the full-dimensional space.
Additional file 1. Supplementary Methods and Supplementary Results. There are three sections in the Supplementary Methods as follows: (1) The method of embedding the inter-cluster network into 3D space. (2) The definition of F-measure. (3) The coloring method for clusters in the 3D network. In the Supplementary Results, tertiary structures of fragments in the same cluster and those in the same community are discussed.
Each community was characterized by five biophysical structural features: the numbers of α, β, and αβ secondary-structure elements, the radius of gyration, and the number of inter-residue contacts, denoted respectively as n_{α}, n_{β}, n_{αβ}, R_{g}, and N_{contact}. Then, the communities were classified into four types (α, β, αβ, and randomly structured communities) depending on the five structural features (see Methods for details).
Figure 4 portrays the 3D cluster distributions at K_{c} = 1000, 2000, and 3000, where a single color was assigned to a community depending on the secondary-structure elements n_{α}, n_{β}, and n_{αβ} (see Additional file 1 for details). This figure clearly illustrates that the 3D cluster network is partitioned into four fold regions (mainly α, mainly β, αβ, and randomly structured regions) independent of K_{c}, which respectively consist of α, β, αβ, and randomly structured communities. We termed this partitioning "main partitioning". Figure 5 shows that the overall shape of the network adopted a three-leaf clover shape (mainly α, mainly β, and αβ regions surrounding the randomly structured region). We checked quantitatively whether the 3D distribution reflected the original full-dimensional distribution by calculating the F-measure (see Additional file 1 for its definition). The value of the F-measure was 0.804 for K_{c} = 1000, 0.673 for K_{c} = 2000, and 0.593 for K_{c} = 3000. The large value for K_{c} = 1000 indicates that the 3D cluster distribution fairly reflects the full-dimensional distribution. The value decreased with increasing K_{c}. However, the three-leaf clover shape of the distribution was conserved at various K_{c}, which strongly suggests that the main partitioning exists in the 50-residue segment universe.
Figure 4. Networked 3D distribution of clusters for K_{c} = 1000 (A), 2000 (B), and 3000 (C). In this figure, a sphere represents a cluster. The larger the sphere, the more segments the cluster involves. The coloring method for clusters and inter-cluster links is explained briefly below (see Additional file 1 for details): The α, β, and αβ communities are, respectively, red, blue, and green. The larger the secondary-structure content in a community, the greater the color strength. All randomly structured communities are shown in black. Colors assigned to cluster-cluster links are as follows: red for links within α communities, blue for those within β communities, green for those within αβ communities, and black for other links.
Figure 5. Main and subpartitioning of the cluster network.
Figure 6 displays segment tertiary structures picked from clusters. This figure shows that the structure classification by the five structural features correlates well with the visual secondary-structure constitution. Most segments originating in the all-α SCOP fold class were assigned to the α communities (see a1 and a2 in Figure 6). Those that originated in the all-β SCOP fold class were assigned to the β communities (see b1–b3). The majority of segments taken from the α/β SCOP fold class were assigned to the αβ communities (see c1–c4), although some were involved in other fold regions. In contrast, segments from the α+β SCOP fold class were scattered over all the fold regions because the α+β proteins are a mixture of helices, strands, and randomly structured fragments, where the α and β secondary-structure elements are not necessarily neighbors in the sequence. Consequently, the 50-residue segments from the α+β proteins can involve various structural features. The randomly structured region contained clusters with a few secondary-structure elements (see r1–r4 in Figure 6). However, its polypeptide packing was loose, as portrayed in Figure 7, where the randomly structured clusters had large R_{g}.
Figure 6. Tertiary structures picked from the 3D distribution for K_{c} = 1000. Colors of clusters are the same as those depicted in Figure 4. Inter-cluster links are not shown. This figure is presented with the same orientation as that of Figure 4.
Figure 7. Radius of gyration R_{g }of clusters. With increasing R_{g}, the cluster color is redder. This figure is presented with the same orientation as that of Figure 4.
Non-power-law statistics
The protein-domain universe is known to be an extremely biased distribution [8,45]. Many studies have suggested a power-law statistic to represent the relation between the number of families and the number of folds [9,46,47]. For instance, Shakhnovich and coworkers created a protein-domain universe graph (PDUG) using the DALI Z-score as the similarity score, and showed that the domain universe followed a power-law distribution [9]. Consequently, it is interesting to check whether the currently produced network of the 50-residue segments follows the power-law distribution.
First, we calculated the number (n_{seg}) of segments involved in each cluster. Figures 8A, 8B, and 8C portray the relation between n_{seg} and the number of clusters that involve n_{seg} segments at K_{c} = 1000, 2000, and 3000, respectively. The distributions were symmetric on the X-axis, log(n_{seg}) (the value of skewness was 0.138 for K_{c} = 1000, 0.006 for K_{c} = 2000, and 0.066 for K_{c} = 3000), and far from power-law statistics. Therefore, the currently obtained universe differs from those reported so far. Additionally, we calculated the number (n'_{seg}) of segments involved in each community, and examined the relation between n'_{seg} and the number of communities involving n'_{seg} segments for K_{c} = 1000, 2000, and 3000. We again obtained non-power-law statistics in the relation (data not shown).
Figure 8. Relation between number (n_{seg}) of segments involved in a cluster and number of clusters for K_{c }= 1000 (A), 2000 (B), and 3000 (C).
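The symmetry check on log(n_{seg}) can be sketched in Python as follows. The text does not state which skewness estimator or logarithm base was used, so the moment-based (population) skewness and the base-10 logarithm below are assumptions, and the cluster sizes are hypothetical.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical numbers of segments per cluster (n_seg).
cluster_sizes = [120, 150, 200, 260, 330]
log_skew = skewness([math.log10(n) for n in cluster_sizes])
```

A skewness near zero on the logarithmic axis indicates a roughly log-normal-like, symmetric distribution, whereas a power-law distribution of cluster sizes would remain strongly right-skewed even after taking the logarithm.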
Next, we calculated a connectivity distribution, P(k), of the networks to investigate details of the cluster network [48]. P(k) is defined as the distribution function of clusters that have k links to other clusters. Figures 9A, 9B, and 9C respectively present P(k) at K_{c} = 1000, 2000, and 3000. In each case, P(k) decays exponentially with increasing k. Therefore, these distributions are exponential ones (or possibly truncated power-law distributions). Consequently, non-power-law networks (i.e., non-scale-free networks) are again observed for the current networks.
Figure 9. Connectivity distribution P(k) of the cluster network at K_{c} = 1000 (A), 2000 (B), and 3000 (C). The X-axis k shows the number of links of a cluster connected to other clusters. Solid lines are the best-fit curves drawn assuming that P(k) decays exponentially with k.
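As an illustration of how P(k) and an exponential fit can be obtained from an adjacency matrix, a minimal Python sketch follows. The fitting procedure used for Figure 9 is not described in the text; the log-linear least-squares fit of P(k) = A exp(−k/k0) below is an assumption, and the toy network is hypothetical.

```python
import math
from collections import Counter

def degree_distribution(adjacency):
    """P(k): fraction of nodes (clusters) that have exactly k links."""
    degrees = [sum(row) for row in adjacency]
    counts = Counter(degrees)
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}

def fit_exponential(p_of_k):
    """Least-squares fit of ln P(k) = ln A - k / k0; returns (A, k0)."""
    ks = [k for k, p in p_of_k.items() if p > 0]
    ys = [math.log(p_of_k[k]) for k in ks]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    slope = (sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
             / sum((k - mean_k) ** 2 for k in ks))
    intercept = mean_y - slope * mean_k
    return math.exp(intercept), -1.0 / slope

# Toy network: a path of four nodes (0-1-2-3).
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
p_k = degree_distribution(path)
```

On a semi-logarithmic plot an exponential P(k) appears as a straight line, whereas a power law appears straight only on a log-log plot; this is the distinction drawn in the text.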
Robustness of communities
We conducted modularity analysis to study cluster networks from another perspective. First, the networks were divided into communities (see Methods). The modularity Q_{mod} is an index to assess how well the network is divided into communities [49]: 0 ≤ Q_{mod} ≤ 1. A network with a large Q_{mod} is characterized by numerous intra-community links and few inter-community links. Figure 10A portrays the K_{c} dependence of Q_{mod}, which has its maximum at K_{c} = 200, indicating that the communities were highly isolated at K_{c} = 200. For K_{c} > 200, the communities were connected gradually by links, thereby decreasing Q_{mod}. For K_{c} ≥ 1000, Q_{mod} converged to a value (0.63), which indicates that the 50-residue segment network is characterized by high modularity.
Figure 10. K_{c} dependence of N_{com} and Q_{mod}. (A) The K_{c} dependence of modularity Q_{mod} (Eq. 10). (B) The bar graph shows the K_{c} dependence of the number, N_{com}, of communities (left y-axis). The line with filled circles represents the ratio (right y-axis) of clusters in major communities to all clusters.
We next calculated the number of communities at various K_{c}. We classified the communities into major and minor communities. Major ones are communities consisting of more than three clusters. Minor ones are small isolated communities consisting of only one or two clusters without links to other communities. No community involves only one cluster linked to another community. The K_{c} dependence of the number (N_{com}) of the major communities is presented in Figure 10B. The minor communities do not characterize the overall property of the network because only 10% of clusters belong to the minor communities at any K_{c}. The increase of N_{com} with increasing K_{c} was rapid for 100 ≤ K_{c} ≤ 1000 and slow for K_{c} ≥ 1000. The values of N_{com} were, respectively, 36, 38, and 38 at K_{c} = 1000, 2000, and 3000. This result shows that the number of communities was conserved for K_{c} ≥ 1000.
In addition to the analysis presented above, we checked whether the contents (i.e., segments) involved in the communities are conserved with variation of K_{c}. To this end, we assigned a single color to communities common to the universes at K_{c} = 1000 (Figure 11A), 2000 (Figure 11B), and 3000 (Figure 11C). For instance, the majority of segments in the orange community of Figure 11A were involved in the orange ones in Figures 11B and 11C. Consequently, the communities are conserved well in the universes at different K_{c}. In other words, the network partitioning into communities is universal, independent of the spatial resolution (i.e., K_{c}). We termed this inter-community partitioning "sub-partitioning", whereas the main partitioning is inter-regional partitioning (Figure 5).
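The modularity index can be made concrete with a short Python sketch. Eq. 10 is not reproduced in this part of the paper, so the code assumes the standard Newman-Girvan definition, Q = (1/2m) Σ_{uv} [a_{uv} − k_{u}k_{v}/(2m)] δ(w_{u}, w_{v}), for an unweighted, undirected network; the toy graph and the community labels are hypothetical.

```python
def modularity(adjacency, community_of):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph.

    `adjacency` is a symmetric 0/1 matrix; `community_of[u]` is the community
    label of node u.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    two_m = sum(degree)                      # every link is counted twice
    q = 0.0
    for u in range(n):
        for v in range(n):
            if community_of[u] == community_of[v]:
                q += adjacency[u][v] - degree[u] * degree[v] / two_m
    return q / two_m

# Toy network: two triangles (0,1,2) and (3,4,5) joined by the single link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy = [[0] * 6 for _ in range(6)]
for a, b in edges:
    toy[a][b] = toy[b][a] = 1

q_split = modularity(toy, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_merged = modularity(toy, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the toy network into its two triangles gives a clearly positive modularity, whereas putting all nodes into one community gives zero, which matches the interpretation in the text that a large Q_{mod} reflects many intra-community and few inter-community links.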
Figure 11. Communities at K_{c }= 1000 (A), 2000 (B), and 3000 (C). For each universe, only the top 13 communities by the number of involved clusters are shown. A single color is assigned to communities that are common to the three universes. Communities that are not common among the three are not shown, nor are minor communities.
Discussion
Herein, we described universal partitioning of two types in the 50-residue segment networks (Figure 5) based on the network analysis. The main partitioning (the network separation by fold regions) resembles that in the classification scheme of existing databases such as CATH and SCOP. The mainly α, mainly β, αβ, and randomly structured regions consist respectively of α, β, αβ, and randomly structured communities. In addition, for the first time, we found communities in the segment fold universe: this sub-partitioning (network separation by communities) is a novel finding. High modularity ensures persistently existing communities, where the intra-community clusters are linked tightly and the inter-community clusters are linked weakly. The universality of the sub-partitioning was remarkable for 0.65 ≤ f_{0} ≤ 0.75; outside this range, the universality vanishes gradually. Our results reveal a hierarchically structured universe for 50-residue segments, as depicted in Figure 12. This hierarchy is robust because the main and sub-partitionings are independent of K_{c} for K_{c} ≥ 1000.
Figure 12. Hierarchy in the segment universe proposed from the current study.
Figure 10B shows that the current universe for the 50-residue segments consists of some dozens (ca. 40) of major communities. Kihara and Skolnick reported that the current PDB database might cover almost all structures of small proteins [50]. Crippen and Maiorov generated many self-avoiding conformations of a chain and suggested that the possible structures of a 50-residue chain are classifiable roughly into a small number of types, although secondary-structure formation was not incorporated in their model [51]. Another study proposed the conjecture that tertiary-structure evolution of proteins might be achieved using limited repertoires of basic units such as super-secondary structure elements [52]. Results of such studies are consistent with our results because we have shown that protein tertiary structures can be decomposed into the dozens of major communities of 50-residue segments. Indeed, 90% of clusters belong to the major communities. To link those studies with our study more closely, the detailed contents of each major community should be investigated; such a research project is now in progress. Moreover, the role of the minor communities in protein structure construction should be studied.
The currently observed 50-residue segment universe was characterized by a non-power-law distribution. Our result apparently differs from the power-law distribution widely known for the hierarchical protein domain universe [9,46,47,53]. The emergence of the non-power-law statistics might be related to the use of the inter-residue contact, which is a more relaxed similarity measure than widely used ones such as RMSD or the DALI Z-score. It is known that under power-law statistics the fraction of isolated clusters among all clusters is high [53]. Under our non-power-law statistics, this fraction was low because the relaxed measure provided linkages between clusters. Thus, the two statistics complement each other in surveying the fold universe. From the non-power-law universe, we could show a novel hierarchy (Figure 12) in the universe and the existence of about 40 repertoires (Figure 10) to construct the protein tertiary structures, which have not been reported from the power-law universe. These results were also found in the 60- and 70-residue segment universes (data not shown). This suggests that the non-power law is likely to be a general property of segment universes.
The current network helps to trace conformational changes of segments along the network linkages. The Supplementary Results show that the conformation changes gradually when shifting the view from one cluster to another (see Additional file 1).
The inter-residue contact (native contact) has been widely used as a reaction coordinate in protein folding (see Introduction). We intend to use the currently obtained networks for protein folding studies. The networks of fixed-length segments are readily applicable to conformational sampling in protein folding, where the chain length is usually fixed. The randomly structured clusters are located at the root of the distribution (Figure 4 and Figure 5), from which the segment conformation can diversify to mainly α, mainly β, or αβ regions with increased compactness (Figure 7).
Conclusion
We constructed a 50-residue segment network for investigating the protein local structure universe. The network was partitioned into some dozens (ca. 40) of major communities with high modularity (0.60 < Q_{mod} < 0.65), independent of the spatial resolution (K_{c}). The major communities existed universally and persistently in the universe. Surprisingly, 90% of all segments were covered by the major communities. Consequently, numerous similarities exist among local regions (i.e., 50-residue segments) of proteins. Furthermore, the currently constructed segment networks are characterized by non-power-law (non-scale-free) statistics, which apparently differ from reported characteristics for the fold universe of full-length proteins.
Methods
This section includes six subsections. The first three ("Generation of 50-residue segment library", "Clustering segments", and "Computation of inter-residue contact patterns") are preparative subsections describing construction of the 50-residue segment fold universe. In the subsection titled "Construction of a universe and network", construction of the fold universe and the network is described. "Modularity analysis" presents analyses used to examine the network. The subsection "Characterization of communities by structural features" describes a method to characterize communities depending on five structural features. Specification of indices i, j, s, t, u, v, and w is given at the beginning of Results.
Generation of 50-residue segment library
We generated a structure library of 50-residue segments with reference to the all-α, all-β, α/β, and α+β fold classes defined in the SCOP database (release 1.69) [5]. The SCOP database presents a list that provides a representative for each protein family. We selected tertiary structures of the representative domains from the PDB database [54], eliminating multi-chain domains, those involving structurally undetermined regions, and those shorter than 50 residues. Furthermore, we eliminated domains consisting of 400 residues or more, which might involve structurally repeating units. We thereby obtained 1803 domains (456 from all-α, 393 from all-β, 393 from α/β, and 561 from α+β). A domain that is n_{r} amino-acid residues long produces n_{r} − 49 segments when a 50-residue window is slid along the sequence one residue at a time. Finally, we obtained an ensemble of 186 821 segments (32 040 from all-α, 39 375 from all-β, 63 177 from α/β, and 52 229 from α+β). The residue sites of each segment were renumbered from 1 to 50 in our study.
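The sliding-window procedure can be sketched in a few lines of Python. This is only an illustration under the assumption that a domain is available as an ordered list of residues; the function name and the 120-residue example domain are hypothetical.

```python
def sliding_segments(residues, window=50):
    """Return every contiguous window of `window` residues, shifted one residue at a time."""
    if len(residues) < window:
        return []  # domains shorter than the window yield no segments
    return [residues[start:start + window]
            for start in range(len(residues) - window + 1)]

# Hypothetical domain of n_r = 120 residues, labelled 1..120.
domain = list(range(1, 121))
segments = sliding_segments(domain)
n_segments = len(segments)   # n_r - 49 segments
```

For the hypothetical 120-residue domain this yields 120 − 49 = 71 segments; within each segment the residues would then be renumbered from 1 to 50, as stated in the text.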
Clustering segments
We classified the collected segments into clusters as follows. First, the inter-C_{α} atomic distances were calculated for segment s, where the distance between residues i and j is denoted as r_{s}(i, j). We eliminated residue pairs with |i − j| < 3 because the distances for these pairs are similar for all segments; in other words, those distances have little power to discriminate the structural differences of segments. The number (N_{pair}) of the remaining C_{α}-atomic pairs in a 50-residue segment is therefore 1128. The set of distances is expressed as an N_{pair}-dimensional vector: \mathbf{r}_{s} = [r_{s}(1, 4), r_{s}(1, 5), ..., r_{s}(47, 50)]. We define the root mean square distance (rmsd_{st}) between \mathbf{r}_{s} and \mathbf{r}_{t} in the N_{pair}-dimensional Cartesian space as rmsd_{st} = [ |\mathbf{r}_{s} − \mathbf{r}_{t}|^{2} / N_{pair} ]^{1/2}.
For classifying the 186 821 segments into K_{c} clusters, we applied Lloyd's K-means algorithm [55] to the set of rmsd_{st} values, where s, t = 1, ..., 186 821. The value of K_{c} must be set in advance in the K-means algorithm. We examined various values for K_{c} (K_{c} ≤ 5000). In Lloyd's method, the K_{c} clusters are set randomly at the beginning, and the finally converged clusters are output. We have checked that the main results are independent of the initial set of clusters.
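The pair count and the distance vector can be checked with a small Python sketch. The number 1128 follows from counting the pairs with |i − j| ≥ 3 among 50 residues (1225 pairs in total, minus 49 pairs with |i − j| = 1 and 48 pairs with |i − j| = 2). The helper names are hypothetical, and the normalization of the rmsd by N_{pair} is an assumption based on the term "root mean square".

```python
import math

SEGMENT_LENGTH = 50
MIN_SEPARATION = 3   # pairs with |i - j| < 3 are discarded

def residue_pairs(length=SEGMENT_LENGTH, min_sep=MIN_SEPARATION):
    """List the 1-based residue pairs (i, j), i < j, kept for the distance vector."""
    return [(i, j)
            for i in range(1, length + 1)
            for j in range(i + min_sep, length + 1)]

def distance_vector(ca_coords, pairs):
    """Vector of C-alpha distances r_s(i, j) for the given residue pairs."""
    return [math.dist(ca_coords[i - 1], ca_coords[j - 1]) for i, j in pairs]

def rmsd(vec_s, vec_t):
    """Root mean square difference between two distance vectors (assumed form)."""
    n = len(vec_s)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_s, vec_t)) / n)

pairs = residue_pairs()
n_pair = len(pairs)   # number of retained C-alpha pairs
```

Because the vector is built from internal distances rather than Cartesian coordinates, two segments can be compared without superimposing them first.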
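Lloyd's iteration alternates between assigning every point to its nearest center and recomputing each center as the mean of its members. The following Python sketch illustrates the idea on toy 2D points; it is not the implementation used in this study, and the function name, the seed handling, and the stopping rule are assumptions. Because rmsd_{st} as a root mean square is proportional to the Euclidean distance between the distance vectors, nearest-center assignments are the same under either measure.

```python
import math
import random

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd iteration; returns (assignment, centers)."""
    rng = random.Random(seed)
    # Random initial centers: k distinct points drawn from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return assignment, centers

# Toy data: two well-separated groups of 2D points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = lloyd_kmeans(toy_points, k=2)
```

Since the result of Lloyd's method can depend on the random initial centers, repeating the run with different seeds is a simple way to check, as the text does, that the main results do not depend on the initial set of clusters.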
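We calculated the center (\mathbf{r}_{u}^{c}) of a cluster u in the N_{pair}-dimensional space as \mathbf{r}_{u}^{c} = [r_{u}^{c}(1, 4), r_{u}^{c}(1, 5), ..., r_{u}^{c}(47, 50)], where the element r_{u}^{c}(i, j) is given as
\[ r_{u}^{c}(i, j) = \frac{1}{n_{u}} \sum_{s \in u} r_{s}(i, j). \qquad (1) \]
Here, n_{u} is the number of constituent segments of the cluster u.
We defined a size S_{u} of the cluster u as
\[ S_{u} = \frac{1}{n_{u}} \sum_{s \in u} \left[ \frac{|\mathbf{r}_{s} - \mathbf{r}_{u}^{c}|^{2}}{N_{pair}} \right]^{1/2}. \qquad (2) \]
This equation simply quantifies the average distance from the cluster center to segments belonging to the cluster u in the N_{pair}-dimensional space. The average cluster size is defined simply as
\[ \langle S \rangle = \frac{1}{K_{c}} \sum_{u=1}^{K_{c}} S_{u}, \qquad (3) \]
where the summation is taken over all the K_{c} clusters.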
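The cluster center, the cluster size, and the average cluster size can be sketched in Python as follows. The sketch assumes that the size of a cluster is the mean root-mean-square distance of its member vectors from the center, which is how the description in the text reads; the function names and the toy vectors are hypothetical.

```python
import math

def cluster_center(vectors):
    """Element-wise mean of the distance vectors in one cluster."""
    n_u = len(vectors)
    return [sum(column) / n_u for column in zip(*vectors)]

def rms_distance(a, b):
    """Root mean square difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def cluster_size(vectors):
    """Mean rms distance from the cluster center to the member vectors (assumed S_u)."""
    center = cluster_center(vectors)
    return sum(rms_distance(v, center) for v in vectors) / len(vectors)

def average_cluster_size(clusters):
    """<S>: mean of the cluster sizes over all clusters."""
    return sum(cluster_size(c) for c in clusters) / len(clusters)

# Toy clusters with two-component "distance vectors".
cluster_1 = [[0.0, 0.0], [2.0, 2.0]]
cluster_2 = [[5.0, 5.0]]
s_1 = cluster_size(cluster_1)
s_avg = average_cluster_size([cluster_1, cluster_2])
```

A cluster with a single member has size zero, which is why the average cluster size shrinks as K_{c} grows and clusters become smaller and more homogeneous.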
Computation of interresidue contact patterns
In this subsection, we present computation of the intercluster and intracluster structural similarity based on the interresidue contact patterns. The interresidue contacts in segment s were defined as follows: Calculating all the interheavy atomic distances between residues i and j for the segment, their minimum distance was registered as the interresidue distance q_{s}(i, j). Then, if q_{s}(i, j) < 6.0 Å, we judged that the residues i and j were contacting and set a quantity c_{s}(i, j) to 1 (otherwise, c_{s}(i, j) = 0). Here, we again eliminated residue pairs of i  j < 3 in the calculation of c_{s}(i, j). The set of c_{s}(i, j) constructs a matrix C_{s}, where element (i, j) is c_{s}(i, j).
The upper limit (6.0 Å) for q_{s}(i, j) allows no penetration of a water molecule between residues i and j: At q_{s}(i, j) = 6.0 Å, the substantial space for water penetration between the residues is approximately 2.0 Å (= 6.0  2 × 2.0) assuming that radii of segment heavy atoms are 2.0 Å. This space of 2.0 Å is smaller than the diameter of a water molecule (2.8 Å).
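The contact definition (before smoothing) can be illustrated with a minimal Python sketch. It assumes that each residue is given as a list of heavy-atom coordinates in ångströms; the function names and the toy coordinates are hypothetical, and the smoothing step is deliberately not included.

```python
import math

CONTACT_CUTOFF = 6.0   # angstroms; upper limit for q_s(i, j)
MIN_SEPARATION = 3     # pairs with |i - j| < 3 are skipped

def min_heavy_atom_distance(atoms_i, atoms_j):
    """q(i, j): smallest distance between any heavy atom of i and any of j."""
    return min(math.dist(a, b) for a in atoms_i for b in atoms_j)

def contact_matrix(residues, cutoff=CONTACT_CUTOFF, min_sep=MIN_SEPARATION):
    """Binary, symmetric contact matrix c(i, j) for one segment (0-based indices)."""
    n = len(residues)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_sep, n):
            if min_heavy_atom_distance(residues[i], residues[j]) < cutoff:
                c[i][j] = c[j][i] = 1
    return c

# Toy segment of five residues with one heavy atom each.
toy_residues = [[(0.0, 0.0, 0.0)],
                [(4.0, 0.0, 0.0)],
                [(8.0, 0.0, 0.0)],
                [(4.0, 3.0, 0.0)],
                [(20.0, 0.0, 0.0)]]
toy_contacts = contact_matrix(toy_residues)
```

In the toy segment, residues 1 and 4 (indices 0 and 3) are 5 Å apart and are therefore in contact, whereas residues 1 and 2 are only 4 Å apart but are skipped because their sequence separation is below 3.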
A structural similarity between segments s and t might be counted by comparing C_{s }and C_{t}. However, a strict comparison engenders an oversight of the similarity in the following case: Presume that c_{s}(i, j) = 1 and c_{t}(i,+ 1, j) = 0 in the segment s, and c_{s}(i, j) = 0 and c_{t}(i,+ 1, j) = 1 in segment t. The interresidue contacts in these segments differ but they are similar. The strict comparison does not count such a similarity. To incorporate such similarity, smoothing of C_{s }was performed as
This smoothing (see Figure 13) was done only when residues i' and j' are not contacting and the residues i and j are contacting in the segment. If Eq. 4 produces a negative value, then c_{s}(i', j') is set to zero. If a noncontacting residue pair (i', j') has multiple values for c_{s}(i', j') attributable to contributions of some contacting pairs around (i', j'), then the largest value is assigned to the noncontacting pair. As described in this paper, the interresidue contact matrix C_{s }indicates that after the smoothing.
Figure 13. Smoothed interresidue contacts c(i, j) (Eq. 4). It is presumed that residue pair (i, j) is in contact (i.e., c(i, j) = 1) and that the other pairs are noncontacting. Equation 4 provides negative c_{s}(i', j') at sites where the inequality |i − i'| + |j − j'| + ||i − i'| − |j − j'|| > 5 is satisfied. This inequality is satisfied without exception when any one of the three inequalities |i − i'| > 2, |j − j'| > 2, or ||i − i'| − |j − j'|| > 2 is met. Those negative c_{s}(i', j') are reset to zero (see text).
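The bookkeeping rules of the smoothing (only noncontacting pairs are modified, negative values are reset to zero, and the largest contribution wins) can be sketched as follows. The explicit form of Eq. 4 is not reproduced in this sketch: `hypothetical_weight` is a stand-in chosen only so that it becomes negative when the offset measure of the Figure 13 caption exceeds 5, and the reading of that measure as |Δi| + |Δj| + ||Δi| − |Δj|| is likewise an assumption. All function names are introduced here for illustration.

```python
def offset_measure(di, dj):
    # Assumed reading of the Figure 13 inequality: |di| + |dj| + ||di| - |dj||
    return abs(di) + abs(dj) + abs(abs(di) - abs(dj))

def hypothetical_weight(x):
    # Stand-in for Eq. 4 (NOT the published formula): negative once x > 5.
    return 1.0 - x / 5.0

def smooth_contacts(C, weight=hypothetical_weight, reach=2):
    """Spread each contact (value 1) onto nearby noncontacting pairs."""
    n = len(C)
    S = [[float(v) for v in row] for row in C]
    contacts = [(i, j) for i in range(n) for j in range(n) if C[i][j] == 1]
    for i, j in contacts:
        # Only offsets of at most `reach` in each index are visited.
        for ip in range(max(0, i - reach), min(n, i + reach + 1)):
            for jp in range(max(0, j - reach), min(n, j + reach + 1)):
                if C[ip][jp] == 1:
                    continue  # contacting pairs keep the value 1
                w = max(0.0, weight(offset_measure(i - ip, j - jp)))  # negative -> 0
                S[ip][jp] = max(S[ip][jp], w)  # largest contribution is kept
    return S

# Toy input (hypothetical): a single symmetric contact between residues 2 and 7.
C = [[0] * 10 for _ in range(10)]
C[2][7] = C[7][2] = 1
S = smooth_contacts(C)
```

With the stand-in weight, pairs one step away from the contact receive a value between 0 and 1, pairs two steps away receive a smaller value, and pairs three or more steps away remain at zero.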
Here, we calculate the contact patterns that are specific to a cluster. For this purpose, we averaged C over the entire segment library and, separately, over all segments in cluster u. From these averaged matrices we then defined a cluster-specific matrix for cluster u. The similarity between clusters u and v was measured using the following correlation coefficient:
where
The term in Eq. 5 is defined by setting u = v in Eq. 6, and the term by setting = 1. A large correlation coefficient indicates similar interresidue contact patterns between the clusters.
The coefficient is useful as a distance between clusters u and v in a multidimensional space. Consequently, the set of coefficients defines a multidimensional weighted graph (i.e., a weighted network). In this work, we must convert this weighted graph into an unweighted one to perform the community analysis, which handles only unweighted graphs. Therefore, we introduce an adjacency matrix whose element (u, v), a_{uv}, is given as follows.
The interresidue contact patterns are regarded as similar between clusters u and v (a_{uv} = 1) only when the correlation coefficient clears the threshold f_{0}. Herein, we set f_{0} to 0.7. The meaning of 0.7 is explained in the Results section.
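The conversion from a weighted to an unweighted graph can be illustrated with a short sketch. It assumes that the correlation coefficient is a normalized inner product of two cluster-specific matrices and that a link is assigned when the coefficient is at least f_{0}; the exact normalization of Eqs. 5 and 6 and the treatment of equality are assumptions here, as are the function names and the toy matrices.

```python
from math import sqrt

def inner(A, B):
    # Sum of element-wise products of two equally sized matrices
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def correlation(A, B):
    # Normalized inner product; returns 0.0 if either matrix is all zeros
    denom = sqrt(inner(A, A) * inner(B, B))
    return inner(A, B) / denom if denom > 0 else 0.0

def adjacency(patterns, f0=0.7):
    """a[u][v] = 1 when clusters u and v have a correlation of at least f0."""
    k = len(patterns)
    a = [[0] * k for _ in range(k)]
    for u in range(k):
        for v in range(u + 1, k):
            if correlation(patterns[u], patterns[v]) >= f0:
                a[u][v] = a[v][u] = 1
    return a

# Toy cluster-specific matrices (hypothetical values).
patterns = [
    [[0.0, 0.5], [0.5, 0.0]],
    [[0.0, 0.4], [0.6, 0.0]],
    [[0.5, 0.0], [0.0, -0.5]],
]
a = adjacency(patterns)
```

In the toy example, the first two patterns are nearly parallel and are linked, whereas the third is orthogonal to both and remains isolated.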
We next assessed the intracluster similarity. First, we defined a matrix ΔC_{s} for a segment s, where element (i, j) of ΔC_{s} is denoted as ΔC_{s}(i, j). Then, we averaged ΔC_{s}(i, j) over the segments in cluster u:
We define a matrix G_{u} whose element (i, j) is g_{u}(i, j). Then, we calculated the correlation coefficient f(G_{u}, ΔC_{s}) between G_{u} and ΔC_{s} for each segment in cluster u, using the same definition as in Eq. 5. Subsequently, we averaged f(G_{u}, ΔC_{s}) over the segments in cluster u to obtain <f>_{u}. This quantity measures the similarity of the interresidue contact patterns among the segments in cluster u. Finally, <f>_{u} was averaged over all clusters.
The larger this overall average, the more similar the interresidue contact patterns within each cluster are, on average.
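The intracluster measure can be sketched in the same spirit. The sketch assumes that ΔC_{s} is the deviation of C_{s} from the library-averaged matrix, that the correlation is a normalized inner product, and that the final average over clusters is unweighted; none of these details is fixed by the text above, and all names and toy matrices are hypothetical.

```python
from math import sqrt

def inner(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def correlation(A, B):
    denom = sqrt(inner(A, A) * inner(B, B))
    return inner(A, B) / denom if denom > 0 else 0.0

def mean_matrix(mats):
    # Element-wise mean of a list of equally sized matrices
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(len(mats[0][0]))]
            for i in range(len(mats[0]))]

def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def intracluster_similarity(clusters):
    """clusters: list of clusters, each a list of contact matrices C_s.

    Returns (average of <f>_u over clusters, list of <f>_u per cluster).
    """
    library_mean = mean_matrix([C for cluster in clusters for C in cluster])
    per_cluster = []
    for cluster in clusters:
        deltas = [subtract(C, library_mean) for C in cluster]  # assumed Delta C_s
        G_u = mean_matrix(deltas)                               # g_u(i, j)
        per_cluster.append(sum(correlation(G_u, d) for d in deltas) / len(deltas))
    return sum(per_cluster) / len(per_cluster), per_cluster

# Toy library (hypothetical): two perfectly homogeneous clusters.
A = [[0, 1], [1, 0]]
B = [[0, 0], [0, 0]]
overall, per_cluster = intracluster_similarity([[A, A], [B, B]])
```

Because every segment in each toy cluster is identical, each <f>_{u} equals 1, the upper bound of the measure; heterogeneous clusters would give smaller values.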
Construction of a universe and network
We constructed a distribution (i.e., a fold universe) of the K_{c} clusters in a three-dimensional (3D) conformational space by embedding the clusters into 3D. Details are presented in Additional file 1. As explained in the Introduction, lowering the dimensionality of the space hides the internal architecture of the fold universe. To restore the full-dimensional information to the 3D distribution, links were assigned between clusters with similar interresidue contact patterns (a_{uv} = 1). The generated networks were subjected to the modularity analysis described in the next subsection.
Modularity analysis
To investigate the properties of the cluster network, we divided the network into communities (i.e., subnetworks) using an efficient method [44]. An example of a network is presented in Figure 14, where two communities (Com 1 and Com 2) exist. The modularity Q_{mod} is an index to assess how well the network is divided into communities [49]:
Q_{mod} = \sum_{w=1}^{N_{com}} [ I_{w}/I − (d_{w}/(2I))^{2} ]
where I_{w} is the number of links connecting clusters within a community w, N_{com} is the number of communities in the entire network, and I is the number of links in the entire network. The quantity d_{w} is called the "total degree", which is defined for each community as d_{w} = 2I_{w} + I_{wother}, where I_{wother} is the number of links connecting clusters in community w with clusters outside the community. Q_{mod} takes values between 0 and 1 and approaches 1 as the number of links connecting different communities decreases. For instance, the network in Figure 14A has Q_{mod} of 0.466 (I = 34, I_{1} = 18, I_{2} = 15, d_{1} = 37, and d_{2} = 31). That of Figure 14B has Q_{mod} of 0.388 (I = 37, I_{1} = 18, I_{2} = 15, d_{1} = 40, and d_{2} = 34). The two networks are equivalent except for the intercommunity links.
Figure 14. Two network types. Network (A) has larger modularity Q_{mod} than (B) does. Filled circles form a community (Com 1); open ones construct the other community (Com 2). Lines between circles represent links.
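The two worked values for Figure 14 can be checked numerically with a minimal sketch. It assumes the standard form of the modularity, the sum over communities of I_{w}/I − (d_{w}/2I)^2, which is consistent with the definitions of I_{w}, I, and d_{w} given above. The function name and argument layout are introduced here for illustration; the values I_{wother} = 1 (Figure 14A) and I_{wother} = 4 (Figure 14B) follow from d_{w} = 2I_{w} + I_{wother}.

```python
def modularity(intra_links, outgoing_links):
    """Modularity from per-community link counts.

    intra_links[w]    : I_w, links inside community w
    outgoing_links[w] : I_w,other, links from community w to other communities
    """
    # Each intercommunity link is counted once from each of its two ends.
    total_links = sum(intra_links) + sum(outgoing_links) / 2
    q = 0.0
    for i_w, i_w_other in zip(intra_links, outgoing_links):
        d_w = 2 * i_w + i_w_other  # total degree of community w
        q += i_w / total_links - (d_w / (2 * total_links)) ** 2
    return q

q_a = modularity([18, 15], [1, 1])  # Figure 14A: I = 34, d_1 = 37, d_2 = 31
q_b = modularity([18, 15], [4, 4])  # Figure 14B: I = 37, d_1 = 40, d_2 = 34
print(f"{q_a:.4f} {q_b:.4f}")
```

The sketch gives approximately 0.4667 and 0.3886, which agree with the quoted 0.466 and 0.388 when the third decimal is truncated rather than rounded.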
Characterization of communities by structural features
The manner of differentiating the communities is important. Herein, we characterize the communities by five biophysical structural features: the radius of gyration (R_{g}), the number of interresidue contacts (N_{contact}, with pairs of |i − j| < 3 removed), the number of α-helical residues (n_{α}), the number of β-strand residues (n_{β}), and the sum of n_{α} and n_{β} (i.e., n_{αβ} = n_{α} + n_{β}).
First, we calculated the five quantities for each segment. The secondary-structure assignment of each residue in a segment was done using the software available at the STRIDE web site http://webclu.bio.wzw.tum.de/stride/ [56]. Next, we averaged each of the five quantities over the segments in a community. We designate the averaged quantities in a community w as R_{g}(w), N_{contact}(w), n_{α}(w), n_{β}(w), and n_{αβ}(w). Then, we classified the communities into α, β, αβ, and randomly structured ones according to these quantities. Randomly structured communities are those with both R_{g}(w) > 14 Å and N_{contact}(w) < 100, or those with n_{αβ}(w) < 15. Among the remaining communities, α communities are those with n_{α}(w) > 0.7 × n_{αβ}(w). Among those still remaining, β communities are those with n_{β}(w) > 0.7 × n_{αβ}(w). The communities finally remaining are classified as αβ communities. Each segment in the αβ communities contains substantial amounts of both α helix and β strand.
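The classification rules can be written as a small decision function. This is an illustrative sketch: the function name, the label strings, and the example values are assumptions, the β rule is taken to mirror the α rule with n_{β} in place of n_{α}, and the random-structure rule is read as "(R_{g} > 14 Å and N_{contact} < 100) or n_{αβ} < 15".

```python
def classify_community(rg, n_contact, n_alpha, n_beta):
    """Classify a community from its averaged structural features.

    rg        : average radius of gyration in angstroms, R_g(w)
    n_contact : average number of interresidue contacts, N_contact(w)
    n_alpha   : average number of alpha-helical residues, n_alpha(w)
    n_beta    : average number of beta-strand residues, n_beta(w)
    """
    n_ab = n_alpha + n_beta
    # Rules are applied in order; later rules only see the remaining communities.
    if (rg > 14.0 and n_contact < 100) or n_ab < 15:
        return "random"
    if n_alpha > 0.7 * n_ab:
        return "alpha"
    if n_beta > 0.7 * n_ab:  # assumed mirror of the alpha rule
        return "beta"
    return "alpha-beta"

# Hypothetical community averages: (R_g, N_contact, n_alpha, n_beta)
examples = {
    "extended": (15.0, 80, 20, 2),
    "helical": (12.0, 150, 20, 2),
    "sheet": (12.0, 150, 2, 20),
    "mixed": (12.0, 150, 10, 10),
    "sparse": (12.0, 150, 5, 5),
}
labels = {name: classify_community(*values) for name, values in examples.items()}
```

The order of the tests matters: a community is checked for random structure first, and the α and β thresholds are applied only to communities that pass that filter.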
Authors' contributions
This study was conceived and carried out by JI, who also developed the main part of the methodology. YS participated in some analyses. IK participated in discussions. KT participated in the coordination of the study. He also helped to write the manuscript. JH participated in developing the methodology, designed the study, and wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
KI and JH were partly supported by BIRD of Japan Science and Technology Agency (JST). JH was also partly supported by New Energy and Industrial Technology Development Organization (NEDO).
References

1. Chothia C: Proteins. One thousand families for the molecular biologist. Nature 1992, 357:543-544.
2. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6:377-385.
3. Coulson AFW, Moult J: A unifold, mesofold, and superfold model of protein fold use. Proteins 2002, 46:61-71.
4. Liu X, Fan K, Wang W: The number of protein folds and their distribution over families in nature. Proteins 2004, 54:491-499.
5. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247:536-540.
6. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5:1093-1108.
7. Efimov AV: Structural trees for protein superfamilies. Proteins 1997, 28:241-260.
8. Holm L, Sander C: Mapping the protein universe. Science 1996, 273:595-602.
9. Dokholyan NV, Shakhnovich B, Shakhnovich EI: Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci USA 2002, 99:14132-14136.
10. Hou J, Sims GE, Zhang C, Kim SH: A global representation of the protein fold space. Proc Natl Acad Sci USA 2003, 100:2386-2390.
11. Hou J, Jun SR, Zhang C, Kim SH: Global mapping of the protein structure space and application in structure-based inference of protein function. Proc Natl Acad Sci USA 2005, 102:3651-3656.
12. Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:123-138.
13. Orengo CA, Flores TP, Taylor WR, Thornton JM: Identification and classification of protein fold families. Protein Eng 1993, 6:485-500.
14. Standley DM, Kinjo AR, Kinoshita K, Nakamura H: Protein structure databases with new web services for structural biology and biomedical research. Brief Bioinfo 2008, 9:276-285.
15. Takahashi K, Go N: Conformational classification of short backbone fragments in globular proteins and its use for coding backbone conformations. Biophys Chem 1993, 47:163-178.
16. Tomii K, Kanehisa M: Systematic detection of protein structural motifs. In Pattern discovery in biomolecular data. Edited by Wang JTL, Shapiro BA, Shasha D. New York: Oxford University Press; 1999:97-110.
17. Choi IG, Kwon J, Kim SH: Local feature frequency profile: A method to measure structural similarity in proteins. Proc Natl Acad Sci USA 2004, 101:3797-3802.
18. Ikeda K, Tomii K, Yokomizo T, Mitomo D, Maruyama K, Suzuki S, Higo J: Visualization of conformational distribution of short to medium size segments in globular proteins and identification of local structural motifs. Protein Sci 2005, 14:1253-1265.
19. Sawada Y, Honda S: Structural diversity of protein segments follows a power-law distribution. Biophys J 2006, 91:1213-1223.
20. Ikeda K, Hirokawa T, Higo H, Tomii K: Protein-segment universe exhibiting transitions at intermediate segment length in conformational subspaces. BMC Structural Biology 2008, 8:37.
21. Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268:209-225.
22. Bonneau R, Strauss CE, Rohl CA, Chivian D, Bradley P, Malmström L, Robertson T, Baker D: De novo prediction of three-dimensional structures for major protein families. J Mol Biol 2002, 322:65-78.
23. Chikenji G, Fujitsuka Y, Takada S: A reversible fragment assembly method for de novo protein structure prediction. J Chem Phys 2003, 119:6895-6903.
24. Jeong H, Mason SP, Barabási AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411:41-42.
25. Holme P, Huss M, Jeong H: Subnetwork hierarchies of biochemical pathways. Bioinformatics 2003, 19:532-538.
26. Guimerà R, Amaral LAN: Functional cartography of complex metabolic networks. Nature 2005, 433:895-900.
27. Palla G, Derényi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435:814-818.
28. Go N: Theoretical studies of protein folding. Annu Rev Biophys Bioeng 1983, 12:183-210.
29. Go N, Abe H: Randomness of the process of protein folding. Int J Pept Protein Res 1983, 22:622-632.
30. Wolynes PG, Onuchic JN, Thirumalai D: Navigating the folding routes. Science 1995, 267:1619-1620.
31. Galzitskaya OV, Finkelstein AV: A theoretical search for folding/unfolding nuclei in three-dimensional protein structures. Proc Natl Acad Sci USA 1999, 96:11229-11304.
32. Munoz V, Eaton WA: A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc Natl Acad Sci USA 1999, 96:11311-11316.
33. Shea JE, Brooks CL III: From folding theories to folding proteins: a review and assessment of simulation studies of protein folding and unfolding. Annu Rev Phys Chem 2001, 52:499-535.
34. Koga N, Takada S: Roles of native topology and chain-length scaling in protein folding: A simulation study with a Go-like model. J Mol Biol 2001, 313:171-180.
35. Makarov DE, Keller CA, Plaxco KW, Metiu H: How the folding rate constant of simple, single-domain proteins depends on the number of native contacts. Proc Natl Acad Sci USA 2002, 99:3535-3539.
36. Zhou HX: Theory for the rate of contact formation in a polymer chain with local conformational transitions. J Chem Phys 2003, 118:2010-2015.
37. Nakamura HK, Sasai M, Takano M: Scrutinizing the squeezed exponential kinetics observed in the folding simulation of an off-lattice Go-like protein model. Chem Phys 2004, 307:259-267.
38. Mitomo D, Nakamura HK, Ikeda K, Yamagishi A, Higo J: Transition state of a SH3 domain detected with principle component analysis and a charge-neutralized all-atom protein model. Proteins 2006, 64:883-894.
39. Ikebe J, Kamiya N, Shindo H, Nakamura H, Higo J: Conformational sampling of a 40-residue protein consisting of α and β secondary-structure elements in explicit solvent. Chem Phys Lett 2007, 443:364-368.
40. Kamiya N, Mitomo D, Shea JE, Higo J: Folding of the 25 residue Abeta(12–36) peptide in TFE/water: temperature-dependent transition from a funneled free-energy landscape to a rugged one. J Phys Chem B 2007, 111:5351-5356.
41. Baker D: A surprising simplicity to protein folding. Nature 2000, 405:39-42.
42. Kamagata K, Arai M, Kuwajima K: Unification of the folding mechanisms of non-two-state and two-state proteins. J Mol Biol 2004, 339:951-965.
43. Kamagata K, Kuwajima K: Surprisingly high correlation between early and late stages in non-two-state protein folding. J Mol Biol 2006, 357:1647-1654.
44. Newman MEJ: Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 2006, 74:036104.
45. Grant A, Lee D, Orengo C: Progress towards mapping the universe of protein folds. Genome Biology 2004, 5:107.
46. Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution. Nature 2002, 420:218-223.
47. Qian J, Luscombe NM, Gerstein M: Protein Family and Fold Occurrence in Genomes: Power-law Behaviour and Evolutionary Model. J Mol Biol 2001, 313:673-681.
48. Barabási AL, Albert R: Emergence of scaling in random networks. Science 1999, 286:509-512.
49. Newman MEJ, Girvan M: Fast algorithm for detecting community structure in networks. Phys Rev E 2004, 69:026113.
50. Kihara D, Skolnick J: The PDB is a covering set of small protein structures. J Mol Biol 2003, 334:793-802.
51. Crippen GM, Maiorov VN: How Many Protein Folding Motifs are There? J Mol Biol 1995, 252:144-151.
52. Soding J, Lupas AN: More than the sum of their parts: on the evolution of proteins from peptides. BioEssays 2003, 25:837-846.
53. Krishnadev O, Brinda KV, Vishveshwara S: A graph spectral analysis of the structural similarity of protein chains. Proteins 2005, 61:152-163.
54. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res 2000, 28:235-242.
55. Lloyd SP: Least squares quantization in PCM. IEEE Transactions on Information Theory 1982, 28:129-137.
56. Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23:566-579.