Connectivity independent protein-structure alignment: a hierarchical approach

1 Macromolecular Modeling Group, Institute of Chemistry and Biochemistry, FU Berlin, Takustrasse 6, 14195 Berlin, Germany
2 Computer Science Research, Zuse Institute Berlin, Takustrasse 7, 14195 Berlin, Germany
BMC Bioinformatics 2006, 7:510 doi:10.1186/1471-2105-7-510
The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/7/510
© 2006 Kolbeck et al; licensee BioMed Central Ltd.

Abstract

Background

Protein-structure alignment is a fundamental tool to study protein function, evolution and model building. In the last decade several methods for structure alignment were introduced, but most of them ignore that structurally similar proteins can share the same spatial arrangement of secondary structure elements (SSE) but differ in the underlying polypeptide chain connectivity (non-sequential SSE connectivity).

Results

We perform protein-structure alignment using a two-level hierarchical approach implemented in the program GANGSTA. On the first level, pair contacts and relative orientations between SSEs (i.e. α-helices and β-strands) are maximized with a genetic algorithm (GA). On the second level residue pair contacts from the best SSE alignments are optimized. We have tested the method on visually optimized structure alignments of protein pairs (pairwise mode) and for database scans. For a given protein structure, our method is able to detect significant structural similarity of functionally important folds with non-sequential SSE connectivity. The performance for structure alignments with strictly sequential SSE connectivity is comparable to that of other structure alignment methods.

Conclusion

As demonstrated for several applications, GANGSTA finds meaningful protein-structure alignments independent of the SSE connectivity. GANGSTA is able to detect structural similarity of protein folds that are assigned to different superfamilies but nevertheless possess similar structures and perform related functions, even if these proteins differ in SSE connectivity.

Background

Protein-structure alignment remains a great challenge in structural genomics and constitutes an important tool for applications in protein classification, protein-structure prediction, drug design and studies of evolutionary relationships.
The number of known three-dimensional protein structures determined by NMR spectroscopy and X-ray crystallography is increasing rapidly. At the end of the year 2005 the Protein Data Bank (PDB) [1] contained more than 33,000 protein structures. Hence, efficient methods to detect structural similarity among different proteins, analogous to methods for sequence alignment, are clearly needed. The pairwise protein-structure alignment problem can be defined as the task of identifying maximal common substructures of two proteins according to a given similarity measure. Algorithms solving this problem use different representations of protein structures. GRATH [2], SSM [3], TOP [4], TOPS [5], MATRAS [6], PROTEP [7] and VAST [8] work on the protein secondary-structure level only. Such a secondary-structure representation is also used for index-based database searches [9,10]. DALI [11], CE [12], SSAP [13], FASE [14] and SCALI [15] work on the residue level or on a combination of secondary-structure and residue level. Another approach employs methods derived from computer vision to compare 3D models [16]. TOPSCAN [17] uses topology string representations. Other approaches even tackle the problem of aligning multiple structures [18,19]. Theoretical work characterizing protein architectures focused initially on pure β-strand proteins [20-23] or on pure α-helical proteins [24]. Proteins of mixed topology containing α-helices and β-strands were considered as undirected labeled graphs [20,25-27]. Useful comparisons of three-dimensional protein structures require a structure-similarity score that is transferable to biological and chemical classifications reflecting protein architectures. Several measures for protein-structure similarity have been proposed. The root mean square deviation (RMSD) of equivalent atom positions of a protein pair [28] is widely used.
Other similarity measures are distance map similarity [29] and contact map overlap (CMO) [30,31], which employ residue pair distances and contacts, respectively. CMO is based on the notion of contacts between two residues. A contact map captures a 3D structure in condensed form, representing the 3D protein conformation as a symmetrical, square, Boolean matrix of contacts. Such contact maps are also used as basic information to elucidate protein structures from NMR spectroscopy [32]. Although they simplify the description of protein structures, they still provide all necessary information to reconstruct a protein structure with sufficiently high accuracy. In the CMO approach, the protein-structure alignment problem is solved by analyzing the similarity of their contact maps. CMO-based structure alignment was introduced by Godzik and Skolnick [30] and was proved to be NP-hard by Goldman et al. [33]. However, Caprara et al. [34] succeeded with integer programming to get solutions for maximum CMO of protein-structure pairs in reasonable CPU times. Nevertheless, the protein-structure alignment problem is computationally hard to solve. To reduce the computational burden of protein-structure alignment connected with direct use of pairwise-residue assignment, we employ in the present study a hierarchical approach. On the first level of the hierarchy, protein-structure alignment of SSEs is performed. On the second level, solutions for the CMO are searched on the residue level. In analogy to protein sequence alignment, structure alignment methods can work with either a global or a local strategy. Global strategies start from whole structures and remove poorly matched parts of the structure. In contrast, local strategies start from small matching units and attempt to enlarge and merge these. The presented method (GANGSTA: Genetic Algorithm for Non-sequential, Gapped protein STructure Alignment) uses a global strategy. 
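The contact map overlap described above can be illustrated with a small Python sketch. It only evaluates the overlap for one fixed residue alignment; it does not search for the best alignment and is not GANGSTA's optimization procedure. The function name `contact_map_overlap` and the toy contact maps are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def contact_map_overlap(map_a, map_b, alignment):
    """Count contacts of structure A that are mapped onto contacts of B.

    map_a, map_b: symmetric Boolean contact matrices (lists of lists).
    alignment: dict mapping residue indices of A to residue indices of B.
    """
    overlap = 0
    # Visit every pair of aligned residues of A exactly once.
    for i, j in combinations(sorted(alignment), 2):
        if map_a[i][j] and map_b[alignment[i]][alignment[j]]:
            overlap += 1
    return overlap


# Toy contact maps (hypothetical data, not taken from any protein).
map_a = [[0, 1, 1],
         [1, 0, 0],
         [1, 0, 0]]
map_b = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 0],
         [1, 0, 0, 0]]

# The alignment need not preserve the residue order of the chain.
alignment = {0: 1, 1: 0, 2: 2}
print(contact_map_overlap(map_a, map_b, alignment))
```

Because the alignment is passed in as an arbitrary mapping, the same function scores order-preserving and permuted assignments alike; finding the mapping with maximum overlap is the hard part of the problem.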
Protein architectures are essentially defined by the spatial arrangement of α-helices and β-strands (SSEs). These SSEs generally form the central part of protein structures, while loop, turn and coil structures are more irregular and preferentially localized on the protein surfaces. Furthermore, the composition and arrangement of α-helices and β-strands are evolutionarily highly conserved, in contrast to the conformations of loops, turns and coils connecting these SSEs. Hence, restriction to these SSEs is advantageous for structure comparison, since it focuses on the regular parts of the structure, which can be characterized more compactly, and thereby facilitates recognition of structural similarity. GANGSTA considers only these regularly structured SSEs, which greatly reduces the complexity of the protein-structure alignment problem and facilitates structure alignments with non-sequential SSE connectivity. It is widely assumed that similar protein structures can be aligned while the SSE connectivity in the polypeptide chain (sequential SSE connectivity) is conserved. Nevertheless, a considerable number of proteins possess different SSE connectivity but share the same architecture (i.e. the same spatial arrangement of SSEs: see Yuan et al. [15] for a detailed list). It has been shown that permuted SSE alignments (i.e. alignments with non-sequential SSE connectivity) occur often [35]. Structure alignments involving proteins of different SSE connectivities pose a computational challenge. Using protein representations in terms of graphs on the secondary-structure level, we can describe structure alignment as a search for the maximum common subgraph [7,20,26], a problem that is known to be NP-complete. Therefore, we decided to use a genetic algorithm (GA) to perform connectivity-independent alignments on the SSE level, since evolutionary algorithms provide reasonable strategies to solve NP-complete problems [36].
GAs have been used previously for structural alignment [37-40] and for detecting appropriate structure templates in homology modeling [41]. Only a few methods, such as SARF [42,43], K2 [37,38], MASS [19,44] or SCALI [15], can align protein-structure fragments in non-sequential order. However, none of these methods optimizes the matching of protein graphs. GANGSTA was developed to produce high quality global protein-structure alignments independent of SSE connectivity by optimizing the contact map overlap. The method can be used for pairwise protein-structure alignment or fast database searches with respect to a specific protein structure and is available through a web server [45]. For the case of pairwise structure alignment, we provide a statistical significance related to our similarity measure in the form of a P-value, the probability that a better score can be reached by structure alignment of unrelated proteins. The performance of GANGSTA was assessed in pairwise structure alignments and database scans with sequential and non-sequential SSE connectivity. We show GANGSTA's ability to detect structural similarity of protein folds that are assigned to different superfamilies but nevertheless possess similar structures and perform related functions, even if these proteins differ in SSE connectivity.

Results

Protein-structure alignment with GANGSTA: an example

To demonstrate the capability of GANGSTA to find protein structures with different SSE connectivities, we consider the structure alignment of the two SCOP domains 2uagA1 and 1gkuB1. In CATH [46] these protein domains correspond to 2uagA01 and 1gkuB02, respectively. The naming convention for protein domains in SCOP and CATH is as follows: {pdb_id|CHAIN_id|domain_id}, for instance {2uag|A|01}. They share the same protein-structure class (alpha/beta) but belong to different fold and superfamily categories in SCOP. Both structures have an incomplete Rossmann structure motif [47] in common.
The Rossmann structure motif is ubiquitous in the universe of protein structures. It occurs with different SSE connectivities and comprises four α-helices and four β-strands. In the incomplete Rossmann structure motif one dangling α-helix is missing. Generally, it serves as a device for binding functionally relevant cofactors, such as nucleotide di(tri)phosphates and flavins. In the SCOP classification scheme [48], the polypeptides 2uagA1 and 1gkuB1 belong to the folds "MurCD N-terminal domain" and "P-loop containing nucleoside triphosphate hydrolase", respectively. In CATH [46], these two polypeptides are classified in the homologous superfamilies "NAD(P)-binding with Rossmann-like domain" and "P-loop containing nucleoside triphosphate hydrolase", respectively. Both proteins share the same level of CATH topology "Rossmann-fold". In the pairwise structure alignment mode the smaller protein structure (source) is superimposed on the larger protein structure (target). In the target structure only the SSEs useful for the alignment are considered, while the omission of an SSE in the source structure (introducing a gap) is penalized (see method section). Fig. 1 shows the result of the GANGSTA structure alignment for the two polypeptides as superposition of aligned SSEs. Table 1 summarizes results obtained from the pairwise structure alignment of the complete set of SSEs of source structure 2uagA1 on the target structure 1gkuB1. Although the two protein domains possess different SSE connectivities, GANGSTA was able to align them with a good P-value (below 0.05 corresponding to a confidence level of 95%, see methods section) considering all SSEs of the source structure (i.e. introducing no SSE gaps).
Table 1. Summary of structure alignment of 2uagA1 (source) and 1gkuB1

Significance of the GANGSTA score for pairwise structure alignments

One important application of protein-structure alignment is to find out whether a pair of proteins is structurally or evolutionarily related. SCOP or CATH databases are often used for such a classification task. Whether the similarity measure employed in GANGSTA (GANGSTA score) is suitable to assign two protein structures to the same SCOP superfamily was tested by a statistical study similar to the one described in [14]. For that purpose, we performed structure alignments of 4892 protein domain pairs where both proteins belong to the same SCOP superfamily (dataset SAME_SF40) and 88909 structure alignments of domain pairs where both proteins belong to different SCOP superfamilies (dataset DIFF_SF40). The two datasets are explained in more detail in the method section. For the protein-structure alignments from both datasets the distributions of GANGSTA scores are shown in Fig. 2. A Gumbel distribution was fitted to the GANGSTA score distribution of the DIFF_SF40 dataset with mean μ = 27.938 and standard deviation σ = 12.746 [see eqs. (18) and (19)], as described in the method section. According to Fig. 2, the distributions of GANGSTA scores of the two datasets overlap partially. Hence, it is not possible to conclude reliably from the similarity of two protein structures that they belong to the same superfamily of proteins. The ability of the GANGSTA score to discriminate between related and non-related protein structures can be illustrated as a coverage versus error rate plot, as shown in Fig. S2 of Supplement Data [see 1] [14,49], evaluated according to Ortiz et al. [50]. In short, the coverage is the ratio of true positives at a given P-value threshold, while the error rate defines the number of false positives for that threshold.
In the above application, GANGSTA is able to detect 48% and 67% of the SCOP superfamily relationships at a confidence level of 99% and 95%, respectively.

Additional file 1. Supplement Data. Microsoft Word Document. Contains additional information on implementation details of the algorithm, additions to the analysis of the significance of the GANGSTA score, detailed datasets used for our tests, detailed listing of the clusters found for the Rossmann structure motif. Format: DOC. Size: 1.9MB.
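The coverage and error-rate quantities used in the significance analysis above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: coverage is taken as the fraction of related pairs with a P-value below the threshold, and the error count as the number of unrelated pairs below it. The function name `coverage_and_errors` and the P-values are hypothetical toy data, not results from the study.

```python
def coverage_and_errors(p_related, p_unrelated, threshold):
    """Return (coverage, false positives) for one P-value threshold.

    p_related: P-values of pairs known to be related (e.g. same superfamily).
    p_unrelated: P-values of pairs known to be unrelated.
    A pair counts as a hit when its P-value lies below the threshold
    (assumption: strict inequality).
    """
    true_pos = sum(p < threshold for p in p_related)
    false_pos = sum(p < threshold for p in p_unrelated)
    coverage = true_pos / len(p_related)
    return coverage, false_pos


# Hypothetical toy P-values, for illustration only.
p_related = [0.001, 0.03, 0.2, 0.6]
p_unrelated = [0.04, 0.3, 0.5, 0.9, 0.7]

# A 95% confidence level corresponds to a P-value threshold of 0.05.
for threshold in (0.01, 0.05):
    print(threshold, coverage_and_errors(p_related, p_unrelated, threshold))
```

Sweeping the threshold over a range of values and plotting coverage against the error count gives a curve of the kind referred to in the text.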
GANGSTA structure alignments with non-sequential SSE connectivity

We studied the performance of GANGSTA for alignment of protein structures with non-sequential SSE connectivity that are known from literature (example: the four helix bundles or the TRAF-domain-like fold studied in [19,44]). Additionally, we show significant alignments of protein structures with non-sequential SSE connectivity involving the Rossmann and Rossmann-like structure motif according to classifications in SCOP or CATH. All comparisons were done in the pairwise structure alignment mode using Stride [51] for SSE assignment.

Four-helix bundles

As a first application we selected the protein domain 2hmzA as reference structure for four-helix-bundles and aligned it pairwise with the nine other protein domains from the Four-Helix-Bundle dataset (see method section for details; results are given in Fig. 3). For all pairwise alignments the SSE assignment (relative to the reference structure), the GANGSTA score, P-value, and RMSD are listed. GANGSTA was able to align all structures within 95% confidence level. Only three protein domains (1le2A, 1aep, 1flx) were not aligned within 99% confidence level and they all contain alignment gaps (i.e. some SSEs of the source structure were not aligned). All structure alignments were superimposed with an RMSD smaller than 3.5 Å. It is noteworthy that only the alignment of 256a with the reference structure 2hmzA is optimal with sequential SSE connectivity. Fig. 4 shows the structural superposition of the two protein domains 2hmzA and 3inkC.
β-strands of TRAF immunoglobulin dataset

We aligned the reference structure 1frtB against all other seven domains of the TRAF dataset (see method section for details; results are shown in Fig. 5). GANGSTA was able to align six of the seven proteins within 95% confidence threshold. Only protein domain 1k2fA could not be aligned with a significant P-value (0.2774). This is the only protein in the TRAF dataset that could be aligned to the reference structure only if two gaps are introduced in the 1k2fA structure, resulting in a structure superposition with 4.3 Å RMSD. For all other structures the corresponding RMSDs are smaller than 2.7 Å. All structure alignments of 1frtB with proteins from different families possess different SSE connectivities. Only the structure alignments with members of the same family as the reference structure (1bmg, 1igtA, 1k8iA) possess the same SSE connectivity. Fig. 6 shows the superposition of 1frtB with 1czyA (left) and with 1kzzA (right), two proteins from SCOP superfamilies that differ from that of 1frtB. Both alignments are non-sequential in SSE connectivity relative to 1frtB.
Rossmann structure motif

Here we consider a sufficiently complex and widespread structure motif, the Rossmann structure motif [47], which was first identified in dinucleotide-binding proteins. We used the SCOP domain 2uagA1 as reference structure and the Rossmann dataset (see method section for details) as target structures. Six of the seven proteins are classified as Rossmann-fold at the CATH topology level; the exception is 1dhs, which is classified as Rossmann-fold in SCOP. The results are shown in Fig. 7. GANGSTA was able to align all proteins with the reference structure 2uagA1 within the 99% confidence level. All alignments were non-sequential with respect to the SSE connectivity of the reference structure, and all superpositions could be made with RMSD smaller than 4.2 Å.
Robustness of GANGSTA

The following tests are made to demonstrate the robustness of GANGSTA against variations in structure and SSE connectivity for a given fold motif. These tests also demonstrate that GANGSTA can retrieve approximately the same set of target structures when different source structures of the same motif are used. This symmetric behavior is an important feature that makes GANGSTA suitable for database scans. To assess these features we performed the following tasks.

1. A database scan with 2uagA1 (incomplete Rossmann structure motif) as reference structure (source) was conducted. The results were clustered according to the SSE connectivity pattern with respect to the reference structure.
2. Two new reference structures (reduced structures) were generated by considering only the aligned SSEs of 1dhs and 1cjcA2, which represent the incomplete Rossmann structure motif. These proteins belong to the two largest clusters containing structures of different SSE connectivity (with respect to the Rossmann structure motif) that were obtained from the preceding database scan. The corresponding reduced structures were used as reference (source) structures for two additional database scans.
3. The results obtained from all three database scans (tasks 1 and 2) were compared to determine whether GANGSTA is able to retrieve the same set of similar structures despite variations in the reference structures.

Since GANGSTA returns all alignments for a database scan, we used a cutoff of 0.15 for the GANGSTA score [see eq. (10)]. This cutoff corresponds to a P-value of 0.127, giving a significance level of 87% for pairwise structure alignment. Since we are using an estimated contact overlap q [eq. (9)] for database scans, this P-value is only an upper bound on the real structure similarity and serves only as selection criterion sufficient for this experiment. For the first task, the top 100 structure alignments with non-sequential SSE connectivity were monitored.
The last of these structures was found at rank position 154. We considered in the following all aligned structures (with sequential and non-sequential SSE connectivity) of a rank lower than or equal to 154. From this set of 154 structures, the alignments involving gaps in the source structure were omitted. This yields a total of 135 structures (result set 1) that can be aligned with a significance level of 99% or higher against the reference structure (see Table 2). These 135 protein structures were then grouped into clusters of equal SSE connectivity with respect to the reference structure (i.e. the incomplete Rossmann structure motif 2uagA1). This results in 44 different clusters [see 1, Table S8]. For the subsequent two tasks, we considered the largest two clusters (with 11 and 9 members, respectively) of structures with SSE connectivities different from the reference structure 2uagA1 (result set 1).

Table 2. Structure alignments versus 2uagA1

For the second task, we chose two representative incomplete Rossmann structure motifs from the two largest clusters with non-sequential SSE connectivity: 1dhs and 1cjcA2. Since both structures are larger than 2uagA1, we reduced them to the aligned SSEs: 1dhs(98–123, 277–320, 328–357) and 1cjcA2(8–26, 30–37, 61–71, 78–100, 360–370), respectively. We performed a database scan with these two reduced reference structures to obtain result sets 2 and 3. Finally (third task), we compared the structures from the two largest clusters in result set 1 (column 2 in Fig. 8) with those retrieved by alignment with either 1dhs or 1cjcA2 or with both. Fig. 8 lists the results of this assessment. From 73 structures found with 2uagA1 as reference structure we retrieved 35 with 1dhs and 29 with 1cjcA2 as reference structure (column 2 of Fig. 8).
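The clustering by SSE connectivity used in the first task can be sketched as follows. This is an illustrative reading rather than the authors' implementation: each gap-free alignment is assumed to be given as a list that holds, for every source SSE in chain order, the chain position of the target SSE it is aligned to, and the connectivity pattern is taken to be the rank order of those positions. The names `connectivity_pattern`, `is_sequential` and `cluster_by_connectivity`, as well as the domain labels, are hypothetical.

```python
from collections import defaultdict


def connectivity_pattern(assignment):
    """Rank order of the aligned target SSEs, read along the source chain.

    assignment[k] is the chain position of the target SSE aligned to the
    k-th source SSE (assumed one-to-one and without gaps).
    """
    order = sorted(assignment)
    return tuple(order.index(t) for t in assignment)


def is_sequential(assignment):
    """True if source and target SSEs occur in the same chain order."""
    pattern = connectivity_pattern(assignment)
    return pattern == tuple(range(len(assignment)))


def cluster_by_connectivity(alignments):
    """Group target domains that share the same connectivity pattern."""
    clusters = defaultdict(list)
    for domain, assignment in alignments.items():
        clusters[connectivity_pattern(assignment)].append(domain)
    return dict(clusters)


# Hypothetical alignments of a four-SSE source against four targets.
alignments = {
    "dom_a": [0, 1, 2, 3],
    "dom_b": [2, 0, 1, 3],
    "dom_c": [5, 1, 3, 8],
    "dom_d": [4, 6, 7, 9],
}
for pattern, members in cluster_by_connectivity(alignments).items():
    print(pattern, members)
```

Using rank order rather than raw positions means that targets with additional unaligned SSEs still fall into the same cluster as long as the aligned SSEs are permuted in the same way.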
Protein-structure alignment tasks with sequential SSE connectivity

Most programs or servers for protein-structure alignment deal with sequential structure alignments only, and most of the known curated structure alignments or benchmark sets for structure alignment are constructed to test methods preserving the sequential SSE connectivity. To obtain a more representative comparison with other alignment methods we tested the performance of GANGSTA for structure alignments with exclusively sequential SSE connectivity. The two structure alignment tasks we conducted here complement the evaluation of web-based programs and servers for structure alignment applied in recent performance tests by Novotny et al. [52,53]. The authors identified protein structures as true positives (i.e., structures that are known to have an architecture similar to the reference structure) by using CATH classification [46]. The various servers evaluated in [53] all use different databases and scoring systems, so simple counting of true positives was not feasible. Therefore, we used a simple binary scoring system: at least one true positive either was or was not found in the list of significant hits. For servers that did not indicate the significance of the hits, up to 100 hits were examined. This was done for GANGSTA as well, see below. A true positive was defined on the topology level in the CATH classification scheme. Each reference structure was submitted to all servers evaluated in the Novotny study, and it was determined whether any of the structures, other than the reference structure, were found as true positive. To have a similar test scenario, we decided to reproduce these structure alignment tasks using the database scan version of GANGSTA. All database scans were performed using DSSP for SSE assignment. We used the GANGSTA score [eq. (10)] to rank the resulting structure alignments.
However, no P-value could be evaluated, since for database scans GANGSTA calculates only an estimated contact map overlap q [eq. (9)] to increase the computational performance (see method section). The first task was based on a selection of protein domains (Novotny dataset, see method section) belonging to four different CATH classes (mainly-α, mainly-β, mixed α-β, few SSEs) as used in [53]. Proteins from the fourth CATH class (few SSEs) have only low secondary structure content and thus few SSE contacts. Since GANGSTA considers α-helices and β-strands only, we tested it only on those proteins of the Novotny dataset (reduced Novotny dataset) belonging to CATH classes mainly-α, mainly-β and mixed-α-β. This resulted in 53 reference proteins [see 1, Table S5]. The results of the structure alignment with GANGSTA and 11 other methods are shown in Table 3. Except for the data obtained with GANGSTA all data were taken from the literature [52,53]. Average performances differ slightly from the literature values, since the structures with low secondary structure content were omitted. In analogy to the preceding investigations on the Novotny dataset [53] the assignment of a reference structure was successful with GANGSTA, if at least one target with appropriate CATH topology was found among the top 100 ranked protein domains. GANGSTA was able to detect true positives for 52 of all 53 reference structures (98%) of the reduced Novotny dataset except for the mainly-α protein 1c3u. This protein had been moved to another topology in more recent CATH versions [46] (Table 3 and [1, Table S5]), so we could not compare the GANGSTA results to results listed for other methods. Hence, GANGSTA reaches the best result possible for the reduced Novotny dataset. Table 3. 
Comparison of different structure alignment methods for three structure classes according to CATH [46]

The second task considers a database search with eleven pairs of structures from the Fischer dataset (see method section for details) that were previously considered to be difficult structure alignment cases [8]. Following Novotny et al. [53], a search was considered to be successful if, for a reference structure, the defined result structure or a homologous structure was found. Again, true positives were searched among the top 100 ranked targets from structure alignment. GANGSTA was able to find appropriate result structures for each of the eleven protein pairs (see Table 4 for more details). Seven results were found at the top 1 position, eight within the top 10, and all within the top 50 ranked structures. Hence, in this test GANGSTA outperforms DALI and CE, which both found ten out of eleven possible structure pairs [53].

Table 4. Results for the Fischer dataset.

Implementation

The GANGSTA structure-alignment method is implemented in C++, in a first version for UNIX systems only. It is available as a web application at [45]. The user can perform pairwise structure alignments or database searches against a library of 3D structures. The database in use is the SCOP40 (see method section, databases). The assignment of secondary structure can be done with DSSP [28], Stride [51] or according to the HELIX/SHEET records in PDB [1] files. In Table S1 [see 1] the runtimes for some exemplary pairwise structure alignments and database searches are shown. All calculations were done on a Linux AMD Opteron 242 system, using one thread for the entire program including all initializations.

Discussion

We have tested GANGSTA on different datasets to assess its performance for challenging tasks in protein-structure alignment.
These include (1) classification of protein superfamilies, (2) searching for structure alignments with non-sequential SSE connectivity, (3) testing robustness against structural variation, and (4) comparison with other methods considering datasets of protein structures that require sequential SSE connectivity. We could show that for structure alignments from different SCOP superfamilies the distribution of GANGSTA scores follows the well known Gumbel distribution. The same distribution was reported by Levitt and Gerstein [54] and for MAMMOTH [50] and FASE [14], which use different measures of structural similarity and different optimization algorithms. The discrimination between structurally related and non-related proteins is comparable with other methods: at a confidence level of 95% (99%), GANGSTA found 67% (48%) true positives, as pictured in the coverage-error plot in Fig. S1 [see 1]. At a confidence level of 99% PrISM [49] reported 54% and MAMMOTH [50] 50% true positives. At a confidence level of 95% MAMMOTH reported 60% and FASE [14] 72% true positives. In contrast to these studies GANGSTA reports the P-value for SCOP superfamily classification instead of SCOP fold classification. This test is more demanding, since protein structures may share the same SCOP fold but belong to different SCOP superfamilies. Generally, protein-structure alignments are validated using classification schemes that discriminate according to specified criteria between related and unrelated structures. For this purpose most studies use the CATH or SCOP database of classified proteins. However, these databases were also generated with specific classification criteria, which naturally may build in biases. This adds to the difficulties of fairly comparing different methods of protein-structure alignment. Kolodny et al. [55] showed that comparisons based on receiver operating characteristic (ROC) curves are often unsatisfactory with respect to the quality of protein-structure alignment.
So far, the best insight into the quality of a protein-structure alignment can be obtained by visual inspection. This depends on the structural and functional features upon which the viewer focuses and is obviously subjective in nature. Protein-structure alignments from different SCOP families and superfamilies have demonstrated that GANGSTA is able to find reasonable structure alignments that may provide new insights for structure-function relationships of proteins and also for structural motifs that occur with different SSE connectivities. The results for the Rossmann dataset demonstrate that GANGSTA finds structural similarities for proteins that are known to have similar function but no obvious structural or sequence similarity. The Rossmann structure motifs are ubiquitous, appearing in the large enzyme family of kinases [56] that catalyze the transfer of phosphate groups. In these proteins, the Rossmann structure motif constitutes just a small fraction of the whole structure, which can differ significantly in the remaining part of the structure. Hence, SCOP classifies these proteins in different superfamilies, such as MurCD N-terminal domain, FAD/NAD(P)-binding domain, HAD-like, NAD(P)-binding Rossmann-fold domains, DHS-like NAD/FAD-binding domain, UDP-Glycosyltransferase/glycogen phosphorylase, and Flavodoxin-like. The structural similarity found by GANGSTA hints at functional similarity in nucleotide binding. GANGSTA is able to detect the structural similarity of those proteins despite their topological differences with respect to SSE connectivity. Protein structures with different SSE connectivity often exhibit large structural variations in terms of RMSD, but can simultaneously have large contact overlaps and a GANGSTA score [eq. (10)] close to zero, corresponding to a high quality structure alignment.
In a test for robustness of GANGSTA the incomplete Rossmann structure motif 2uagA1 could be retrieved with database scans using 1dhs or 1cjcA2 as reference structure. The robustness also demonstrates the symmetric behavior of GANGSTA. From the 73 structures found with 2uagA1 from the largest three clusters (column 2 in Fig. 8) 35 (columns 3+4) and 29 (columns 3+5) were retrieved by structure alignment on database scans with the Rossmann structure motif taken from the structures 1dhs and 1cjcA2, respectively, although there are large variations in the Rossmann structure motif of these three reference structures (see RMSD in Table 2). The fact that variation in SSE connectivity did not influence the retrieval of similar structures is not surprising, since GANGSTA considers the SSEs as independent secondary structure elements and disregards the connecting polypeptide loops. Analogous to a recent study [15], GANGSTA found different clusters of protein domains with different SSE connectivities for the Rossmann structure motif. Among these aligned structures with non-sequential SSE connectivities are protein domains belonging to different CATH [46] topology levels or different SCOP [48] fold levels. Hence, GANGSTA is able to recognize structure similarities of protein domains that share the same CATH architecture but belong to different CATH topologies. Thus, GANGSTA may be useful to classify protein structure domains. Rossmann structure motifs with different SSE connectivities carry out similar functions, which is a clear example of convergent evolution. The fact that protein function can be correlated with CATH architecture rather than the more detailed CATH topology is an interesting observation.
Although GANGSTA was designed and implemented specifically to find unusual protein-structure alignments with non-sequential SSE connectivity that are hard to detect, we could show that even for sequential SSE connectivity GANGSTA is able to compete with other established protein-structure alignment methods like DALI [11], VAST [8], YAKUSA [52], and CE [12]. In terms of the number of aligned residues and the overall RMSD, individual pairwise protein-structure alignments with GANGSTA are generally somewhat inferior to those obtained with other methods. However, in the less precise database scan mode, GANGSTA outperforms structure-alignment methods that consider sequential SSE connectivity only.
Conclusion
GANGSTA is able to find meaningful protein-structure alignments independent of the SSE connectivity. The RMSD is often used as a similarity measure for structure alignment. We could show that functionally related protein domains can have large structural variations in terms of RMSD. The contact map overlap (CMO) and the newly introduced GANGSTA score [eq. (10)] can identify structures with different SSE connectivity that are not detectable by methods maintaining SSE connectivity. Structure-alignment methods that consider the geometry of the loops connecting the regularly structured SSEs (α-helices and β-strands) in a protein have a strong bias toward sequential SSE connectivity. Hence, these methods have difficulty finding structure alignments that are non-sequential in SSE connectivity. Even if a protein fold cannot be aligned to another protein structure while maintaining the SSE connectivity, structural similarity may still exist for different SSE connectivities despite a large RMSD. GANGSTA tends to align large fold motifs regardless of SSE connectivity. This is due to the following features. (1) GANGSTA does not optimize distances between residue pairs, but maximizes the number of residue pair contacts. (2) The number of gaps (i.e. 
the number of unaligned SSEs in the source structure) is restricted to make sure that a maximum number of SSEs, and consequently also of residues, are aligned. (3) GANGSTA ignores loop structures, which helps to find structure alignments that are non-sequential in SSE connectivity.
Methods
General scope of method
For the protein-structure alignment problem, we call the smaller of the two protein structures the source structure and the larger the target structure. To increase the flexibility of structure alignment we allow, in analogy with sequence alignment, gaps in the source structure. These gaps are assigned a penalty to ensure a global alignment. Thus, not all SSEs of the source structure are explicitly aligned. Gaps in the target structure occur naturally and are not subject to a penalty, since at most the number of SSEs in the source structure can be aligned. Note that no gaps are allowed within SSEs. In the present approach, protein structures are modeled graph-theoretically as contact maps on two hierarchical abstraction levels. On the residue level, the structure of a polypeptide chain with N residues can be represented by an (N × N) matrix C of residue-pair contacts, where Cij is 1 if there is a contact between residues i and j and 0 otherwise. Residue-pair contacts can be defined in different ways. One definition is based on the shortest distance between any atom pair of residues i and j (all-atom contact), which are in contact if this distance is smaller than a given threshold R [57,58]. Alternatively, a contact can be defined by Cα- or Cβ-atom pair distances of the corresponding residues [59]. In our application, a contact is established if the Cα-atoms of two residues are separated by less than 11 Å, a value optimized empirically for protein-structure recognition by Bastolla et al. [60]. On the secondary-structure level, a contact between two SSEs is established if there exists at least one contact on the residue level between these SSEs. 
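The two contact-map levels described above can be illustrated with a minimal Python sketch. It is not taken from the GANGSTA implementation: the function names, the representation of an SSE as a range of residue indices, and the choice to leave the diagonal of C at zero are assumptions made for illustration. Only the 11 Å Cα cutoff and the "at least one residue contact" rule come from the text.

```python
import math

CONTACT_CUTOFF = 11.0  # Å, C-alpha distance threshold stated in the text


def residue_contact_map(ca_coords, cutoff=CONTACT_CUTOFF):
    """Build the symmetric N x N contact matrix C from C-alpha coordinates.

    C[i][j] is 1 if residues i and j are closer than `cutoff`, else 0.
    The diagonal is left at 0 (an assumption of this sketch).
    """
    n = len(ca_coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                contacts[i][j] = contacts[j][i] = 1
    return contacts


def sse_contact(contacts, sse_a, sse_b):
    """Two SSEs are in contact if at least one residue pair between them is.

    `sse_a` and `sse_b` are iterables of residue indices (hypothetical
    representation, e.g. range(start, end)).
    """
    return any(contacts[i][j] for i in sse_a for j in sse_b)
```

For example, with four residues placed on a line at x = 0, 5, 20 and 40 Å, only the first two residues are in contact, so an SSE covering residues 0 and 1 has no contact with an SSE covering residues 2 and 3.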
The GANGSTA procedure for protein-structure alignment is organized in two hierarchical levels. On the first level, SSEs are aligned by a GA that optimizes SSE contact similarity, yielding a selection of promising structure alignments. On the second level, equivalent SSEs are shifted relative to each other to maximize residue contact overlap.
Graph representation of SSEs
The three-dimensional arrangement of SSEs in a protein can be modeled suitably as an attributed, undirected graph
Structure alignment on SSE level (1st level of the hierarchy)
The problem of finding a structure alignment for a source protein structure (s) onto a target protein structure (t) of equal or larger size, represented by the graphs Gs and Gt, can be understood as the task of finding the maximum common subgraph (maximum subset of Vs and Vt) Gst ⊆ Gs, Gt. Thus, a structure alignment can be specified as a subgraph isomorphism gst: Gs → Gt composed of two maps, gV: Vs → Vt and gE: Es → Et, relating structural details between the two considered proteins. Two constraining conditions must be fulfilled for a valid structure alignment: only SSEs of the same type (α or β) can be matched, and equivalent SSEs cannot have length differences exceeding ten residues. These two conditions must hold only for SSEs that are explicitly considered in a structure alignment. If gaps are introduced, some SSEs in the source structure are ignored. Furthermore, the difference in contacts and the SSE pair orientation mismatch have to be minimized. In eq. (4) the binary operation ⊙ compares the SSE pair orientations, x, y ∈ {O, X, A, P}, of two structures. To evaluate the quality of a given structure alignment for a pair of proteins (s,t), represented by the graph monomorphism gst, we use the objective function of eq. (6). The first term in the objective function measures the structural similarity between source and target proteins by comparing the number of contacts between aligned SSEs. It is normalized to yield unity for contact identity (each contact in the source structure can be mapped on the target structure) and zero for no common contacts. The second term considers similarity in the relative orientation of SSE pairs in source and target structures, again normalized to yield unity for a perfect match and zero if none of the orientations agree. These two terms are tuned by the weights wC and wO. 
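The two validity conditions for an SSE-level alignment (matching SSE types and a length difference of at most ten residues) can be checked as in the following Python sketch. The data layout is an assumption made for illustration: each SSE is a `(type, length)` tuple with the hypothetical type labels "H" and "E", and an alignment is a list with one entry per source SSE holding either a target SSE index or `None` for a gap. The check that no target SSE is used twice reflects the injectivity of the mapping; the sketch does not evaluate the objective function.

```python
MAX_LENGTH_DIFF = 10  # residues, threshold stated in the text


def is_valid_alignment(source, target, mapping):
    """Check the validity conditions for an SSE-level alignment.

    source, target: lists of (sse_type, length) tuples (assumed layout).
    mapping: list with one entry per source SSE, a target index or None (gap).
    """
    used_targets = set()
    for s_idx, t_idx in enumerate(mapping):
        if t_idx is None:
            # Gapped source SSEs are ignored; the conditions do not apply.
            continue
        if t_idx in used_targets:
            return False  # a target SSE may be matched only once
        used_targets.add(t_idx)
        s_type, s_len = source[s_idx]
        t_type, t_len = target[t_idx]
        if s_type != t_type:
            return False  # only SSEs of the same type can be matched
        if abs(s_len - t_len) > MAX_LENGTH_DIFF:
            return False  # length difference exceeds ten residues
    return True
```

With a source of one 12-residue helix and one 6-residue strand, and a target of a 5-residue strand, a 20-residue helix and a 30-residue helix, the mapping `[1, 0]` is valid, while `[2, 0]` fails on the length condition and `[0, 1]` fails on the type condition.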
Matching SSEs with length differences above a threshold is penalized depending on SSE type by the parameter L. A small number of SSEs from the source structure may not be considered for structure alignment. Those gaps are penalized by the gap penalty factor GP to ensure that the GA tries to find the maximum common subgraph instead of an arbitrary, small subgraph. Depending on its sign, the term Seq is a weight to favor sequential or non-sequential solutions [see 1, implementation details]. The parameters wC, wO, and penalty factors L, GP, Seq in eq. (6) were optimized to yield maximum separation of structure pairs belonging to the same SCOP superfamily from those belonging to different SCOP superfamilies (see Fig. 2) referring to the GANGSTA score, eq. (10).
Genetic algorithm
GAs are heuristic methods to tackle difficult optimization problems. GAs use principles of evolution to create a set of individuals and to let them evolve from generation to generation using specific gene operations. Individuals are possible solutions (generally sub-optimal) of the optimization problem. In our case individuals can be identified with a specific graph monomorphism gst probing the similarity between two protein structures (source and target), which can be represented in terms of a vector The |Vs| components of A new generation evolves by gene exchange and mutations applied to individuals to find improved solutions with larger values of the objective function, eq. (6). The newly generated children and the fittest parents form the next generation. This procedure is repeated until the optimum is found or a suitable stop criterion is reached. 
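The generational scheme described above (children plus the fittest parents form the next generation) can be sketched generically in Python. This is not the GANGSTA GA: a plain one-point crossover and a single random mutation stand in for GANGSTA's own gene operators, gaps are omitted, the objective is passed in as an arbitrary function, and all names and default parameters are hypothetical. An individual is a list assigning one target SSE index to each source SSE; children that assign the same target SSE twice are discarded.

```python
import random


def evolve(population, objective, n_targets, generations=50, n_elite=2, seed=0):
    """Evolve a population of injective source-to-target SSE assignments.

    population: list of valid individuals (lists of distinct target indices,
                each of length >= 2; at least two individuals).
    objective:  function scoring an individual; larger is better.
    n_targets:  number of SSEs in the target structure.
    """
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        parents = sorted(population, key=objective, reverse=True)
        children = []
        while len(children) < size - n_elite:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))
            child = a[:cut] + b[cut:]              # one-point crossover (stand-in)
            pos = rng.randrange(len(child))
            child[pos] = rng.randrange(n_targets)  # random mutation (stand-in)
            if len(set(child)) == len(child):      # keep only injective children
                children.append(child)
        # The fittest parents survive unchanged alongside the new children.
        population = parents[:n_elite] + children
    return max(population, key=objective)
```

Because the fittest parents are carried over unchanged, the best objective value in the population can never decrease from one generation to the next. The fixed number of generations used here is only one possible stop criterion.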
We use the following gene operators in our GA:
1 Gene operators exchanging genes between pairs of individuals:
1A Random crossover: A random number of randomly selected components of
1B Two-point crossover: Two components of
1C Helix crossover: All components of
2 Gene mutation operators applied to a single individual:
2A Random mutation: A small, random number of
2B Exchange mutation: Two components of
2C Greedy mutation: For a random component gj (gene) the allele with the highest value of the objective function obj(
Some of the gene operators create children that do not agree with our constraints (for instance a duplicate usage of one SSE in the same structure violating the injectivity of the monomorphism). Those "lethal" children are discarded.
Structure alignment on residue level (2nd level of the hierarchy)
The result from the GA is a structure alignment on the SSE level. Often there are length differences among pairs of matched SSEs. In this case, the shorter SSE is shifted along the longer SSE to find an optimal arrangement with respect to residue pair contacts. Two methods are used to solve the problem. All possible combinations of residue assignments for each pair of SSEs from the structure alignment are considered to find the most similar residue pair contact map. Gaps in an individual SSE on the residue level would result in an SSE consisting rather of two instead of one SSE (if the gap is close to the center of the SSE) or in an effectively shorter SSE (if the gap is introduced on the edge of the SSE). These situations are considered on the SSE level as two independent SSEs or as a shorter SSE, respectively. Hence, no gaps in SSEs need to be considered. The residue contact map overlap where Cs and Ct are Cα contact maps of the source and target protein structures, respectively. The combination of SSE assignment (
GANGSTA score
The last step in the structure alignment procedure is a superposition of source and target protein structures minimizing the RMSD of the Cα atoms (RMSD(Cα)) with the Kabsch algorithm [28]. For ranking the quality of multiple pairwise structure alignments, the value of the objective function, eq. (6), is only a crude measure: it works on the SSE level and was designed for fast screening of the many individuals occurring in the GA. The residue contact map overlap This GANGSTA score is normalized by the number of aligned residues NalnRes and a small ε = 10^-5 is added in the denominator to avoid division by zero. The smaller the GANGSTA score is, the larger is the structural agreement between the considered pair of proteins. ΔSSE, is defined as where the sums run over the number of SSE pairs
Statistical significance of GANGSTA scores
To assess the quality of pairwise protein-structure alignments we use a method described by Ortiz et al. [50] and Vesterstrøm et al. [14] following the work of Levitt and Gerstein [54] and Abagyan and Batalov [62]. To estimate the statistical significance of GANGSTA scores, eq. (10), we calculate a P-value describing the probability of obtaining a better GANGSTA score than the observed one when aligning unrelated structures. This P-value can be obtained by fitting a Type I extreme value distribution function (Gumbel distribution) to the GANGSTA score distribution resulting from pairwise structure alignments of unrelated proteins. The Gumbel distribution possesses the probability density function [63] of eq. (15), with parameters a and b for location and width, respectively. To fit the GANGSTA score distribution with the Gumbel probability density function, the parameters a and b in eq. (15) need to be determined. Since protein-structure alignments are of higher quality for smaller GANGSTA scores, this part of the Gumbel distribution is more relevant for the fit than the tail at larger GANGSTA scores [50]. 
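A fit of this kind can be sketched in Python under one explicit assumption: that the cumulative Gumbel distribution has the standard Type I form P(t ≤ x) = exp(−exp(−(x − a)/b)), which is the form whose mean is a + γb. Taking the logarithm twice gives −ln(−ln P) = (x − a)/b, a straight line in x with slope 1/b and intercept −a/b, so a and b follow from an ordinary least-squares line. The function names and the plotting position rank/(n + 1) for the empirical probabilities are choices of this sketch, not details from the paper.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def empirical_cdf(scores):
    """Return (x, p) pairs with p = rank / (n + 1) (assumed plotting position)."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(x, (i + 1) / (n + 1)) for i, x in enumerate(ordered)]


def fit_gumbel(points):
    """Estimate (a, b) from (x, p) pairs via the double-log linearization."""
    xs = [x for x, _ in points]
    ys = [-math.log(-math.log(p)) for _, p in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals 1 / b
    intercept = mean_y - slope * mean_x    # equals -a / b
    b = 1.0 / slope
    a = -intercept * b
    return a, b


def gumbel_cdf(x, a, b):
    """P(t <= x) for the assumed standard Gumbel form."""
    return math.exp(-math.exp(-(x - a) / b))


def gumbel_mean_std(a, b):
    """Mean a + gamma*b and standard deviation pi*b/sqrt(6) of that form."""
    return a + EULER_GAMMA * b, math.pi * b / math.sqrt(6.0)
```

In use, `fit_gumbel(empirical_cdf(scores))` would be applied to GANGSTA scores from alignments of unrelated proteins, and `gumbel_cdf(x, a, b)` then gives the probability of obtaining a score of x or lower by chance.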
Therefore, we evaluated the probability of obtaining GANGSTA scores t lower than x. The corresponding expression of the Gumbel distribution is given by eq. (16). Eq. (16) can be transformed into a linear expression by applying the logarithm twice, yielding eq. (17). The parameters a and b can now easily be estimated by a linear fit between the probability of GANGSTA scores t ≤ x obtained from structure alignment of unrelated proteins and the corresponding probability PG(t ≤ x) from the Gumbel distribution. Once we have determined a and b, we can calculate the mean μ = a + γb (18), where γ = 0.5772 is the Euler-Mascheroni constant, and the standard deviation of this distribution. Using the linear transformation
Database search
For a database scan a reference structure is compared (aligned) with all sample structures in the database. In most applications the reference structure is also the source structure, i.e. the reference structure is smaller than the sample structure from the database. However, the reference structure can also be the target structure if the sample structure from the database is smaller than the reference structure. To speed up database searches, a pre-filter is applied to limit the search to proteins that match certain criteria. These involve the number of SSEs, the structure diameter (i.e. the maximum distance between any pair of SSEs, measured between the Cα atoms in the SSE centers) and the number of SSEs in contact (based on Cα atom distances). A protein structure from the database (sample structure) is only considered for structure alignment if the corresponding pair of source and target structures fulfills the following three basic criteria. (i) The target structure has at most one α-helix or one β-strand fewer than the source structure. (ii) The structure diameter of the source structure should be at most twice as large as the diameter of the target structure. 
(iii) The source structure should have no more than twice as many α-helix or β-strand pairs in contact as the target structure. Additionally, for the computationally demanding second level of the method, the residue-based structure alignment step, a rough estimate for the contact map optimization is used. To estimate the contact overlap value q, eq. (9), we use a greedy algorithm, which starts by finding the optimal offset (see Fig. 10) for the SSE pair yielding the largest number of contacts. Then the algorithm continues by finding the optimal offset for the pair having the second largest number of contacts, and so forth. While the problem of finding a globally optimal residue alignment cannot be solved with such a local strategy, the estimated overlap values are in good agreement with optimal results. Moreover, this estimate is up to 10,000 times faster than the method used for finding optimal structure alignments on the residue level as described above. Since we are using an estimated contact overlap q, eq. (9), the reported P-value for database scans is only an upper bound of the P-value for pairwise alignments.
Protein-structure datasets
Non-redundant dataset of protein structures (SCOP40)
We used a non-redundant subset of the ASTRAL SCOP compendium [64] version 1.67 including only SCOP [48] domain structures with at most 40% sequence identity. The SCOP40 dataset can be downloaded from the ASTRAL webpage [65]. To guarantee an appropriate performance of GANGSTA, all structures in the dataset contain at least two SSEs and have more than 30% of their residues in SSEs, resulting in 7158 domain structures. This dataset is used for all database searches with the web version of GANGSTA.
Protein-structure datasets for statistical significance of classification tasks
From the SCOP40 dataset we generated two additional datasets. SAME_SF40 consists of 4982 random pairs of SCOP domain structures taken from the same SCOP superfamilies. 
The protein pairs involve 672 different SCOP domains taken from 113 different SCOP superfamilies belonging to 99 different SCOP folds. DIFF_SF40 consists of 88909 random pairs of SCOP domain structures where for each pair the proteins are taken from different SCOP superfamilies. This dataset of protein pairs involves 500 different SCOP domains from 317 different SCOP superfamilies belonging to 243 different SCOP folds. The domain lists of DIFF_SF40 and SAME_SF40 are shown in Tables S6 and S7 [see 1]. The list of the corresponding domain pairs can be provided on request.
Four-Helix-Bundle dataset
This dataset comprises ten proteins belonging to four different folds and six different superfamilies in the SCOP classification scheme. Table S3 [see 1] shows the dataset of ten proteins and their SCOP annotations [48]. This dataset was used before in [44].
TRAF dataset
The dataset consists of eight proteins that belong to two different folds in the all-β class of the SCOP database. Four proteins (PDB-IDs: 1czyA, 1kzzA, 1lb4, 1k2fA) belong to the "TRAF (TNF Receptor Associated Factor) domain-like" fold but are members of two different families: 1czyA, 1kzzA, and 1lb4 were taken from the "TRAF domain" family, whereas 1k2fA belongs to the "SIAH" family. Four proteins (PDB-IDs: 1bmg, 1frtB, 1igtA, 1k8iA) of the TRAF dataset belong to the "C1 set domains" family of the "Immunoglobulin-like beta-sandwich" fold. This dataset was used before in [44].
Rossmann dataset
The dataset consists of seven protein domains that contain Rossmann and Rossmann-like structure motifs according to the CATH or SCOP classification schemes. The proteins (target structures) are listed in Fig. 7. All proteins have less than 40% sequence similarity.
Fischer dataset
This dataset consists of ten protein-structure pairs introduced by Fischer et al. [66] and used by Novotny et al. [53]. Novotny added the last pair (1g61/1jdw). 
The PDB IDs of the protein pairs are: 1bgeB/2gmfA, 1cewI/1molA, 1cid/2rhe, 1crl/1ede, 1fxiA/1ubq, 1ten/1hhrB, 1tie/4fgf, 2azaA/1paz, 2sim/1nsbA, 3hlaB/2rhe, 1g61/1jdw.
Novotny dataset
This dataset consists of representative proteins from four different CATH [46] classes (classes: mainly-α, mainly-β, mixed-α-β, few SSEs) and was applied in a recent performance test by Novotny et al. [53]. The protein domains and their corresponding CATH classifications are listed in Table S5 [see 1]. The whole Novotny dataset and the benchmark results are available at [67].
Authors' contributions
BK designed and carried out research and drafted the manuscript. PM designed and carried out research and drafted the manuscript. TSG discussed and applied the GANGSTA methodology. TS coordinated research and helped to draft the manuscript. EWK coordinated and designed research and drafted the manuscript. All authors read and approved this final version.
Acknowledgements
We thank Dr. Dennis Diestler for critical reading of the manuscript. This project was supported by the DFG Forschergruppe Project KN 329/5-1/5-2, GRK 80/2, GRK 268/2, GKR 788/1 and by the BMBF (Germany), grant no. 031U209A (BCB).
References