Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA

Laboratory of Molecular Biology, CCR, NCI, NIH, DHHS, Bethesda, MD, USA

Mathematique Informatique et Genome, INRA, Jouy-en-Josas, France

Abstract

Background

Current classifications of protein folds are based, ultimately, on visual inspection of similarities. Previous attempts to use computerized structure comparison methods show only partial agreement with curated databases, but have failed to provide detailed statistical and structural analysis of the causes of these divergences.

Results

We construct a map of similarities/dissimilarities among manually defined protein folds, using a score cutoff value determined by means of the Receiver Operating Characteristic (ROC) curve. It identifies folds which appear to overlap or to be "confused" with each other by two distinct similarity measures. It also identifies folds which appear inhomogeneous in that they contain apparently dissimilar domains, as measured by both similarity measures. At a low (1%) false positive rate, 25 to 38% of domain pairs in the same SCOP folds do not appear similar. Our results suggest either that some of these folds are defined using criteria other than purely structural considerations or that the similarity measures used do not recognize some relevant aspects of structural similarity in certain cases. Specifically, variations of the "common core" of some folds are severe enough to defeat attempts to automatically detect structural similarity and/or to lead to false detection of similarity between domains in distinct folds. Structures in some folds vary greatly in size because they contain varying numbers of a repeating unit, while similarity scores are quite sensitive to size differences. Structures in different folds may contain similar substructures, which produce false positives. Finally, the common core within a structure may be too small, relative to the entire structure, to be recognized as the basis of similarity to another.

Conclusion

A detailed analysis of the entire available protein fold space by two automated similarity methods reveals the extent and the nature of the divergence between the automatically determined similarity/dissimilarity and the manual fold type classifications. Some of the observed divergences can probably be addressed with better structure comparison methods and better automatic, intelligent classification procedures. Others may be intrinsic to the problem, suggesting a continuous rather than discrete protein fold space.

Background

A protein fold is often defined by the number, direction in space and connectivity (or topology) of its secondary structural elements

The situation is complicated by the presence of domains in protein structures. Their identification and delineation are not straightforward. Nevertheless, to have a better understanding of the effect of discrete classification as a description of the fold space, we analyzed the SCOP domain classification using two structure comparison methods applied directly to these domains. Numerous structure comparison methods exist

Here we use two structure comparison methods which are based on different principles and with which we are familiar. One, VAST

Results

ROC curves

The ROC curves of each method show that both VAST and SHEBA are generally successful in detecting when two domains are in the same SCOP fold (Figure


**ROC Curves**. ROC curves of VAST (dotted line) and SHEBA (solid line) obtained by plotting the True Positive Rate (

An optimal cutoff value for the binary decision of similarity can be determined from the ROC curve either by specifying the desired
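The cutoff selection described here can be illustrated with a minimal sketch. It assumes that higher scores mean greater similarity, that each domain pair carries a label saying whether SCOP places both domains in the same fold, and that the cutoff is chosen by fixing a maximum overall false positive rate (the 1% level used throughout this paper). The function names `roc_points` and `cutoff_for_fpr` are hypothetical and are not taken from VAST, SHEBA or the authors' analysis code.

```python
def roc_points(scores, same_fold):
    """Return (cutoff, FPR, TPR) triples, one per distinct score, highest cutoff first.

    A pair is called "similar" when its score is >= cutoff.
    Assumes at least one same-fold pair and one different-fold pair.
    """
    positives = sum(1 for label in same_fold if label)
    negatives = len(same_fold) - positives
    ranked = sorted(zip(scores, same_fold), key=lambda pair: -pair[0])
    points, tp, fp, k = [], 0, 0, 0
    while k < len(ranked):
        cutoff = ranked[k][0]
        # Consume every pair tied at this score before emitting a ROC point.
        while k < len(ranked) and ranked[k][0] == cutoff:
            if ranked[k][1]:
                tp += 1
            else:
                fp += 1
            k += 1
        points.append((cutoff, fp / negatives, tp / positives))
    return points


def cutoff_for_fpr(scores, same_fold, max_fpr=0.01):
    """Most permissive cutoff whose overall false positive rate stays <= max_fpr.

    Returns (cutoff, FPR, TPR), or None if even the strictest cutoff exceeds max_fpr.
    """
    best = None
    for cutoff, fpr, tpr in roc_points(scores, same_fold):
        if fpr > max_fpr:
            break  # FPR only grows as the cutoff is lowered
        best = (cutoff, fpr, tpr)
    return best
```

Because the false positive rate can only increase as the cutoff is relaxed, the scan stops at the first ROC point that exceeds the target rate; the true positive rate reported at the selected cutoff is the quantity that is then traded off against the chosen false positive rate.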

Confusion matrix heat maps

Figure


**Confusion matrix heat map**. Confusion matrix heat map for VAST with a TPR_{i}(c) (eq. 4, see Methods) and fold-specific false positive rate FPR_{i,j}(c) (eq. 3, see Methods) respectively. To improve the visibility of the heat maps, rates between 0 and 0.2 are represented in grey scale where white corresponds to a rate of 0 and black to a rate at or above 0.2. For high resolution heat maps of VAST and SHEBA, [See

• VAST and SHEBA heat maps • Complete heat map of VAST and SHEBA, obtained for a _{i}(c) (eq. 4, see Methods) and fold-specific false positive rate _{i,j}(c) (eq. 3, see Methods) respectively.


Neither the VAST nor the SHEBA heat map is strictly symmetric (Figure

VAST uses a heuristic algorithm to find the maximal clique, so the comparison of domain A with B may not select the same clique as the comparison of B with A when there are several near-maximal cliques. The result is a slight asymmetry in the

False negatives

The true positive rate varies with fold class, as illustrated in Figure


**Distribution of true positive rates**. Distribution of fold-specific true positive rates within each SCOP class (A to G) for VAST and SHEBA. TPR_{i} values (eq. 4, see Methods) are obtained using the same cutoff values as in Figure 2. The scale of the y axes for the VAST and SHEBA distributions is the same within each fold class. Histogram bar height represents the number of folds for a given range of TPR_{i}. The x axis is divided into 20 bins. The class-specific average TPR_{i} is reported within each subplot. For the list of TPR_{i} obtained by each fold, with VAST and SHEBA, [See

About 40% of the folds (216) achieve a fold-specific true positive rate (TPR_{i}) above 0.85 for both methods. All classes are nearly proportionally represented in this set. For the exhaustive list of TPR_{i} obtained by each SCOP fold with VAST and SHEBA, [See

• Fold-specific True Positive Rates (see Methods) at 1% False Positive Rate, for VAST and SHEBA, for 468 SCOP Folds in the order of the Heat Map. • Columns 1 to 7 correspond respectively to: the row number, the SCOP fold identifier, the number of domains within a fold, the TPR_{i} value obtained by the fold with VAST, the TPR_{i} value obtained by the fold with SHEBA, the SCOP name of the fold, and the SCOP description of the fold.


To investigate why some domain pairs in the same SCOP fold are not detected as similar, we look at such domain pairs that belong to the same SCOP fold and for which the

Folds having domain pairs with undetected similarity by both VAST and SHEBA.

| Class | List of folds |
|---|---|
| A | a.4(1576/13572), a.118(777/2550), a.39(282/1640), a.60(238/812), a.138(166/272), a.24(77/930), a.1(62/930), a.2(47/272), a.100(39/90), a.25(37/182), a.3(37/992), a.29(25/132), a.26(20/650), a.23(10/20), a.28(10/72), a.69(9/20), a.7(9/342), a.93(8/42), a.102(7/600), a.112(4/20), a.127(4/30), a.35(4/110), a.61(4/30), a.5(3/90), a.55(3/20), a.74(3/272), a.116(2/20), a.126(2/30), a.133(2/20), a.137(2/6), a.64(2/20), a.128(1/42), a.144(1/12), a.27(1/72), a.48(1/6), a.6(1/42) |
| B | b.1(2973/57840), b.40(1382/7482), b.34(436/2652), b.82(341/930), b.10(323/1640), b.2(164/702), b.29(163/1056), b.85(91/156), b.43(69/702), b.84(49/182), b.30(32/110), b.50(16/132), b.18(14/552), b.13(12/110), b.35(11/72), b.7(11/182), b.19(10/30), b.6(8/1406), b.80(8/110), b.92(7/56), b.3(6/110), b.60(6/420), b.106(5/6), b.52(4/132), b.49(3/12), b.58(3/20), b.21(2/6), b.45(2/12), b.53(2/6), b.83(2/2) |
| C | c.37(6218/14762), c.1(1152/32942), c.55(929/2756), c.26(255/1722), c.52(228/506), c.2(197/9702), c.23(161/4160), c.69(92/2550), c.94(90/600), c.66(87/1190), c.56(38/552), c.47(17/2550), c.58(16/110), c.92(16/110), c.3(13/2070), c.10(12/306), c.53(12/72), c.8(12/90), c.14(9/110), c.51(9/156), c.72(6/210), c.43(4/42), c.61(3/272), c.36(2/342), c.19(1/6), c.63(1/20), c.78(1/132), c.87(1/30), c.9(1/2), c.97(1/12) |
| D | d.58(2052/17556), d.92(235/552), d.3(221/380), d.142(164/380), d.15(104/3080), d.169(74/552), d.26(74/306), d.17(59/552), d.81(54/210), d.153(49/600), d.166(42/90), d.211(40/132), d.144(33/650), d.110(26/306), d.129(23/182), d.68(23/90), d.2(22/132), d.14(14/240), d.79(14/210), d.108(12/210), d.16(12/182), d.87(10/156), d.4(8/12), d.104(5/210), d.109(4/182), d.122(4/110), d.143(4/6), d.41(4/90), d.67(4/20), d.10(3/20), d.50(3/72), d.184(2/2), d.52(2/90), d.18(1/2), d.74(1/56), d.82(1/6) |
| E | e.8(110/182), e.26(3/6) |
| F | f.1(58/110), f.4(46/182), f.21(12/42), f.23(5/20), f.7(4/6) |
| G | g.3(357/1406), g.41(96/420), g.15(5/90), g.17(4/132), g.39(2/132) |

Folds from classes A, B, C, D, E, F and G are reported in rows labeled by the name of the class. Reported folds within a given class are ordered by decreasing number of domain pairs with undetected similarity. The number of such pairs within a fold and the total number of pairs are indicated for each fold in parentheses. Similarity between domains of a pair was considered undetected when their

Detailed analysis of these false negative pairs highlights some common factors which explain the varying success of automated methods in detecting the similarity among domains in a SCOP fold. Most of the false negatives can be explained by structural variation within a fold and, to a lesser extent, by structures made of repeating units.

Structural variation of the common core

In many cases, the structure of the


**Structural variations within fold b.1**. Domains (a) d1c5ch2, (b) d1akjd_, and (c) d1pama1 belong to the fold b.1 (Immunoglobulin-like beta-sandwich; 7 strands in 2 sheets greek-key, some members of the fold have additional strands). Domain pair (a) and (b) have

Structures made of repeating units

Automated similarity detection methods do not necessarily consider two structures similar if they contain the same simple structural motif but with a different number of repeats. The SCOP fold a.118 provides an extreme example. It is defined by domains that are comprised of repeated occurrences of a helix-loop-helix motif


**Repeat of a structural motif within fold a.118**. The color scheme is the same as in Figure 4. Structures of domains (a) d1a17_, (b) d1kula_ and (c) d1qbkb_ from fold a.118 (alpha-alpha superhelix, multihelical; 2 (curved) layers: alpha/alpha; right-handed superhelix). Domains have 159 residues and 7 helices, 211 residues and 10 helices, and 888 residues and 48 helices, respectively. The VAST similarity score

Decoration of the common core by many secondary structure elements

Occasionally two proteins in the same SCOP fold share a common core but are different in overall shape. An extreme example is shown in Figure


**Decoration of a common core**. Structures of domains d1e9ga_ (a) and d1enfa1 (b) of SCOP fold b.40 (barrel, closed or partly opened n = 5, S = 10 or S = 8; greek-key). Color scheme is the same as in Figure 4. Domain (a) has 284 residues, and (b) has 100 residues.

Miscellaneous cases

Some folds, such as fold d.184 or a.138, are described in SCOP as including a variety of structures. We also note the existence of several ambiguous fold definitions that necessarily lead to a low TPR_{i}. For instance, fold c.37, whose SCOP description is "3 layers: alpha/beta/alpha, parallel or mixed beta-sheets of variable sizes", can probably be split into at least 2 folds. We also spotted what appears to be a bookkeeping error by SCOP. Domains d1kkea2 and d1qiua2 of fold b.83 were not found to be similar by either VAST or SHEBA. The protein 1kke has two domains, which belong to two different folds. The N-terminal domain (residues 250–312) forms an extended structure belonging to the SCOP fold b.83 ("Triple beta-spiral"). The C-terminal domain (residues 313–455) forms a beta barrel belonging to the SCOP fold b.21 ("Virus attachment protein globular domain"). In SCOP and in the Astral database, the domain d1kkea1, which consists of residues 250–312, is placed in the b.21 fold and d1kkea2, which consists of residues 313–455, is placed in the b.83 fold.

Differences between VAST and SHEBA

There are 27 folds with a TPR_{i} below 0.05 by VAST yet above 0.9 by SHEBA. They are a.16, a.37, a.38, a.97, a.115, a.121, a.130, a.158, a.159, b.76, c.107, d.6, d.83, d.88, d.101, d.118, d.175, f.10, f.14, f.17, g.14, g.22, g.24, g.38, g.49, g.50, g.53. These are mainly small folds with only 2 domains each. No fold has been identified with a TPR_{i} below 0.05 by SHEBA but above 0.9 by VAST. Additionally, the class-specific true positive rates reported in Figure

Some of the differences observed between VAST and SHEBA are related to the calculation of the scoring function in VAST (see Appendix, Calculation of _{i }averages only 0.2 for folds with 2,3 or 4 SSEs, but rises to 0.7 when the fold has about 9 or more SSEs (data not shown). But at least one case could not be explained by the issue of the

False positives

The off-diagonal pixels in the heat maps, on Figure _{i,j}. The confusion made by each method has different characteristics, shown by the difference in the distribution of the dark areas. There are a relatively small number of pixels between classes. In contrast, confusion within each class varies with the method and can be high.

The main confusion is within classes B, C and D, with respectively 37 folds out of 78 within class B, 80 folds out of 94 within class C, and 53 folds out of 139 within class D involved in some type of confusion. VAST does not show a noticeable level of confusion within classes A and F, although SHEBA does. The relatively high A-class confusion level for SHEBA is probably related to its use of the dynamic programming algorithm, without gap penalty, in finding the best alignment between a pair of superimposed structures

Besides these global observations, more specific confusion trends can be determined by analyzing the predominant confusion patterns shown by the heat maps.

Intraclass confusion

Confused folds occur mainly near the diagonal of the sorted heat map, as a result of the hierarchical clustering and re-ordering of the folds within each fold class (see Methods).

Table

Sets of folds confused by both VAST and SHEBA.

| Row | Sets of confused folds, S | Number of domains in S | SHEBA FPR_{S} (%) | SHEBA TPR_{S} (%) | SHEBA FPR_{S}/TPR_{S} (%) | VAST FPR_{S} (%) | VAST TPR_{S} (%) | VAST FPR_{S}/TPR_{S} (%) | Explanation for confusion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | a.28, a.39 | 50 | 29 | 57 | **51** | 10 | 16 | **64** | 4 helix bundle up-and-down (a.28), and 4 helix array of 2 hairpins folds. Confusion is caused by match of helices oriented similarly. Folds confused mostly by SHEBA. |
| 2 | a.46, a.52 | 9 | 45 | 97 | **46** | 7 | 36 | **20** | 4 helix bundle left and right-handed super helix (a.46), and 4 helix right-handed super helix folds. Confusion is caused by match of helices oriented similarly. Folds confused mostly by SHEBA. |
| 3 | a.47, a.7 | 24 | 87 | 88 | **98** | 8 | 20 | **40** | 3 helix bundle (a.7) and 4 helix bundle (a.47) folds. Confusion due to match of very similar structure. Folds confused mostly by SHEBA. |
| 4 | b.68, b.69, b.66, b.67, b.70 | 45 | 92 | 98 | **94** | 40 | 83 | **48** | Beta-propeller (repetitive 4-stranded blades) folds, of 4, 5, 6, 7 or 8 blades depending on the fold. Confusion is caused by match of several 4-stranded blades among domains of these folds. |
| 5 | b.1, b.2, b.3, b.7, b.12 | 297 | 19 | 66 | **29** | 32 | 68 | **48** | Beta sandwich folds of 7, 8, 9 stranded-sheet, with Greek-key topology. The motif causing the confusion among folds is a sandwich, which is rather well matched between domains of these folds. |
| 6 | b.24, b.71 | 24 | 69 | 97 | **72** | 27 | 93 | **29** | Sandwich fold, with 10 strands in 2 sheets, and "folded meander topology" fold (b.24), and folded sheet with Greek-key topology. Confusion is due to match of parts of the sheets of the common core of these folds. |
| 7 | b.60, b.61 | 30 | 63 | 90 | **70** | 57 | 78 | **74** | Closed barrel, with meander topology. Confusion caused by good match between barrel motifs of the common core. |
| 8 | b.43, b.49, b.58, b.44 | 39 | 42 | 71 | **59** | 32 | 72 | **44** | Folds of closed barrel with Greek-key topology. Confusion is due to the match of a substantial part of the barrel common core, among domains of these folds. |
| 9 | b.107, b.4 | 4 | 100 | 100 | **100** | 25 | 100 | **25** | Sandwich fold (b.4), and closed barrel fold (b.107). Confusion is caused by the good match between a deformed barrel motif and a sandwich motif. |
| 10 | b.34, b.38 | 62 | 69 | 67 | **103** | 19 | 49 | **39** | Barrel folds, with meander topology. Confusion is caused by the match between the barrel common cores. |
| 11 | b.38, b.56 | 12 | 52 | 100 | **52** | 65 | 93 | **70** | Open barrel (b.38) and closed barrel (b.56) folds. Confusion is caused by the match of the barrel. |
| 12 | b.10, b.19, b.13, b.18, b.22, b.23 | 91 | 42 | 76 | **55** | 16 | 54 | **29** | Folds with common core motif of beta sandwich; the 2 sheets are made of 8, 9 or 10 strands depending on the fold, and with jelly roll topology. The confusion among these folds is caused by the match of the strands of the beta sandwich common core. |
| 13 | c.1, c.6 | 185 | 62 | 75 | **83** | 78 | 87 | **90** | TIM barrel (c.1) and variant of beta/alpha barrel, with closed parallel beta-sheet barrel (c.6) folds. Confusion is caused by the match of almost the whole TIM barrel. |
| 14 | c.8, c.98 | 14 | 50 | 75 | **68** | 30 | 54 | **56** | 3 layer beta/beta/alpha (c.8) and 3 layer alpha/beta/alpha (c.98) folds. Confusion is caused by the match between common beta/alpha layers. |
| 15 | c.84, c.95 | 19 | 65 | 91 | **71** | 55 | 92 | **60** | 3 layer alpha/beta/alpha of 4 strands (c.84), and of 5 strands (c.95) folds. Match of the 3 layer alpha/beta/alpha common core causes the confusion. |
| 16 | c.101, c.73, c.27 | 7 | 11 | 100 | **11** | 49 | 100 | **49** | 3 layer alpha/beta/alpha folds, with 5, 6 or 8 strands depending on the fold. Confusion is caused by the match of the 3 layer alpha/beta/alpha common core. |
| 17 | c.100, c.28, c.25, c.24, c.30, c.78, c.108, c.116, c.31, c.114, c.3, c.4, c.49, c.59, c.16, c.57, c.44, c.48, c.2, c.33, c.32, c.34, c.23, c.62, c.65, c.5 | 334 | 24 | 80 | **31** | 51 | 92 | **56** | 3 layer alpha/beta/alpha folds, with beta sheet of 4, 5, 6 or 7 strands depending on the fold. 3 layer beta/beta/alpha with a central sheet of 5 strands for c.3. Confusion among 3 layer alpha/beta/alpha folds is caused by the match of the 3 layer alpha/beta/alpha common core. Confusion between 3 layer alpha/beta/alpha and beta/beta/alpha is caused by the match of the 2 layer beta/alpha. |
| 18 | d.13, d.173 | 7 | 26 | 93 | **28** | 43 | 86 | **50** | Fold containing the 3 layer alpha/beta/alpha common core (d.13) and unusual fold containing a common core of beta-alpha-beta-alpha-beta-alpha-beta (d.173). Confusion caused by the match of some strands and helices. |
| 19 | d.65, d.67 | 7 | 47 | 46 | **102** | 60 | 64 | **93** | 2 layer alpha/beta sandwich fold. Confusion caused by the match of 2 layer alpha/beta sandwich common core. |
| 20 | d.181, d.212 | 5 | 50 | 60 | **83** | 17 | 60 | **28** | Folds containing beta-alpha-beta units. Confusion caused by match on the alpha/beta layers. |
| 21 | d.10, d.50 | 14 | 34 | 66 | **51** | 40 | 61 | **66** | 2 layer alpha/beta folds. Confusion caused by match on the 2 layer alpha/beta common cores. |
| 22 | d.140, d.68 | 12 | 34 | 68 | **51** | 40 | 52 | **77** | Fold with 2 layer beta/alpha sandwich common core. Confusion is caused by match of the 2 layer beta/alpha sandwich. |
| 23 | d.151, d.160 | 7 | 75 | 100 | **75** | 58 | 100 | **58** | Beta-sandwich; duplication of alpha+beta (d.151), 4 layers: alpha/beta/beta/alpha; mixed beta sheets (d.160) folds. Confusion due to match of the alpha beta sandwich. |
| 24 | d.95, d.206, d.64 | 12 | 18 | 96 | **18** | 34 | 79 | **43** | 2 layer alpha/beta sandwich folds. Confusion caused by the match of the 2 layer alpha/beta sandwich. |
| 25 | d.11, d.40 | 5 | 100 | 100 | **100** | 67 | 100 | **67** | 2 layer alpha/beta sandwich folds. Confusion caused by match of the 2 layer alpha/beta sandwich. |
| 26 | d.130, d.80, d.52 | 19 | 53 | 90 | **59** | 51 | 62 | **82** | 2 layer alpha/beta sandwich folds. Confusion is caused by the match of the 2 layer alpha/beta sandwich. |
| 27 | d.45, d.74, d.58, d.51, d.94, d.141, d.105 | 160 | 43 | 58 | **74** | 48 | 59 | **81** | 2 layer alpha/beta sandwich, and two beta-sheets and one alpha-helix packed around single core (d.141) folds. Confusion caused by match of the sheet and strands of the 2 layer alpha/beta sandwich core motif. |
| 28 | e.24, c.16, c.57, c.44, c.23, c.5 | 79 | 47 | 73 | **64** | 68 | 85 | **80** | A domain component of a "multi-domain" domain of fold e.24 matches the full domain of another fold which does not belong to the E class. |
| 29 | e.4, c.48, c.2, c.32, c.33, c.34, c.23 | 178 | 35 | 74 | **48** | 74 | 87 | **85** | A domain component of a "multi-domain" domain of fold e.4 matches the full domain of another fold which does not belong to the E class. |

Clusters of confused folds in VAST and SHEBA heat maps are reported. Rows 1 to 27 are intra-class clusters of confused folds found along the diagonal of the heat map. Only confusions in classes A, B, C and D are reported. Rows 28 and 29 are two off-diagonal clusters involving multi domains. Clusters and confused folds are listed in the order of appearance in the heat map. The heat maps of both methods obtained at 1% overall _{S}, _{S }(see Methods) and their ratios (in bold), for SHEBA, respectively, similarly, columns 7 to 9, report _{S}, _{S }and their ratios (in bold), for VAST, respectively.

Confused folds in the A class include helix bundles of either identical or a similar number of helices in similar relative orientations. Examples are reported in Table

Figure


**Confusion matrix for the B class**. Confusion matrix heat map for VAST and SHEBA showing confusion among some SCOP folds of class B, mainly beta domains. Fold identifiers appear on the x and y axes. Grey scale from white to black for positive rates from 0 to 1.


**Similar structures in different SCOP folds**. Structures of domains (a) d1gyha_ of fold b.67 and (b) d1loqa2 of fold b.69, with 318 residues and 295 residues respectively. They correspond to beta propeller domains with respectively 5 and 7 four-stranded blades. The

The next cluster of five folds in Figure

A large common confusion pattern among folds appears at the bottom right corner of the C class area of the heat map (Figure


**Superposition of two structures**. Superposition by VAST of two structures from different 3 layers alpha/beta/alpha SCOP folds of class C. View of backbones of domains (a) d1a8p_2 and (b) d1a9xa2, from folds c.25 and c.24, respectively. The common parts of both structures superposed by VAST, are in red and the unmatched residues in green. The superposition aligned 71 residues; d1a8p_2 has 158 residues and d1a9xa2 has 138; RMSD = 2.7,

The C class also shows some small confused sets among folds with different architectures. For example, confused folds c.1 and c.6 (Table

Figure


**Confusion matrix heat map for the D class**. Confusion matrix heat map for the D class for VAST and SHEBA showing clusters of confused SCOP folds. The fold identifiers appear on the x and y axes. Grey scale from white to black for positive rates from 0 to 1.

We have noticed confusions involving distinct motifs such as between the beta sandwich fold b.4 and beta barrel fold b.107, (Table


**Confusion between SCOP folds of class B**. Color scheme is the same as in Figure 4. Domain (a) d1tvda_ and domain (b) d1pama1 belong to the same fold, b.1 (sandwich; 7 strands in 2 sheets; greek-key), and are found similar with

Interclass confusion

Finally, the heat maps also show off-diagonal grey or black pixels where members of a SCOP fold in one class are detected as similar to domains in another. Both heat maps present such confusion patterns. As apparent in Figure

The confusions involving the E-class ("Folds consisting of two or more domains belonging to different classes") are easily understandable. They all involve structures which contain a domain which shares similarity with another domain in a different class, mainly class C. Examples include fold e.24 confused with c.16, c.57, c.44, c.23 and c.5, (Table

Additionally, SHEBA confuses some folds from class A, with folds in classes D and F ("membrane proteins"). The most confused folds from the A and D classes, having more than 100 confused domain pairs, are: (a.118, d.211: 250 confused pairs), (a.60, d.58: 132), (a.1, d.58: 118), (a.77, d.58: 114), (a.6, d.58: 104), (a.4, d.95: 104). For confused folds a.118 and d.211, for example, even though VAST and SHEBA match a similar number of residues, the Sheba

Discussion

The combined use of the ROC curve and the confusion matrix heat map has been the key in making this large scale analysis of protein classification. Several authors

Aside from providing a global measure of the agreement, ROC curves are also useful because they provide a practical means to select a score cutoff value for deciding if a pair of structures is to be considered similar or not, by trading off true and false positive rates. Other approaches have used methods other than ROC analysis or have ignored that tradeoff entirely. In their comparison of several structure comparison methods with CATH, Sierk and Pearson

Although the ROC AUC varies somewhat by method, none of the reported values is as high as desired. This raises a fundamental and important question: What mechanisms cause the automatic structural comparison methods to diverge so significantly from SCOP or CATH? To address this aspect of the problem, we need to descend from a global view of the database to a more detailed view of individual folds and finally of the domains comprising each fold. To investigate why structural comparison methods diverge from SCOP, we used the confusion matrix to distribute the 1% false positive comparisons to the individual fold pairs, resulting in a "false and true positive rates" map of the protein fold space. This can be distinguished from the map of the fold space constructed by Hou

In looking at a particular area of our heat map, we can calculate an index of how likely a method is to confuse those folds, as the ratio of the average fold-specific false positive rate to the average fold-specific true positive rate in that area. A value near 1 indicates that, on average, the folds in this area cannot be distinguished by the structure comparison method. It is worth noting that this index is cutoff dependent, as it is expressed in terms of true and false positive rates, and can thus be obtained for more or less severe false positive rates. The index of confusion is related to, but distinct from, the index of "gregariousness" in Harrison
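The index of confusion for an area of the heat map can be sketched as follows. This is an illustrative reading of the definition given in the text, not the authors' code: it assumes the heat map is stored as a nested mapping in which the diagonal entry for fold i holds its fold-specific true positive rate and the off-diagonal entry (i, j) holds the fold-specific false positive rate, and it assumes that all folds and fold pairs in the area are weighted equally when averaging. The function name and the numerical values in the usage note are hypothetical.

```python
def confusion_index(heat_map, folds):
    """Mean off-diagonal rate (FPR) divided by mean diagonal rate (TPR) over `folds`.

    heat_map[i][j] is assumed to hold the fold-specific true positive rate
    when i == j and the fold-specific false positive rate otherwise.
    """
    folds = list(folds)
    if len(folds) < 2:
        raise ValueError("need at least two folds to measure confusion")
    tpr = [heat_map[i][i] for i in folds]
    fpr = [heat_map[i][j] for i in folds for j in folds if i != j]
    mean_tpr = sum(tpr) / len(tpr)
    mean_fpr = sum(fpr) / len(fpr)
    if mean_tpr == 0:
        raise ValueError("mean true positive rate is zero; index undefined")
    return mean_fpr / mean_tpr
```

For example, with made-up rates `{"x.1": {"x.1": 1.0, "x.2": 0.5}, "x.2": {"x.1": 0.25, "x.2": 0.5}}`, the call `confusion_index(heat_map, ["x.1", "x.2"])` gives 0.375 / 0.75 = 0.5, meaning that domains of the two folds are flagged as similar across folds half as often as within folds.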

Causes of false negatives and false positives

In the Results section we presented several examples of false negative and false positive cases related in one way or another to the common core. SCOP defines the common core of domains in the same fold to have the "same secondary structure elements in the same arrangement with the same topological connections" (Brenner et al

Variation of the common core of domains within a fold, considered insignificant by SCOP, may still be large enough to cause VAST and SHEBA to find the domains dissimilar, giving rise to false negatives as in Figures

When two domains share an apparent common core, but SCOP judges the core elements to be significantly different, SCOP places the domains in distinct folds. However, the automatic methods may find the domains similar, as in Figure

VAST and SHEBA decide on the similarity on the basis of the largest fraction of matching secondary structural elements or residues. However, visual inspection may allow the overall context of the matching and mismatching parts to play a role. If only a small part matches, but the matching part appears to be the core of each structure, then the match may appear more meaningful. If the number of repeats in a structure appears to be an important property of the structure, structures with different numbers of repeats may be placed in different folds. If, on the other hand, the precise number of repeats is not important for a structure, structures with different numbers of repeats are all placed in the same fold. If almost all parts match, but some important part, perhaps one critical beta-strand or even an irregular loop, is missing or placed differently in one structure, it may be placed in a different fold, etc.

It is possible that the problem is rooted, in part, in the way structural alignment is currently conceived. Analogous to sequence alignment methodology, structural alignment maximizes the match between two structures, at the residue or secondary structure level, to infer a similarity relationship. The concept of similarity implicitly defined by SCOP, on the other hand, is focused on the sharing of higher level (above SSE) motifs. This is in contrast to similarity measures based on the residue or SSE-level matches as defined by many structure comparison methods. We have shown examples (beta propellers, or alpha-solenoids) where occurrence of a motif is more appropriate for inferring similarity than is the maximum residue or SSE-level structural match. Although not evaluated directly here, we suspect that the structural comparison methods agree with SCOP when these two concepts agree, i.e. when the motif in question coincides with the maximum residue or SSE-level structural match, but disagree otherwise. Automatic structural similarity measures might thus be improved either by incorporating higher level structural motifs such as barrels or sheets, rather than remaining at the level of residues, strands or helices, or by weighting matching residues according to their structural context or functional importance.

Problems encountered by structural comparison methods might also be a reflection of intrinsic properties of the protein fold space. We have reported examples which tend to support the idea of structural drift

Conclusion

The results of this comprehensive comparison of VAST and SHEBA with the SCOP classification demonstrate that these two methods in their present form can reproduce at best 75% of the SCOP fold classification (at a 1% false positive rate). Our detailed study of over 20 million pairs of protein domains underlines the difficulties encountered by automatic methods analyzing a classification of protein structures. A major difficulty arises from structural variation, which naturally accompanies amino acid sequence divergence, within the core of a defined fold. When severe enough, this can produce false negatives. When common cores of different folds are too similar, false positives result. Another, though less common, difficulty arises when a motif is repeated several times within a single domain and in variable numbers. When the defining "common core" corresponds to only a small part of a whole structure, or when the core is decorated extensively, automatic recognition of its similarity to other fold members becomes difficult. These divergences suggest a continuous rather than a discrete protein fold space, further complicating the problem of automatic classification. Clearly, improved comparison algorithms must be developed and/or other types of classification must be considered; both will be addressed in future work.

Methods

Structural comparison methods

VAST is a method to superimpose and compare protein 3D structures. It consists of a two-stage procedure. The first stage is based on a high-level description of protein structures. Secondary Structure Elements (SSEs) are represented by vectors, and an algorithm based on a maximum clique search finds the best one-to-one correspondence between a set of vectors in a query structure and a set of vectors in a target structure. Special care is paid to the significance of the one-to-one correspondence found between the two 3D structures. The method calculates the probability of generating a similar one-to-one correspondence by chance, and correspondences are then ranked and selected according to the value of this probability. Results of the first stage are used as seeds for the second stage.
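The idea of finding a one-to-one correspondence of SSE vectors through a clique search can be sketched on a correspondence graph. The sketch below is not VAST: the compatibility test (similar inter-axis angle and similar midpoint distance), the tolerances `max_angle` and `max_shift`, the dictionary layout of an SSE and all function names are assumptions made for illustration, and an exact Bron-Kerbosch enumeration is substituted for VAST's heuristic search. SSE order along the chain and the statistical significance of the match are not considered.

```python
import math
from itertools import combinations


def _axis(sse):
    return tuple(e - s for s, e in zip(sse["start"], sse["end"]))


def _midpoint(sse):
    return tuple((s + e) / 2 for s, e in zip(sse["start"], sse["end"]))


def _geometry(a, b):
    """Inter-axis angle (degrees) and midpoint distance of two SSE vectors."""
    u, v = _axis(a), _axis(b)
    cosine = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
    return angle, math.dist(_midpoint(a), _midpoint(b))


def correspondence_graph(query, target, max_angle=30.0, max_shift=3.0):
    """Nodes pair same-type SSEs (i in query, j in target); edges join compatible nodes."""
    nodes = [(i, j) for i, q in enumerate(query)
             for j, t in enumerate(target) if q["kind"] == t["kind"]]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # keep the correspondence one-to-one
        q_angle, q_dist = _geometry(query[i], query[k])
        t_angle, t_dist = _geometry(target[j], target[l])
        if abs(q_angle - t_angle) <= max_angle and abs(q_dist - t_dist) <= max_shift:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def maximum_clique(adj):
    """Exact Bron-Kerbosch enumeration; returns one largest clique, sorted."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        for node in list(candidates):
            expand(clique + [node], candidates & adj[node], excluded & adj[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand([], set(adj), set())
    return sorted(best)
```

Each clique in this graph is a set of SSE pairs whose pairwise geometry is mutually consistent in both structures, so the largest clique corresponds to the largest consistent one-to-one correspondence under the assumed criteria. Exact enumeration is exponential in the worst case, which is one reason a heuristic search is attractive for database-scale comparisons.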

In the second stage, proteins are described using the alpha carbons (CAs) of the residues. The algorithm, based on a Gibbs-Monte Carlo procedure, tries to extend the alignment of the initial seed to CAs belonging to the connecting loops. Usually, one wants to find the alignment that includes the maximum number of CAs with the smallest possible root mean square deviation (RMSD). Unfortunately, the number of CAs included in the alignment and the RMSD are correlated: the larger the number of residues, the higher the resulting RMSD. The second stage is intended to resolve this trade-off by answering questions such as: which alignment is better, one having 60 superimposed residues with an RMSD of 2.0, or one having 80 superimposed residues with an RMSD of 2.5? This question is settled by choosing the alignment least likely to occur by chance, based on a

SHEBA is a protein structure comparison program which performs pairwise protein structure alignment in two steps. In the first step, the initial alignment is made by maximizing the weighted sum of scores for sequence homology, secondary structure similarity, and similarity of the environment profile. The environment profile includes the solvent accessibility and polarity of the atoms around a given residue. In the second step, the alignment is iteratively refined: the structures are superimposed in three dimensions according to the current alignment, and a new alignment is obtained from the superimposed structures using a dynamic programming procedure that maximizes the number of residue pairs whose CA-CA distance is less than 3.5 Å.
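The dynamic programming idea in the refinement step can be illustrated with a minimal sketch. It is not SHEBA's actual code: the function name `max_close_pairs` is hypothetical, the coordinates are assumed to be already superimposed, and the only objective is the count of order-preserving residue pairs closer than the cutoff.

```python
import math

def max_close_pairs(ca_a, ca_b, cutoff=3.5):
    """Order-preserving alignment maximizing the number of CA pairs
    closer than `cutoff` (coordinates assumed already superimposed)."""
    n, m = len(ca_a), len(ca_b)
    # dp[i][j] = best pair count using the first i and first j residues
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = math.dist(ca_a[i - 1], ca_b[j - 1]) < cutoff
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + (1 if close else 0))
    # Trace back to recover the aligned residue index pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        close = math.dist(ca_a[i - 1], ca_b[j - 1]) < cutoff
        if close and dp[i][j] == dp[i - 1][j - 1] + 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Toy coordinates (hypothetical), in Å
toy_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
toy_b = [(0.5, 0.0, 0.0), (4.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_alignment = max_close_pairs(toy_a, toy_b)
```

In an iterative scheme of the kind described above, the returned pairs would be used to recompute the superposition, and the procedure repeated until the alignment no longer changes.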

For each pair of proteins compared, SHEBA computes the

where <

Analysis of structure comparison methods

We consider a set of protein structural domains, partitioned into the folds {F_{i}} of the SCOP classification. A query domain and a target domain are defined to be structurally similar if and only if both belong to the same SCOP fold F_{i}. Under this definition, structural similarity is an all-or-none phenomenon, as judged by SCOP, which is used as the reference.

Structure comparison methods are said to detect the structural similarity between a query domain

We proceed as follows. First, the similarity scores

ROC analysis

The four possible outcomes for a particular domain

The four possible outcomes of ROC analysis for a particular domain.

| | Domain in the same fold as the query | Domain **not** in the same fold as the query |
|---|---|---|
| Domain detected as similar to the query | **True Positive** | **False Positive** |
| Domain **not** detected as similar to the query | **False Negative** | **True Negative** |

The True Positive Rate,

where _{i }>1), _{i }is the number of domains within a fold _{i}, and I(·) is the indicator function, i.e. I(TRUE) = 1 and I(FALSE) = 0.

Likewise, the False Positive Rate, the rate at which a domain

The specificity of the method for SCOP is [1 -

The ROC curve and the area under the ROC curve

The ROC curve is obtained by plotting the True Positive Rate
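The quantities used in the ROC analysis can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a pair counts as "detected" when its score is at least the cutoff, rates are pooled over all scored pairs, and the names `rates`, `roc_curve`, `auc` and the toy data are hypothetical.

```python
def rates(scores, fold_of, cutoff):
    """Return (TPR, FPR) when a pair is 'detected' if score >= cutoff."""
    tp = fp = fn = tn = 0
    for (query, target), score in scores.items():
        same_fold = fold_of[query] == fold_of[target]
        detected = score >= cutoff
        if same_fold and detected:
            tp += 1
        elif same_fold:
            fn += 1
        elif detected:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def roc_curve(scores, fold_of):
    """(FPR, TPR) points obtained by sweeping the cutoff over observed scores."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores.values()), reverse=True):
        tpr, fpr = rates(scores, fold_of, cutoff)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy data (hypothetical): four domains in two folds, six scored pairs
toy_fold_of = {"a1": "F1", "a2": "F1", "b1": "F2", "b2": "F2"}
toy_scores = {("a1", "a2"): 9.0, ("b1", "b2"): 4.0, ("a1", "b1"): 5.0,
              ("a1", "b2"): 1.0, ("a2", "b1"): 2.0, ("a2", "b2"): 0.5}
toy_auc = auc(roc_curve(toy_scores, toy_fold_of))
```

A score cutoff for a target false positive rate, such as the 1% rate used in this study, could then be read off by scanning the points returned by `roc_curve`.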

Confusion matrix heat map

The performance of a similarity detection method can be studied within specific folds or fold pairs. Thus, we define a fold specific false positive rate between two different folds, _{i,j}(_{i }are detected to be similar to target domains in _{j}. We estimate this rate from our data as

We see this as a confusion in the similarity detection.
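The fold-pair rates can be tabulated as in the sketch below. It assumes that scores are stored per ordered (query, target) pair and that detection means a score at or above the cutoff; the function name `confusion_matrix` and the toy data are hypothetical. Entries for two different folds estimate the fold specific false positive rate, and entries for a fold paired with itself estimate its true positive rate.

```python
from collections import defaultdict

def confusion_matrix(scores, fold_of, cutoff):
    """Fraction of detected pairs for each (query fold, target fold)."""
    detected = defaultdict(int)
    total = defaultdict(int)
    for (query, target), score in scores.items():
        key = (fold_of[query], fold_of[target])
        total[key] += 1
        if score >= cutoff:  # assumption: detection means score >= cutoff
            detected[key] += 1
    return {key: detected[key] / total[key] for key in total}

# Toy data (hypothetical): four domains in two folds, six scored pairs
cm_fold_of = {"a1": "F1", "a2": "F1", "b1": "F2", "b2": "F2"}
cm_scores = {("a1", "a2"): 9.0, ("b1", "b2"): 4.0, ("a1", "b1"): 5.0,
             ("a1", "b2"): 1.0, ("a2", "b1"): 2.0, ("a2", "b2"): 0.5}
cm = confusion_matrix(cm_scores, cm_fold_of, cutoff=4.5)
```

Rendering such a dictionary as a two-dimensional color map, with folds on both axes, gives a heat map of the kind discussed in this section.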

When _{i}, estimated as

The confusion matrix, defined by _{i}(_{i,j}(

The _{i}

The true positive rate averaged over a subset of folds,

where _{s }is the total number of domains represented in set

A _{S}/_{S }for a set

An alternate, but straightforward definition of the _{i}

Datasets

The set of SCOP domains considered here are drawn from ASTRAL_{i}, we study only the reduced data set of 468 folds containing 2 or more domains, which together contain

All domain pairs drawn from the reduced dataset were compared by both VAST and SHEBA, corresponding to a total number of pairs of

Abbreviations

AUC Area Under the ROC Curve

CA Alpha Carbon

CATH Hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (**C**), Architecture (**A**), Topology (**T**) and Homologous superfamily (**H**).

DALI Distance mAtrix aLIgnment

FPR False Positive Rate

NCBI National Center for Biotechnology Information

PDB Protein Data Bank

RMSD Root Mean Square Deviation

ROC Receiver Operating Characteristic

SCOP Structural Classification of Proteins

SHEBA Structural Homology by Environment-Based Alignment

SSE Secondary Structure Element

TPR True Positive Rate

VAST Vector Alignment Search Tool

Authors' contributions

VS, PM – execution of pairwise comparisons using VAST on the Biowulf computer, application of ROC methodology, development of confusion matrix heat maps, statistical analysis.

JFG, JG – development of the VAST program, interpretation of confused structure pairs.

CHT, BKL – development of the SHEBA program and pairwise comparisons, 3D structure visualization, interpretation of confused structure pairs.

Appendix. VAST statistics: calculation of Pcli

In the first stage of VAST we consider a "high" level description of proteins. Proteins are represented by their secondary structure elements (SSEs), more specifically by the endpoints of vectors going through these SSEs. The basic task of the algorithm is to find the best 3D common substructure.

A 3D common substructure is formally defined as a one-to-one correspondence between a subset of SSE vectors in the first protein and a subset of the SSE vectors in the second protein. This correspondence respects the type of SSE (i.e., helices are only paired with helices and strands with strands) and the topology. A correspondence {(

Computation of the score for a common 3D substructure

The problem of searching for 3D common substructures is next transformed into a graph theory problem. A "comparison" graph is formed whose vertices are made of pairs of vectors, one from each protein to be compared. Two such vertices are connected by an edge if the two vectors in the first protein have the same relative orientation and spacing, within some tolerance, as the two vectors in the second. Each edge is labeled by a score
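The graph formulation can be made concrete with a small sketch. It is a toy illustration, not the VAST implementation: the geometric test (angle between vectors and distance between midpoints), the tolerance values, the order-preservation rule used for topology, and all function names are assumptions, and the search is a plain Bron-Kerbosch enumeration without scoring.

```python
import math
from itertools import combinations

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _geometry(a, b):
    """Angle (degrees) between two SSE vectors and distance between midpoints."""
    (_, a0, a1), (_, b0, b1) = a, b
    u, v = _sub(a1, a0), _sub(b1, b0)
    cos = sum(x * y for x, y in zip(u, v)) / (_norm(u) * _norm(v))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
    mid_a = tuple((x + y) / 2 for x, y in zip(a0, a1))
    mid_b = tuple((x + y) / 2 for x, y in zip(b0, b1))
    return angle, _norm(_sub(mid_a, mid_b))

def comparison_graph(p, q, ang_tol=30.0, dist_tol=3.0):
    """Vertices pair same-type SSEs; edges join geometrically consistent pairs."""
    verts = [(i, j) for i, a in enumerate(p) for j, b in enumerate(q)
             if a[0] == b[0]]
    adj = {v: set() for v in verts}
    for (i, j), (k, l) in combinations(verts, 2):
        # one-to-one and order-preserving (assumed topology rule)
        if i == k or j == l or (i < k) != (j < l):
            continue
        ang_p, d_p = _geometry(p[i], p[k])
        ang_q, d_q = _geometry(q[j], q[l])
        if abs(ang_p - ang_q) <= ang_tol and abs(d_p - d_q) <= dist_tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def max_clique(adj):
    """Largest clique found by basic Bron-Kerbosch enumeration."""
    best = []
    def expand(r, cand, excl):
        nonlocal best
        if not cand and not excl:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(cand):
            expand(r + [v], cand & adj[v], excl & adj[v])
            cand = cand - {v}
            excl = excl | {v}
    expand([], set(adj), set())
    return sorted(best)

# Toy proteins (hypothetical): SSE = (type, start point, end point)
prot_a = [("H", (0, 0, 0), (10, 0, 0)), ("E", (0, 5, 0), (0, 12, 0)),
          ("E", (4, 5, 3), (4, 12, 3))]
prot_b = [(t, tuple(c + 50 for c in s), tuple(c + 50 for c in e))
          for t, s, e in prot_a]  # same structure, translated
toy_clique = max_clique(comparison_graph(prot_a, prot_b))
```

Exhaustive enumeration is only practical for small graphs; it is used here to keep the sketch short.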

Computation of the score for a 2-clique

The 2-clique score, _{10 }(_{10}(.2) = 0.7. Therefore the smaller the RMSD between the 2 pairs of SSEs the larger the resulting score.

Computing the probability distribution for the best n-clique score

In the previous section, _{10}(

For example, assume that we found a common 3D substructure between 2 proteins, and it is a 6-clique with a score of 9.6. In order to determine the significance of this score one must compare it with a distribution of scores for randomly generated 6-cliques. The mean of the 6-clique score distribution is given by α·β = (6-1)/2.303 ≈ 2.171, 5 times larger than the mean of a 2-clique score, 0.434. As
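The means quoted above can be checked numerically. The sketch below assumes that an n-clique score behaves like a sum of n-1 independent terms, each equal to -log10 of a probability drawn uniformly from (0, 1]; this is an illustrative model consistent with the quoted means, not necessarily the exact distribution used by VAST, and the function names are hypothetical.

```python
import math
import random

def mean_clique_score(n):
    """Mean of a gamma distribution with shape n-1 and scale 1/ln(10)."""
    return (n - 1) / math.log(10)

def simulated_mean(n, trials=100_000, seed=0):
    """Monte Carlo estimate of the mean under the assumed model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # 1 - random() lies in (0, 1], so the logarithm is always defined
        total += sum(-math.log10(1.0 - rng.random()) for _ in range(n - 1))
    return total / trials

mean_2 = mean_clique_score(2)  # about 0.434
mean_6 = mean_clique_score(6)  # about 2.171
```

Under this model the mean grows linearly with clique size, which is why a raw score must be compared with the distribution for cliques of the same size.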

Number of n-cliques that can be generated with a particular pair of proteins

For the sake of simplicity, we consider proteins having only one type of SSE, for instance, helices (when both proteins contain helices and strands, the problem of estimating

Calculation of Pcli for the best clique

Because C(

This approximation is valid when

Pcli = -log_{10}(EPcli).
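The arithmetic behind Pcli can be sketched as follows. It assumes that EPcli is approximated by the product of the number of possible n-cliques and the chance probability of a single clique reaching the observed score, which is the small-probability limit of 1 - (1 - p)^N; the function names are hypothetical. The example uses the orders of magnitude quoted in the remark that follows (probability about 10^-5.3, about 10^7 possible 7-cliques).

```python
import math

def epcli(p_single, n_cliques):
    """Return (exact, approximate) probability that at least one of
    n_cliques independent cliques reaches the score by chance."""
    exact = -math.expm1(n_cliques * math.log1p(-p_single))  # 1 - (1 - p)^N
    approx = n_cliques * p_single                           # valid when N*p << 1
    return exact, approx

def pcli(ep):
    """Pcli = -log10(EPcli)."""
    return -math.log10(ep)

# Orders of magnitude from the 7-clique example (illustrative values)
exact_7, approx_7 = epcli(10 ** -5.3, 10 ** 7)
pcli_7 = pcli(approx_7)  # negative, because the approximation exceeds 1
```

When the product N·p is not small, the approximation can exceed 1 and Pcli becomes negative, while the exact expression stays bounded by 1.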

Remark on the calculation of Pcli

Two types of problems occur. The first one is related to the number of elements of the clique with respect to the number of secondary structure elements (SSEs) found in the 2 domains being compared. To illustrate let us consider the 7-element clique that is generated when comparing domains d1amx_ and d1h6fa_ (fold b.2) having 14 and 19 SSEs, respectively. The score of this 7-clique, calculated according to Eq. 1 is 8.6. The probability of generating a 7-clique having a score ^{-5.3}. The number of 7-cliques that can be generated with the above two domains is C(^{+7}. This leads to an approximation of EPcli > 1, and hence

Acknowledgements

We thank Mr. Steve Fellini and Ms. Susan Chacko for their help with the Biowulf cluster, and Mr. Antej Nuhanovic for his contribution in making a version of the heat map publicly available. This research was supported in part by the Intramural Research Program of the NIH, Center for Information Technology and the National Cancer Institute.