Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, BioPark II, Room 617, 801 West Baltimore St, Baltimore, MD 21201, USA

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Abstract

Background

The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches.

Results

Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related—a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day.

Conclusions

This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.

Background

In order to provide rapid and sensitive annotation for protein sequences, including direct links to structural and functional information, the National Center for Biotechnology Information (NCBI) initiated the Conserved Domain Database (CDD)

The CDD is comprised of domain models either manually curated at the NCBI or imported from other alignment collections such as PFam

The automated annotation of functionally critical residues is an important outcome of these proposed procedures: Just as a large enzyme class conserves residues directly involved in catalysis, protein subgroups conserve residues likely involved in subgroup-specific biochemical properties and mechanisms. Our procedures use statistical criteria to glean this biochemical information from patterns of divergent residues among related sequences, much as classical geneticists use statistical criteria to glean information from patterns of divergent traits among related individuals. (To ensure that pattern residues are functionally important, we focus on residues that are conserved across distinct phyla and thus for more than a billion years of evolutionary time.) By mapping various categories of pattern residues to corresponding PSSMs, BLAST searches against these improved CD profiles can reveal those residues most likely responsible for the specific biochemical and biophysical properties of a query protein. This can accelerate the pace of biological discovery by enabling researchers to obtain valuable clues regarding as-yet-unidentified protein properties through routine web-based BLAST searches.

Other methods that may be similarly described as addressing the protein subfamily classification problem find sequence clusters either based on pairwise similarity

Because our approach identifies residues associated with protein functional divergence, it is also related to "functional subtype" prediction (FSP) methods

Problem definition and solution strategy

Here we address the following biological and algorithmic problem: We are given as input a (typically very large) multiple sequence alignment corresponding to a particular protein class. Our objective is to partition this alignment into a tree of sub-alignments, termed a CD hierarchy, each subtree of which corresponds to a sub-alignment of sequences sharing a certain pattern that most distinguishes them from those sequences associated with the parent node of the subtree and with any other subtrees attached to that parent node. We interpret these distinguishing residue signatures as associated with functional divergence of the protein class. As mentioned above, our focus is on obtaining an initial, suboptimal hierarchy that is comparable to current CDD curated hierarchies and that can serve as a starting point for further optimization using either manual or automated methods. Here we describe statistically-based heuristic procedures that, in conjunction with Bayesian sampling, can obtain such an initial hierarchy from a multiple sequence alignment.

Bayesian sampling over contrast alignment models

Our approach relies on Bayesian Markov chain Monte Carlo (MCMC) sampling

**Schematic drawing of a contrast alignment and the corresponding probability model.** Aligned sequences are assigned to either a ‘foreground’ or a ‘background’ partition (orange and gray horizontal bars, respectively). Partitioning is based on the conservation of foreground residues (blue vertical bars) that diverge from (or contrast with) the background residues at those positions (white vertical bars). Red vertical bar heights quantify the selective pressure imposed on divergent residue positions. Below this is given the logarithm of the corresponding probability distribution for the possible sequence partitions and corresponding discriminating patterns, which together serve as the random variables over which sampling occurs. **X** is the input alignment, in which each x_{ij} is a 20-dimensional vector of all 0’s except for a lone ‘1’ indicating the observed residue type; **R** is a vector indicating which rows (i.e., sequences) belong to the foreground (R_{i} = 1) or background (R_{i} = 0) partitions; **C** is a vector indicating which columns do (C_{j} = 1) or do not (C_{j} = 0) differentiate the foreground from the background; **Θ** is an array of vectors representing the amino acid compositions at each column position for each partition; a further vector, subscripted A_{j}, specifies the pattern residues at position j; and θ_{j} corresponds to the overall (foreground and background) composition. The third through sixth terms in the equation correspond to the logarithm of the product of the prior probabilities, with the first two priors (the second being p(**Θ**)) defined by the beta and product Dirichlet distributions, respectively, and with p(**R**) and p(**C**) defined by independent Bernoulli distributions; prior definitions are as shown (in parentheses). The log-likelihood ratio (LLR) is computed by subtracting from the log-probability for the observed contrast alignment the log-probability for a ‘null’ contrast alignment, in which all of the sequences are assigned to the background partition.
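The random variables named in this caption can be made concrete with a small data-structure sketch. The Python below is only an illustration under stated assumptions: the toy alignment, the helper names `one_hot` and `column_counts`, and the chosen values of R and C are all hypothetical, and no part of the probability model itself is computed.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Return the 20-dimensional indicator vector for one residue."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]

def column_counts(alignment, rows, col):
    """Sum the one-hot vectors of the chosen rows at one column."""
    counts = [0] * len(AMINO_ACIDS)
    for i in rows:
        for k, bit in enumerate(one_hot(alignment[i][col])):
            counts[k] += bit
    return counts

# Toy alignment (hypothetical data): 4 sequences (rows), 3 columns.
alignment = ["KDG", "KDA", "REG", "REA"]
R = [1, 1, 0, 0]   # rows 0-1 in the foreground, rows 2-3 in the background
C = [1, 1, 0]      # columns 0-1 are pattern positions, column 2 is not

foreground = [i for i, r in enumerate(R) if r == 1]
background = [i for i, r in enumerate(R) if r == 0]
pattern_columns = [j for j, c in enumerate(C) if c == 1]

# Residue counts per partition at each pattern position.
fg_counts = {j: column_counts(alignment, foreground, j) for j in pattern_columns}
bg_counts = {j: column_counts(alignment, background, j) for j in pattern_columns}
```

In this toy case the foreground conserves K and D at the two pattern positions while the background carries R and E, which is the kind of contrast the model is designed to score.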

Multiple category functional divergence models

More recently, a multiple category (mc)BPPS sampler was developed

Each row of a FD-table corresponds to a distinct functionally divergent subgroup of the input sequences and each column corresponds to a distinct contrast alignment whose foreground and background partitions are specified by the symbols in the table. Such a table is shown in Figure

**A multiple category model optimized by the mcBPPS sampler.** **(top)** A tree representing the hierarchical relationships between functionally-divergent protein subgroups. Color code: internal nodes, blue; leaf nodes, red. Each subtree within the tree (i.e., each node and its descendants) corresponds to a set of sequences that generally conserve a pattern that sequences in the rest of the tree generally lack. For example, node 5 could represent a subfamily whose family, superfamily and class are represented by the subtrees rooted at nodes 4, 2 and 1, respectively. **(middle)** The corresponding functional divergence (FD-)table. A tree is converted into a FD-table, as follows: The subtree rooted at each node of the tree corresponds to the foreground (‘+’ rows) for that column in the table, whereas the rest of the subtree rooted at the parent of that node corresponds to the background (‘-’ rows). (A set of randomly-generated sequences serves as the background for the root node.) Each internal node in the tree corresponds to a miscellaneous category, that is, to sequences sharing a common pattern with, but lacking patterns specific to, each of its descendant subtrees. **(bottom)** Contrast alignment corresponding to column 4 of the table. Each subgroup corresponding to a row with a ‘+’ or a ‘-’ symbol in that column is assigned to the foreground or background, respectively; subgroups with an ‘o’ symbol are omitted from that contrast alignment.
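The tree-to-table conversion described in this caption can be sketched in a few lines of Python. This is a minimal illustration rather than the program's own code: the function names are hypothetical, and the example tree keeps the caption's chain of nodes 5, 4, 2 and 1 but adds nodes 3 and 6 as assumptions so that every background is non-empty.

```python
def subtree(children, node):
    """Return the set of nodes in the subtree rooted at `node`."""
    nodes = {node}
    for child in children.get(node, []):
        nodes |= subtree(children, child)
    return nodes

def tree_to_fd_table(parent):
    """Map a {node: parent} tree to {row_node: {column_node: symbol}}."""
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)
    table = {row: {} for row in parent}
    for col, par in parent.items():
        foreground = subtree(children, col)
        # The root's background is a set of random sequences, so no row gets '-'.
        background = set() if par is None else subtree(children, par) - foreground
        for row in parent:
            if row in foreground:
                table[row][col] = "+"
            elif row in background:
                table[row][col] = "-"
            else:
                table[row][col] = "o"
    return table

# Hypothetical tree: node 5 sits under 4, which sits under 2, under root 1.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 4, 6: 4}
fd_table = tree_to_fd_table(parent)
for row in sorted(fd_table):
    print(row, " ".join(fd_table[row][col] for col in sorted(fd_table)))
```

Each row here stands for the sequences assigned to that node itself, so the row of an internal node plays the role of the miscellaneous category mentioned in the caption.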

Here we describe and apply an automated multiple category (amc)BPPS program that generates its own FD-table and seed sequences automatically and therefore merely requires a multiple sequence alignment as input. The number and nature of the partitions and the patterns are completely determined by the program. When used in conjunction with procedures for viewing structural interactions involving pattern residues, the amcBPPS sampler automates and enhances the creation and annotation of CDD hierarchical alignments. And, when linked into web-based BLAST searches, this can make previously inaccessible molecular information widely available.

Results and discussion

In this section, we lay out the basic amcBPPS algorithm, illustrate an implementation of the algorithm as applied to P-loop GTPases, compare its performance against various manually-curated CD hierarchies, further evaluate its performance using both delete-half jackknifing and simulations, and apply it to several large protein classes for which existing hierarchies or alignments are currently unavailable.

Algorithm

The amcBPPS algorithm aims to identify the hierarchical relationships between functionally-divergent subgroups within an entire protein domain class based on the differentiating patterns present in that class. It does this by defining: (i) the number of sequence sets, (ii) the members of each set, (iii) the hierarchical relationships between sets and (iv) the corresponding functionally divergent patterns. This is accomplished in three steps. Steps 1 and 2 constitute the novel aspect of the program by providing input to the mcBPPS sampler in Step 3; these first two steps also speed up convergence in Step 3 by providing a better starting point for the mcBPPS sampler (the algorithmic details of which are described in

**The amcBPPS procedural substeps used to obtain a hierarchy from a multiple alignment.** Starting from a multiple sequence alignment for a particular protein domain, the amcBPPS program applies the following substeps (‘a’ to ‘e’) to create a domain hierarchy. Note that substep (a) corresponds to Step 1 of the amcBPPS algorithm whereas the other substeps correspond to Step 2. (**a**) Use heuristic procedures to create distinct FD-tables, corresponding to a forest of simple (rooted, branchless) trees; each leaf of a given tree corresponds to a distinct subgroup within the protein class. (The mcBPPS sampler is used to optimally assign sequences to each leaf node; different prior probability settings can be used to favor convergence on subfamilies, families or superfamilies.) (**b**) Select leaf nodes from the forest corresponding to more or less distinct, functionally divergent subgroups; this is done by combining each set of nearly identical nodes into a single set. Define a root node (labeled R in the figure) corresponding to the universal sequence set. Larger superfamily nodes (labeled with red integers) also are created from related leaf nodes. The haze around nodes indicates the partially-overlapping nature (i.e., fuzziness) of the corresponding sequence sets. (**c**) Generate a directed acyclic graph (DAG) representing superset-to-subset relationships between nodes and with arcs weighted by (the negative of) the corresponding log-likelihood ratios (LLRs) associated with the BPPS statistical model. For clarity, nodes and arcs directly connected to the root are shown in orange whereas other (non-root) nodes are uniquely colored. (**d**) Obtain from the DAG a shortest path spanning tree using a breadth-first scanning algorithm. (**e**) Prune nodes that both are directly attached to the root and significantly overlap with other nodes and thus correspond to ill-defined sequence sets. For the remaining nodes, remove the overlap between their corresponding sequence sets (see text for details) and prune from the tree those nodes that lack a minimum number of sequences (30 by default). This typically yields a reduced hierarchy (as shown), which is converted into a FD-table (as illustrated in Figure
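Substeps (c) and (d) can be illustrated with a queue-based label-correcting search, which is one common reading of "breadth-first scanning". The sketch below is an assumption-laden illustration, not the program's implementation: it takes path length to be the sum of arc weights, sets each weight to the negative of a made-up LLR, and uses hypothetical node names.

```python
from collections import deque

def shortest_path_tree(arcs, root):
    """Return {node: parent} for a shortest-path tree rooted at `root`.

    `arcs` maps each superset node to {subset_node: weight}. Weights may be
    negative, which is safe here only because the graph is acyclic.
    """
    dist = {root: 0.0}
    parent = {root: None}
    queue = deque([root])
    queued = {root}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        for v, w in arcs.get(u, {}).items():
            if v not in dist or dist[u] + w < dist[v]:
                dist[v] = dist[u] + w   # found a shorter path to v
                parent[v] = u
                if v not in queued:     # rescan v so its arcs are relaxed again
                    queue.append(v)
                    queued.add(v)
    return parent

# Hypothetical LLRs (nats) on superset -> subset arcs; arc weight = -LLR.
llr = {"R": {"A": 50.0, "B": 40.0, "C": 20.0}, "A": {"C": 30.0}, "B": {"C": 15.0}}
arcs = {u: {v: -x for v, x in nbrs.items()} for u, nbrs in llr.items()}
tree = shortest_path_tree(arcs, "R")
```

With these toy values node C attaches under A rather than directly under the root, because the path through A accumulates the largest total LLR.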

Identifying simple subgroups (Step 1)

Step 1 of the amcBPPS algorithm (represented by the arrows labeled ‘a’ and ‘b’ in Figure

For each of these FD-tables (and the corresponding seed sequences) the mcBPPS sampler assigns each of the multiply aligned input sequences to a subgroup (as specified by the rows in the table) and determines the differentiating conserved pattern for each contrast alignment (as specified by the columns in the table). To ensure that subgroups at different levels of the hierarchy are identified, the algorithm performs multiple runs using various numbers of leaf nodes and various prior probability settings for _{l} = 0.01), by setting the (beta distributed) prior probability, _{0} = 9 and _{0} = 1) and by raising the (beta distributed) prior probability that a column corresponds to a pattern position, _{j} = 0.01). The rationale for choosing these settings is that, for subfamilies, membership is more exclusive, sequences are more highly conserved and, consequently, conserved patterns more extensive. (Note, however, that, in the absence of such a rationale, non-informative priors are used by default (e.g., uniform beta and Dirichlet distributions) in order to maximize the influence of the data on model optimization.) Convergence on a superfamily is favored by specifying a single subgroup and by altering these prior parameter settings accordingly (where by default, _{l} = 0.2, _{0} = 1, _{0} = 1 and _{j} = 0.0001). Default settings are based on applications to actual protein sequences, though it should be noted that the influence of these prior settings is minor. Hence these priors primarily function as tuning parameters to help gently guide the sampler into finding a variety of functionally divergent subgroups. To avoid finding the same subgroup repeatedly, sequences assigned to a subgroup in a previous run are prohibited from being used as seeds in subsequent runs. Subfamilies can also be identified recursively; that is, by rerunning the program on a single subgroup in order to find subgroups within subgroups (though this approach is not used here). The pseudocode for this step of the amcBPPS algorithm is given in Methods.

Defining a hierarchy for the protein class (Step 2)

Once individual subgroup sets are identified in Step 1 (see arrow labeled ‘b’ in Figure

The mcBPPS sampler (Step 3) and further refinements

The output from Step 2 provides a starting point for mcBPPS sampling, which optimizes the patterns and partitions corresponding to the FD-table. The basic statistical and algorithmic aspects of the mcBPPS sampler were previously described

Implementation and testing

The amcBPPS algorithm was implemented in C++ (executables are available from the corresponding author), applied to various protein classes and the output compared to manually-curated CDD alignment hierarchies (when available). A wide range of CDD hierarchies—from preliminary to well-developed releases (as well as some out-of-date versions)—were examined in this way. Input multiple alignments were obtained by using the NCBI hierarchy of CD alignments as input to the MAPGAPS program

Illustrative example: P-loop GTPases. To familiarize the reader we begin by illustrating our approach with an analysis of P-loop GTPases. Using an input alignment of 198,624 P-loop GTPases, the amcBPPS program returned the FD-table shown in Figure

**Additional figures referred to in the main article as Figures S1–S6.**

**FD-table for P-loop GTPases.** The number of sequences in each subgroup is given in parentheses. Major subtrees are color coded.

| CDD Ident. | Protein superfamily | Number of seqs‡ | Length | Manually curated: nodes* | Manually curated: LLR† | Automatically generated: nodes* | Automatically generated: LLR† | Automatically generated: time§ |
|---|---|---|---|---|---|---|---|---|
| **cd00030** | C2 | 23,452 | 102 | 106 (103) | 236574 | 78 (73) | 223857 | 19.4 |
| **cd00138** | PLDc_SF | 16,765 | 119 | 105 (102) | 241766 | 36 (34) | 192876 | 10.0 |
| **cd00142** | PI3Kc_like | 2,409 | 219 | 22 | 34129 | 16 | 34563 | 4.5 |
| **cd00159** | RhoGAP | 4,815 | 169 | 39 (38) | 55604 | 32 | 53540 | 7.97 |
| **cd00173** | SH2 | 5,917 | 79 | 111 (101) | 49274 | 39 | 40075 | 3.5 |
| **cd00180** | Protein kinases | 104,912 | 215 | 280 (260) | 1378273 | 107 (104) | 1536991 | 241.0 |
| **cd00229** | SGNH_hydrolase | 14,635 | 187 | 30 | 180667 | 29 | 183822 | 14.95 |
| **cd00306** | S8/S53 peptidase | 10,960 | 241 | 36 | 161685 | 45 (44) | 173693 | 30.90 |
| **cd00368** | Molybdopterin-Binding | 9,540 | 374 | 26 | 177569 | 44 | 209704 | 39.3 |
| **cd00397** | DNA_BRE_C | 25,824 | 164 | 27 (26) | 187382 | 39 (37) | 211739 | 16.9 |
| **cd00761** | Glycosyltransferase A (GT-A) | 66,260 | 156 | 71 (70) | 944727 | 123 (110) | 1048396 | 193.8 |
| **cd00768** | Class II aaRS-like core | 37,160 | 211 | 17 | 674454 | 31 | 833691 | 54.3 |
| **cd00838** | MPP_superfamily | 33,753 | 131 | 61 | 402297 | 55 (54) | 399553 | 65.1 |
| **cd00900** | PH-like | 22,593 | 99 | 81 | 211812 | 99 (98) | 274945 | 52.3 |
| **cd01067** | Globin_like | 9,933 | 117 | 4 (1) | 11133 | 26 (25) | 73808 | 4.3 |
| **cd01391** | Periplasmic_Binding_Protein_1 | 36,330 | 269 | 142 (140) | 619713 | 68 (65) | 580753 | 169.1 |
| **cd01494** | AAT_I (Pyridoxal-PO4-binding) | 114,781 | 170 | 16 | 1086328 | 92 (84) | 2027660 | 249.67 |
| **cd01635** | Glycosyltransferase GTB | 44,366 | 229 | 45 | 723443 | 95 (93) | 881414 | 232.7 |
| **cd02156** | Class I aaRS-like core √ | 53,605 | 105 | 34 | 522962 | 61 (57) | 698273 | 41.4 |
| **cd02883** | Nudix_Hydrolase | 32,046 | 123 | 55 (54) | 321636 | 61 (60) | 367819 | 43.2 |
| **cd03128** | GAT-1 (mcBPPS vs pmcBPPS) | 46,514 | 92 | 34 (32) | 319515 | 64 (62) | 388621 | 42.2 |
| **cd03440** | hot_dog | 30,162 | 100 | 22 (18) | 141990 | 70 (69) | 345298 | 39.1 |
| **cd03873** | Zinc peptidases | 24,455 | 237 | 81 | 596408 | 69 (66) | 590521 | 43.9 |
| **cd05466** | Periplasmic_Binding_Protein_2 | 45,287 | 197 | 76 (73) | 523941 | 49 (41) | 411445 | 31.7 |
| **cd06587** | Glo_EDI_BRP_like | 36,165 | 112 | 60 (58) | 335848 | 94 (91) | 479522 | 54.8 |
| **cd06663** | Biotinyl-lipoyl | 25,013 | 73 | 4 | 53038 | 25 (18) | 66571 | 4.53 |
| **cd06846** | Adenylation_DNA_ligase_like | 3,833 | 182 | 14 | 43276 | 20 | 48475 | 4.8 |
| **cd08555** | PI-PLCc_GDPD_SF | 8,707 | 179 | 74 (73) | 143201 | 37 (32) | 123075 | 6.9 |
| **cd08772** | GH43_62_32_68 (β propellers) | 6,760 | 286 | 28 | 111336 | 51 (50) | 176701 | 30.0 |
| **cl09931** | Rossmann fold proteins | 424,764 | 93 | 361 (347) | 4110907 | 145 (130) | 4029120 | 757.2 |
| **Average** | | 44,057 | 167.7 | 66.4 | 486696 | 56.9 | 556884 | 83.6 |

‡ After removing identical sequences and sequences that fail to align with at least 75% of the domain.

\* Numbers in parentheses indicate the nodes retained after insignificant nodes were removed by the mcBPPS program.

† The log-likelihood ratio in nats.

§ The time (in minutes) is for Steps 2 and 3 of the algorithm only; Step 1 can be parallelized to run in less than 10% of the time shown.


Criteria for comparing hierarchies

To assess how well the amcBPPS program performs relative to curated CD hierarchies, we compared its output against 30 manually curated CDD hierarchies (Table

Lack of gold standards

CDD hierarchies have been carefully constructed by expert curators and therefore come the closest to a benchmark set for evaluating the amcBPPS sampler. However, as this study reveals, certain aspects of CDD hierarchies lack statistical support or are incomplete or incorrect for various reasons: For example, CDD hierarchies are typically at different stages in an ongoing refinement process, and, for protein domain classes consisting of tens or hundreds of thousands of sequences, the number of possible hierarchies to consider is astronomical, which makes optimization through manual curation extremely difficult. Furthermore, due to the stochastic nature of and the inability to directly observe evolutionary divergence, it is impossible to eliminate the inherent uncertainties associated with protein classification. Hence, for the present study our aim is merely to replicate the current manual curation process by generating hierarchies of comparable quality automatically, thereby dramatically speeding up the current labor-intensive curation process.

Comparison criteria for this analysis

Despite the absence of a gold standard, the statistical criteria used by the amcBPPS program provide a way to compare two hierarchies for the same conserved domain. It does this by determining objectively whether or not (and, if so, to what degree) the sequences in each protein subgroup have diverged from the evolutionarily related subgroups indicated by a specific hierarchy. This measure is expressed as a log-likelihood ratio (LLR), where non-positive values indicate a lack of statistical support for a functionally divergent event within the hierarchy. Such a comparison is performed as follows: We are given two heuristic methods for obtaining a (presumably suboptimal) hierarchy: one manual and one automatic. To compare the two methods, we first use each hierarchy (along with a corresponding multiple sequence alignment) as input to the mcBPPS sampler, which then optimizes the patterns and sequence partitions associated with that hierarchy and returns an optimized log-likelihood ratio (LLR). Because this optimizes the automatically-generated and manually-curated hierarchies in the same way based on the same statistical criteria, the only difference is that the hierarchies and seed alignments were obtained either automatically or through manual curation. Thus, by comparing their optimized LLR scores, we can obtain a measure of the relative performance of the two methods. In addition, we also determine the degree of overlap between the two hierarchies as a qualitative indication of the similarity of the two hierarchies. The results of such comparisons are summarized in Table

**Comparison of curated and automatically-generated hierarchies.** Hierarchies are shown as circular trees.

Evaluation of the amcBPPS program

To evaluate the amcBPPS program over a wide variety of input, we chose the 30 conserved domains given in Table

Comparisons with manually curated CDD hierarchies

Based on the LLR statistic the automatically-generated hierarchies (column 8) are comparable to the corresponding manually-curated hierarchies (column 6) and are, in fact, slightly better on average (556,884 nats versus 486,696 nats for the curated hierarchies). Manual and amcBPPS hierarchies (Figure

Most of the differences between the manual and automated hierarchies are due to fundamental differences between the two approaches (as revealed by examination of comparative analyses like the one shown in Additional File

Unsurprisingly, our analysis also indicates that the hierarchies obtained both manually and automatically are typically suboptimal. For example, manual and amcBPPS hierarchies for the S8/S53 peptidase domain (cd00306) had LLRs of 161,685 and 173,693 nats, respectively, whereas a hybrid hierarchy containing features of both of these has an LLR of 177,727 nats. Figure

**Improving a hierarchy by merging features of curated and amcBPPS hierarchies.** Shown are hierarchies for cd02156 in Table. (**A**) The original CDD hierarchy. (**B**) The automatically generated hierarchy. (**C**) A hybrid hierarchy created by incorporating features of both (**A**) and (**B**).

Delete-half jackknife analyses

A bootstrap or jackknife

For these analyses we found that, among the leaf node sets in one tree that share at least one sequence in common with a leaf node set in the other tree, on average 47% share precisely the same set of sequences (i.e., among those sequences present in both trees) and 74% share more than 90% of their sequences in common. Moreover, in most cases where an identical sequence set is not found, the missing sequences were typically assigned, not to unrelated leaf nodes, but either to a parent node further up the tree or to the rejected sequence set. Among the remaining cases, a node in one hierarchy is either split into multiple nodes or (in the worst case) split between nodes in the other hierarchy. At times a hierarchy could end up omitting certain nodes because the delete-half jackknife procedure removed sequences belonging to certain phyla, leaving insufficient phylogenetic diversity to seed the formation of a subgroup. Of course the topologies (shapes) of the jackknife trees found by the sampler also differ, which is a common problem associated with evolutionary trees consisting of large numbers of distantly related sequences. This is presumably due in large part to the amcBPPS algorithm failing to find the optimal topology, an issue that, in the future, we will address by sampling over alternative topologies. Of course, both this future sampler and the jackknife procedure applied here will be useful for identifying the most reliable features of a hierarchy. Taken together, these results confirm the observation we made in the previous section, namely that the amcBPPS program generally finds a suboptimal hierarchy that, nevertheless, provides a good starting point both for curation and further automation. Output from these jackknife analyses is available at
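The leaf-set agreement statistics quoted above can be approximated with a short set-overlap computation. The sketch below is one possible reading, not the authors' procedure: it assumes that "share more than 90% of their sequences" can be measured with a Jaccard index taken over sequences present in both trees, and the function name, the threshold parameter and the example sets are hypothetical.

```python
def leaf_set_agreement(leaves_a, leaves_b, threshold=0.9):
    """Compare the leaf sequence sets of two hierarchies.

    Returns (fraction identical, fraction sharing more than `threshold`)
    among leaf sets in A that overlap at least one leaf set in B.
    """
    shared_ids = set().union(*leaves_a.values()) & set().union(*leaves_b.values())
    identical = similar = compared = 0
    for set_a in leaves_a.values():
        a = set_a & shared_ids          # restrict to sequences in both trees
        best = 0.0
        for set_b in leaves_b.values():
            b = set_b & shared_ids
            if a & b:
                best = max(best, len(a & b) / len(a | b))  # Jaccard index
        if best > 0.0:
            compared += 1
            identical += best == 1.0
            similar += best > threshold
    if compared == 0:
        return 0.0, 0.0
    return identical / compared, similar / compared

# Hypothetical leaf sets from two jackknife trees (sequence ids are integers).
tree_a = {"a1": {1, 2, 3}, "a2": set(range(4, 24)), "a3": {40, 41}}
tree_b = {"b1": {1, 2, 3}, "b2": set(range(4, 23)), "b3": {23, 30}}
frac_identical, frac_similar = leaf_set_agreement(tree_a, tree_b)
```

In this example leaf a3 has no counterpart in the second tree and is excluded from the comparison, mirroring the restriction to leaf sets that share at least one sequence.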

Simulations

As an additional check, we implemented a procedure to generate simulated sequences from profile HMMs where each such profile corresponds to a node from one of the 24 domain hierarchies used in the jackknife analysis. The rationale for doing this was to determine how well the amcBPPS program identifies sequences corresponding to predefined subgroups. Note that this procedure captures sequence features of each subgroup, but not how those subgroups are hierarchically arranged. For each node of each hierarchy we generated the same number of aligned sequences as were assigned to that node in the original hierarchy. After running the amcBPPS program on each of these simulated alignments, we determined the degree to which each set of related simulated sequences was correctly modeled as belonging to a single subgroup. An example output file in (Additional File
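The idea of drawing one simulated sequence set per node can be illustrated with a deliberately simplified stand-in for a profile HMM. The Python sketch below samples each aligned column independently from residue frequencies and has no insert or delete states, so it is not the procedure used in the study; the profiles, node names and set sizes are all hypothetical.

```python
import random

def sample_from_profile(profile, n_sequences, seed=0):
    """Draw aligned sequences column by column from residue frequencies."""
    rng = random.Random(seed)
    sequences = []
    for _ in range(n_sequences):
        residues = [
            rng.choices(list(col.keys()), weights=list(col.values()))[0]
            for col in profile
        ]
        sequences.append("".join(residues))
    return sequences

# Hypothetical two-node example: each node has its own profile and size.
profiles = {
    "node_A": [{"K": 0.9, "R": 0.1}, {"D": 1.0}, {"G": 0.5, "A": 0.5}],
    "node_B": [{"R": 0.8, "K": 0.2}, {"E": 1.0}, {"G": 0.5, "A": 0.5}],
}
sizes = {"node_A": 5, "node_B": 3}  # sequences per node, as in the original hierarchy
simulated = {
    node: sample_from_profile(profiles[node], sizes[node], seed=i)
    for i, node in enumerate(profiles)
}
```

As in the text, such a simulation preserves the per-node sequence features (here, the invariant D or E at the second column) while carrying no information about how the nodes are arranged in the hierarchy.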

Time complexity

The computationally most intensive routine in Step 1 of the amcBPPS program is an all-versus-all pairwise comparison of pre-aligned sequences (with indels ignored). This has a time complexity of O(^{2}) = O(

The time complexity of Steps 2–3 is unclear based on the underlying algorithm. Therefore, using a plot of the run times for the amcBPPS analyses in Table ^{1.2}) (see Figure ^{1.2}), which admittedly may not be the case given our empirically-based approach, then whether or not O(^{1.2}) is better than O(^{4}. (Step 1 and Steps 2–3 are asymptotically identical when ^{1.2} which implies that ^{4}.) Step 1, which is O(^{4} and Steps 2–3, which is O(^{1.2}), is worse when ^{4}. Since for essentially all protein domains ^{4} the time complexity of the amcBPPS program (i.e., Steps 1–3) appears to be O(
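An empirical exponent of this kind is typically read off as the slope of a straight-line fit on log-log axes. The sketch below shows that calculation under the assumption that run time t grows as c·N^k with input size N, so that log t = log c + k·log N; the symbols, the function name and the data are illustrative (the run times are synthetic, generated with k = 1.2, and are not measurements from the paper).

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(c) + k*log(N); returns (c, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                        # slope = empirical exponent
    c = math.exp(mean_y - k * mean_x)    # intercept gives the constant factor
    return c, k

# Synthetic run times generated with exponent 1.2 (not data from the paper).
sizes = [1e5, 1e6, 1e7, 1e8]
times = [2e-6 * n ** 1.2 for n in sizes]
c, k = fit_power_law(sizes, times)
```

On real timings the fit would be noisy, so the correlation coefficient of the log-log regression is worth reporting alongside the slope.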

**Time complexity of Steps 2 and 3 of the amcBPPS program.** (**A**) Plot of run times versus the number of aligned residues in the input multiple alignment. Shown are data points from Table ^{k}, it follows that log (**B**) Plot of run times versus the number of aligned residues times the number of nodes in the hierarchy created in Step 2. This plot results in a slightly better fit (r = 0.98). The slope of the trend line is

Because the run times for Steps 2–3 are also likely to depend on the size of the hierarchy generated by the program in Step 2, Figure

Analysis of protein domains lacking CD hierarchies

There are a significant number of protein domains for which a CDD hierarchy has not yet been constructed. In some (though not all) cases a single curated alignment is available as a starting point. To test the performance of the amcBPPS program in such cases, we chose 10 domains, for which curated alignments were available, and two domains, for which we first constructed an alignment using Bayesian multiple alignment methods

| Identifier | Protein superfamily name | # seqs | # nodes amcBPPS | LLR | Run time§ |
|---|---|---|---|---|---|
| **Started from curated alignments:** | | | | | |
| **cd00075** | Histidine kinase-like ATPase c | 87,258 | 95 (62) | 518062 | 119.27 |
| **cd00130** | PAS | 50,200 | 117 (115) | 416375 | 103.95 |
| **cd00174** | SH3 | 13,890 | 44 (35) | 26971 | 3.83 |
| **cd00590** | RRM | 107,488 | 63 (56) | 557782 | 63.75 |
| **cd01427** | HAD-like hydrolases | 41,818 | 85 (73) | 324699 | 59.77 |
| **cd02440** | AdoMet_MTases | 150,872 | 112 (99) | 1417985 | 250.27 |
| **cd04301** | NAT-SF | 43,486 | 71 | 244420 | 23.30 |
| **cl02566** | SET (pfam00856) | 8,946 | 21 | 54230 | 2.58 |
| **cl10444** | P-loop GTPases‡ | 198,624 | 115 (109) | 3826672 | 464.67 |
| **none** | AAA+ ATPases‡ | 84,695 | 86 (85) | 1779227 | 173.73 |
| **Started from unaligned sequences:** | | | | | |
| **none** | α,β-hydrolase fold | 50,811 | 109 (104) | 752259 | 139.82 |
| **none** | Helicases | 86,287 | 117 (111) | 1935380 | 342.10 |

‡ For these, non-CDD curated alignments were used as input.

§ The time (in minutes) is for Steps 2 and 3 of the algorithm only.

Unaligned sequences were aligned using the multiple alignment procedures cited in Methods to generate an input alignment for the amcBPPS program.

Conclusions

Currently the construction and annotation of CD hierarchies relies on the labor intensive process of manual curation. This has created a bottleneck hindering the CDD

Of course, starting from the procedures described here, the CDD pipeline can be further automated and improved in various ways along similar lines. For example, we have demonstrated that our Bayesian alignment methods can be used to generate, for major protein classes (such as the AAA+ ATPases, α,β-hydrolase fold enzymes and helicases in Table

Having such a comprehensive set of well annotated, high quality CD profiles will summarize what is known about each type of domain. Through application of the MAPGAPS program, these CD hierarchies could be used to obtain up-to-date, very large and highly accurate multiple sequence alignments of an entire protein class for in-depth computational analyses. And by mapping various categories of pattern residues to corresponding structures, BLAST searches against these improved CD profiles can reveal those residues most likely responsible for the specific biochemical properties of a query protein. This can accelerate the pace of biological discovery by enabling researchers to obtain valuable clues regarding as-yet-unidentified protein biochemical and biophysical properties.

Methods

Protein sequences were obtained from the NCBI nr and env_nr databases and from translated EST sequences within the NCBI est_others database (for which only open reading frames of at least 100 residues in length were retained). The phylum and kingdom to which each of these sequences belonged were determined using the NCBI taxonomy database dump. For those protein classes in Table

Evaluation procedures

The amcBPPS program was evaluated (see Table

Pseudocode for Step 1

The following pseudocode, which focuses on Step 1, corresponds to the main amcBPPS function, after which routines implementing Steps 2 and 3 are called. The output from Step 1 is used to create (in Step 2) an FD-table and a set of seed sequences for mcBPPS sampling (in Step 3). Note that this Step 1 pseudocode creates single-category FD-tables, but it can easily be modified to create multiple-category FD-tables.

**function** amcBPPS(

**input:** a multiple alignment of protein sequences (

**output:** a hierarchy (tree) and corresponding contrast alignments (CHA).

//

**for each** sequence **do**
**end for**

dheap

**for each** sequence pair <_{1}, _{2}> **do**:

**if** the sequences are from the same phylum **then**

**if** sequences ≥ 95% identical **then** merge their disjoint sets **end if**

**else if** sequences ≥ 40% identical **then**

_{1} _{2}); //

Insert(_{1}, _{2}>,

**end if**

**end for**

//

**while** <_{1}, _{2}> := deleteMax( **do**

_{1}.rank := min(_{1}.rank); _{2}.rank := min(_{2}.rank);

**if** ¬_{1}.labeled ⋀ ¬_{2}.labeled ⋀ Set(_{1}) ≠ Set(_{2}) **then**

Set(_{1}) := Set(_{2}) := Set(_{1}) ∪ Set(_{2})

**if** NumPhyla(Set(_{1})) ≥ _{min} **then** // _{min}

**for each** _{1}) **do** **end for**

**for each** _{1}) **do**:

//

_{1}) ⋀

**end for**

//

**end if**

**end if**

**end while**

**return** mcBPPS(

**end function**
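The first loop of the Step 1 pseudocode can be illustrated with a minimal Python sketch. It is not the amcBPPS implementation: the names `DisjointSets`, `percent_identity` and `initial_pass` are hypothetical, identity is computed naively over ungapped aligned columns, and the heap key (percent identity) is an assumption, since the key used in the pseudocode is not recoverable from the text above.

```python
import heapq
from itertools import combinations


class DisjointSets:
    """Minimal union-find structure over the integers 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def initial_pass(seqs, phyla, same_phylum_cutoff=95.0, cross_phylum_cutoff=40.0):
    """Merge near-identical same-phylum sequences; queue cross-phylum pairs.

    Returns the disjoint sets and a max-heap (stored as negated keys)
    of cross-phylum pairs. Keying the heap on identity is an assumption.
    """
    sets = DisjointSets(len(seqs))
    heap = []
    for i, j in combinations(range(len(seqs)), 2):
        ident = percent_identity(seqs[i], seqs[j])
        if phyla[i] == phyla[j]:
            if ident >= same_phylum_cutoff:
                sets.union(i, j)
        elif ident >= cross_phylum_cutoff:
            heapq.heappush(heap, (-ident, i, j))
    return sets, heap


seqs = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "ACDEWWWWWW"]
phyla = ["A", "A", "B", "B"]
sets, heap = initial_pass(seqs, phyla)
```

Popping from `heap` with `heapq.heappop` then yields cross-phylum pairs in decreasing order of identity, which plays the role of `deleteMax` in the pseudocode's second loop.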

Pseudocode for Step 2

Step 2 (i.e., the CreateFullHierarchy() routine) is subdivided into three sub-steps. For Step 2a, the MergeSimilarSets() function finds cliques of similar sequence sets by applying the Bron-Kerbosch algorithm

**function** MergeSimilarSets(

**input:** sequence sets (

**output:** a reduced, non-redundant collection of sets and associated patterns.

//

Create a node for each input set

**for each** pair of sets **do**

**if** the smaller set intersects with < 80% of the larger set **then** continue;

Find pattern optimally discriminating sequences in sets I and J from other sequences;

//

**if** the two patterns intersect by < 33% or by < 5 pattern positions **then** continue;

LLR_{i,j} := LLR with foreground = set I, background = ¬(set J ∩ set I) & set J pattern.

LLR_{j,i} := LLR with foreground = set J, background = ¬(set J ∩ set I) & set I pattern.

**if** LLR_{i,j} ≥ 80% of LLR_{j,i} ⋀ LLR_{j,i} ≥ 80% of LLR_{i,j} **then** AddEdge( **end if**

**end for**

Find the cliques in the graph using the Bron-Kerbosch algorithm

**for each** clique **do**

Create a consensus set of those sequences present in ≥ 50% of the clique sets.

Compute pattern optimally discriminating consensus set from other sequences.

Replace the sets belonging to the clique with the consensus set and pattern.

**end for**

**end function**

By determining whether the sets substantially overlap, are roughly equal in size, and have similar discriminating patterns, the first two ‘if’ statements within MergeSimilarSets() merely prune the search by skipping over sets that are unlikely to correspond to the same protein subgroup. (Note that, if missed, sets corresponding to the same subgroup are likely to be detected in subsequent steps.) To determine whether two different yet overlapping sets correspond to the same functionally divergent subgroup, the procedure computes the BPPS log-likelihood using the pattern from one set with the partition defined by the other set, and vice versa. If the patterns are more or less interchangeable between sets, then an edge is added between the nodes corresponding to these sets. Next, the Bron-Kerbosch algorithm is used to identify set cliques, each of which is then merged into a single (consensus) set. MergeSimilarSets() is applied iteratively to the modified sets from the previous iteration until it fails to identify and combine any additional similar sets.
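The clique-finding and consensus steps of MergeSimilarSets() can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the graph is supplied as an adjacency dictionary (the LLR-based edge test is not reproduced), the basic Bron-Kerbosch recursion without pivoting is used, and the function names are hypothetical.

```python
from collections import Counter


def bron_kerbosch(adj, r=None, p=None, x=None):
    """Yield the maximal cliques of an undirected graph.

    adj maps each node to the set of its neighbours.
    r: current clique, p: candidate nodes, x: already-processed nodes.
    """
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield r
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}


def consensus_set(clique_sets, min_fraction=0.5):
    """Return the sequences present in at least min_fraction of the sets."""
    counts = Counter(seq for members in clique_sets for seq in members)
    needed = min_fraction * len(clique_sets)
    return {seq for seq, n in counts.items() if n >= needed}


# Nodes 0, 1 and 2 are mutually similar sets; node 3 is unconnected.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
cliques = sorted(sorted(c) for c in bron_kerbosch(adj))

merged = consensus_set([{"a", "b", "c"}, {"a", "b"}, {"a", "d"}])
```

Here `cliques` contains the triangle and the isolated node, and `merged` keeps only the sequences found in at least half of the three example sets.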

Step 2b combines subgroup sets into larger supersets using the following FindSuperSets() function:

**function** FindSuperSets(

**input:** sequence sets (

**for each** Set **do** Assign it to a unique disjoint set **end for**

**for each** pair of sets **do**//

**if** the intersection of the smaller set ≥ 66% of the larger set **then**

Assign both sets to the same disjoint set;

**endif**

**end for**

**for each** Disjoint set ‘dset’ containing at least 2 subsets **do**

Superset := the union of the subsets;

Superpattern := the pattern optimally discriminating the Superset from ¬Superset;

**if** any subsets in dset fail to contribute their ‘fair share’ **then**

Remove these subsets from dset and repeat from the start of this ‘for’ loop

**else** Save the superset and superpattern **end if**

**end for**

**return:** The saved supersets and superpatterns.

**end function**

FindSuperSets() first identifies collections of (possibly minimally) overlapping sequence sets as possible candidates for merging into supersets. Next, it combines into a superset those sets that contribute their ‘fair share’ to the optimum LLR for the proposed superset—where the ‘fair share’ is defined as contributing at least 80% of the estimated average contribution of each sequence to the LLR times the number of sequences in the subset. (Based on the statistical formulation

Next the function CreateSuperSets() is called to create additional supersets from the current sets that fail to overlap or that overlap only moderately. As long as new supersets are created, this function is called repeatedly (this merges subsets into supersets that might otherwise have been overlooked).

**function** CreateSuperSets(

**input**: sequence sets (

**output**: new supersets.

**for each** set I **do**

SuperSet := set I; SuperPattern := Ø;

**for each** set J that at least slightly overlaps with set I **do**

Set X := SuperSet ∪ set J;

Pattern X := the pattern optimally discriminating set X from ¬X;

**if** both SuperSet & set J contribute their ‘fair share’ **then**

SuperSet := Set X; SuperPattern := Pattern X;

**end if**

**end for**

**if** set I ⊂ SuperSet **then** save the current Superset **endif**

**end for**

**end function**
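The ‘fair share’ test used by FindSuperSets() and CreateSuperSets() can be written out as a small Python sketch. It follows the definition given above (at least 80% of the average per-sequence LLR contribution times the subset size). How a subset's own contribution to the superset LLR is computed is not shown here, so `subset_llr` is simply taken as an input, and the function name is hypothetical.

```python
def contributes_fair_share(subset_llr, subset_size, total_llr, total_size,
                           fraction=0.8):
    """Return True if a subset contributes its 'fair share' to the superset LLR.

    subset_llr  -- the subset's contribution to the superset LLR (assumed given)
    subset_size -- number of sequences in the subset
    total_llr   -- optimum LLR for the proposed superset
    total_size  -- number of sequences in the superset
    fraction    -- required fraction of the average contribution (0.8 = 80%)
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    average_per_sequence = total_llr / total_size
    return subset_llr >= fraction * average_per_sequence * subset_size


# Superset of 100 sequences with LLR 1000: average contribution is 10 per
# sequence, so a 20-sequence subset must contribute at least 0.8 * 10 * 20.
ok = contributes_fair_share(170.0, 20, 1000.0, 100)
too_low = contributes_fair_share(150.0, 20, 1000.0, 100)
```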

Step 2c uses the sets obtained in the previous steps to construct a tree hierarchy, from which an FD-table is then obtained, along with corresponding seed alignments and initial partitions, as follows:

**function** CreateTree(

**input:** sequence sets (

**output:** a FD-table + corresponding starting subgroup sets, patterns, and seed alignments

wdiGrph := RtnDiGraph(

Tree := ShortestPathTree(wdiGrph);//

Tree := RefineTree(Tree);//

FD-Table := TreeToFDtable(Tree);

sma := CreateSeedAlignments(Tree); //

**end function**

where the RtnDiGraph() function is defined as:

**function** RtnDiGraph (

**input:** sequence sets (

**output:** a weighted directed acyclic graph representing the set relationships.

Create a weighted directed graph where each set is a node

**for each** pair of sets **do**//

**if** setI is smaller than setJ **then** continue

**else if** setI ∩ setJ < 50% of setI **then** continue

**else if** setJ < 33% larger than setI ∩ setJ **then** continue **endif**

Compute the optimum pattern and LLR for set I versus set J – set I;

**if** LLR is not significant **then** continue;

**if** Set I fails to contribute its ‘fair share’ **then** continue;

Add an arc pointing from node J to node I & weighted by –LLR;

**end for**

**for each** set that lacks a Superset **do**

Compute the optimum pattern and LLR for the set versus the complementary set;

Add an arc pointing from the root to the corresponding node & weighted by –LLR

**end for**

**end function**
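One way to extract a shortest-path tree from such a weighted DAG is to relax arcs in topological order, which remains valid with the negative (–LLR) weights used here. The Python sketch below is an assumption about how this could be done, not a description of the ShortestPathTree() routine itself; the arc dictionary, node labels and function name are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def shortest_path_tree(arcs, root):
    """Return (parent, dist) for shortest paths from root in a weighted DAG.

    arcs maps (superset_node, subset_node) to a weight such as -LLR.
    Negative weights are fine because the graph is acyclic.
    """
    nodes = {root}
    predecessors, successors = {}, {}
    for (u, v), w in arcs.items():
        nodes.update((u, v))
        predecessors.setdefault(v, set()).add(u)
        successors.setdefault(u, []).append((v, w))

    # static_order() lists every node after all of its predecessors.
    order = TopologicalSorter(
        {n: predecessors.get(n, set()) for n in nodes}
    ).static_order()

    dist = {n: float("inf") for n in nodes}
    dist[root] = 0.0
    parent = {}
    for u in order:
        if dist[u] == float("inf"):
            continue  # not reachable from the root
        for v, w in successors.get(u, []):
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                parent[v] = u
    return parent, dist


# Arcs point from superset to subset and are weighted by -LLR.
arcs = {
    ("root", "A"): -10.0,
    ("root", "B"): -4.0,
    ("A", "B"): -3.0,
    ("A", "C"): -2.0,
    ("B", "C"): -5.0,
}
parent, dist = shortest_path_tree(arcs, "root")
```

Because the weights are negated LLRs, the shortest path to a node is the chain of supersets with the largest summed LLR, so each set ends up attached to the parent on that chain.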

Note that the RtnDiGraph () function returns a directed acyclic graph (DAG), for which the ShortestPathTree() algorithm

The sequence sets corresponding to the tree returned by the ShortestPathTree() algorithm are still fuzzily defined: they typically contain sequences that belong to one or more distinct protein subgroups and are therefore not proper subsets of their respective supersets. The following RefineTree() function eliminates inappropriate overlap between sets and also removes nodes from the tree that, as a result of the refinement process, are no longer statistically significant:

**function** RefineTree (

**input:** a tree where each node corresponds to a sequence set

**output:** refined tree

**do**

**do**//

Find the arc with the lowest weight (i.e., with the lowest subset-to-superset LLR);

**if** this LLR is not significant **then**

Remove the arc and the child (subset) node from the tree;

Connect the children of the removed node to the parent of that node;

Merge the set corresponding to the removed node into the parent set;

**end if**

**while** an arc has been removed;

**do** //

Label the leaf nodes as ‘candidates’ and leave other nodes unlabeled.

**for each** pair of nodes **do**

**if** both

**else if** one node is the root **then** continue;

**else if** one node is ‘fixed’ and the other is a ‘candidate’ **then**

remove all overlapping sequences from ‘candidate’ node;

**else if** both nodes are candidates **then**

**for each** sequence **do**

remove

**end for**

**end if**

**end for**

Label all current ‘candidate’ nodes as ‘fixed’;

Label as ‘candidates’ all nodes whose subtree consists entirely of labeled nodes;

**while** some nodes were newly labeled as candidates;

Define the root node set as containing all sequences absent from the other node sets;

Merge each leaf node with only a few sequences into its parent node;

Merge nodes with a single child into their parent nodes; //

Relocate nodes that, due to previous step, are no longer properly placed in the tree.

**while** the tree has been changed in any way;

**end function**

The tree returned by the RefineTree() function is output as a Newick-format character string (a formal language specification for trees), which is then parsed and translated into an FD-table within the CreateTree() routine. This routine also creates a seed alignment for each row in the FD-table using a few of the most characteristic sequences in each set. These, along with the corresponding patterns (one for each column), are then used as input to the mcBPPS procedure (Step 3).
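As an illustration of the Newick output mentioned above, the following Python sketch serializes a rooted tree with labelled internal nodes. The child-list dictionary, the node labels and the function name are assumptions made for the example. The translation into an FD-table is not shown, since its format is not specified in this section.

```python
def to_newick(children, node):
    """Serialize the subtree rooted at node as a Newick string (no semicolon).

    children maps a node label to the list of its child labels;
    nodes absent from the mapping are treated as leaves.
    """
    kids = children.get(node, [])
    if not kids:
        return str(node)
    inner = ",".join(to_newick(children, kid) for kid in kids)
    return "(" + inner + ")" + str(node)


# A small hierarchy: the root has subgroups A and B; A has two subgroups.
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
newick = to_newick(children, "root") + ";"
```

For this example `newick` is `((A1,A2)A,B)root;`, with each internal node label written after the parenthesized list of its children.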

Abbreviations

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AFN designed and implemented the algorithm, performed the jackknife analyses and simulations, generated the multiple sequence alignments used as input to the amcBPPS program, ran the programs and wrote the initial draft of the manuscript. CL and AMB converted CDD alignments and hierarchies into appropriate formats for analysis and provided additional CDD information as required for this study. All authors evaluated the output files and read, revised and approved the manuscript.

Acknowledgements

We thank Art Delcher for critical reading of the manuscript. Funding for AFN was provided by the University of Maryland and the NIH Division of General Medicine Grant GM078541. Funding for CL and AMB was provided by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health/DHHS. Funding to pay the Open Access publication charges for this article was provided, in part, by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health/DHHS.