Figure 1.

Information-preserved protein clustering example. Once a consensus sequence is selected, the members of a cluster are merged into the consensus one by one. This figure illustrates how the information of a member sequence is merged into the consensus sequence. An amino acid followed by two zeros indicates an annotated single amino acid polymorphism (SAP). Every annotated post-translational modification (PTM) carries a two-digit positive integer that distinguishes different modifications. Differences in the primary sequences between a member and the consensus introduce cluster-induced SAPs. In this example, the residues Q and A (in red) in the consensus differ from the residues K and V (in blue) in the member sequence. As a consequence, K becomes a cluster-induced SAP associated with Q, and V becomes a cluster-induced SAP associated with A. The annotated SAP, ⟨{W00}⟩, associated with residue R in the member sequence is merged into the consensus sequence; see the updated consensus sequence in the figure. Note that the annotated PTM, ⟨(N11)⟩, associated with N in the member sequence is merged with a different annotated PTM, ⟨(N08)⟩, at the same site of the consensus sequence. In this figure, all the merged information from the member sequence is shown in blue to indicate that during the searches we can choose to respect the correlated information from each member sequence separately. Respecting the correlated information means that, when scoring the peptide segment LQ ⟨{K00}⟩ RLVA ⟨{V00}⟩ DR of the consensus sequence, RAId_DbS considers only the combinations L(red Q)RLV(red A)DR and L(blue K)RLV(blue V)DR, but not L(red Q)RLV(blue V)DR and L(blue K)RLV(red A)DR. With the choice to distinguish the SAPs/PTMs originating from individual member sequences, RAId_DbS can target documented SAP/PTM combinations associated with a certain disease (if any exist) and can avoid scoring unnecessary SAP/PTM combinations when several variable sites occur within a peptide.
However, we currently find almost no instances of multiple variable sites within a short peptide in any of the databases we have constructed. Therefore, the feature of respecting correlated information is implemented only in our in-house version, not yet in the web version. Furthermore, not enforcing the integrity of correlated information also allows for novel SAP discovery in a controlled fashion, meaning that one looks for SAPs with local precedence. Finally, let us emphasize that although the SAPs and PTMs are merged, each annotation's origin and disease associations are kept in the processed definition file, allowing for faithful information retrieval at the final reporting stage of the RAId_DbS program.
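
The difference between respecting and not respecting correlated information can be sketched in Python as follows. This is a minimal illustration, not the RAId_DbS implementation: the function names `merge_member` and `candidate_peptides`, the label `member1`, and the data layout are hypothetical. It assumes the member and consensus are already aligned and of equal length, and it handles only cluster-induced SAPs (annotated SAPs and PTMs are omitted).

```python
from itertools import product


def merge_member(consensus, member, member_id, variants=None):
    """Record cluster-induced SAPs: residues where the member differs
    from the consensus, tagged with the member they came from.

    Assumption: both sequences are pre-aligned and of equal length.
    """
    if len(consensus) != len(member):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    variants = {} if variants is None else variants
    for pos, (c, m) in enumerate(zip(consensus, member)):
        if c != m:
            # Keep the origin so correlated sites can be scored together.
            variants.setdefault(pos, []).append((m, member_id))
    return variants


def candidate_peptides(consensus, variants, respect_correlation):
    """List the peptide strings that would be considered for scoring."""
    if respect_correlation:
        # One candidate per source: the consensus itself, plus each member
        # with all of its own substitutions applied together.
        peptides = [consensus]
        member_ids = sorted({mid for subs in variants.values() for _, mid in subs})
        for mid in member_ids:
            residues = list(consensus)
            for pos, subs in variants.items():
                for residue, owner in subs:
                    if owner == mid:
                        residues[pos] = residue
            peptides.append("".join(residues))
        return peptides
    # Otherwise every variable site varies independently.
    choices = []
    for pos, c in enumerate(consensus):
        options = [c] + [residue for residue, _ in variants.get(pos, [])]
        choices.append(list(dict.fromkeys(options)))  # drop duplicate residues
    return ["".join(p) for p in product(*choices)]


# The peptide segment from the figure: consensus LQRLVADR, member LKRLVVDR.
variants = merge_member("LQRLVADR", "LKRLVVDR", "member1")
correlated = candidate_peptides("LQRLVADR", variants, respect_correlation=True)
independent = candidate_peptides("LQRLVADR", variants, respect_correlation=False)
print(correlated)   # two candidates
print(independent)  # four candidates
```

With correlation respected, only the consensus and member forms of the segment are produced; without it, the two mixed forms (LQRLVVDR and LKRLVADR) are produced as well, which corresponds to the controlled search for novel SAPs described above.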

Alves et al. BMC Genomics 2008 9:505   doi:10.1186/1471-2164-9-505