|
Figure 1.
Information-preserved protein clustering example. Once a consensus sequence is selected, members of a cluster are merged into the
consensus one-by-one. This figure illustrates how the information of a member sequence
is merged into the consensus sequence. An amino acid followed by two zeros indicates
an annotated SAP. Every annotated PTM has a two-digit positive integer that is used
to distinguish different modifications. The difference in the primary sequences between
a member and the consensus introduces cluster-induced SAPs. In this example, the residues Q and A (in red) in the consensus are different
from the residues K and V (in blue) in the member sequence. As a consequence, K becomes
a cluster-induced SAP associated with Q and V becomes a cluster-induced SAP associated
with A. The annotated SAP, ⟨{W00}⟩, associated with residue R in the member sequence
is merged into the consensus sequence (see the updated consensus sequence in the figure).
Note that the annotated PTM, ⟨(N11)⟩, associated with N in the member sequence is
merged with a different annotated PTM, ⟨(N08)⟩, at the same site of the consensus
sequence. In this figure, all the merged information from the member sequence is
shown in blue to indicate that during the searches we can choose to respect
the correlated information from each member sequence separately. To respect the correlated information
means that, when scoring the peptide segment LQ ⟨{K00}⟩ RLVA ⟨{V00}⟩ DR of the consensus
sequence, RAId_DbS only considers the combinations L(red Q)RLV(red A)DR and L(blue
K)RLV(blue V)DR, but not L(red Q)RLV(blue V)DR or L(blue K)RLV(red A)DR. Having the
choice to distinguish the SAPs/PTMs originating from individual member sequences, RAId_DbS
can target documented SAP/PTM combinations associated with certain diseases (if
they exist) and can avoid scoring unnecessary SAP/PTM combinations when there are several
variable sites occurring within a peptide. However, we currently find almost no incidence
of multiple variable sites within a short peptide in any of the databases we have constructed.
Therefore, the feature of respecting correlated information is only implemented in
our in-house version, not yet in the web version. Furthermore, not enforcing the integrity
of correlated information also allows for novel SAP discovery in a controlled fashion,
meaning that one looks for SAPs with local precedence. Finally, let us emphasize that, although the SAPs and PTMs are merged, each
annotation's origin and disease associations are kept in the processed definition
file, allowing for faithful information retrieval at the final reporting stage of
the RAId_DbS program.
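The difference between respecting and not respecting correlated information can be illustrated with a minimal Python sketch. This is not RAId_DbS code. The function names `merge_member` and `peptide_variants`, the dictionary layout, and the restriction to equal-length, pre-aligned sequences are all assumptions made for illustration, and PTMs are left out for brevity. The sketch records each cluster-induced SAP together with the member it came from, then enumerates the candidate peptides for a segment in both modes.

```python
from itertools import product


def merge_member(consensus, member, member_id, variants):
    """Record cluster-induced SAPs from one member (hypothetical helper).

    `variants` maps a 0-based consensus position to a list of
    (residue, member_id) pairs, so the origin of each SAP is preserved.
    Assumes the two sequences are already aligned and of equal length.
    """
    if len(consensus) != len(member):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    for pos, (c, m) in enumerate(zip(consensus, member)):
        if c != m:
            variants.setdefault(pos, []).append((m, member_id))
    return variants


def peptide_variants(consensus, variants, start, end, respect_correlation=True):
    """List the candidate peptides for consensus[start:end] (hypothetical helper)."""
    segment = consensus[start:end]
    sites = [p for p in range(start, end) if p in variants]
    results = {segment}  # the unmodified consensus peptide is always a candidate
    if respect_correlation:
        # One candidate per member: apply all of that member's SAPs together.
        sources = {src for p in sites for _, src in variants[p]}
        for src in sources:
            residues = list(segment)
            for p in sites:
                for res, s in variants[p]:
                    if s == src:
                        residues[p - start] = res
            results.add("".join(residues))
    else:
        # Every site varies independently: all combinations are candidates.
        options = [[consensus[p]] + [res for res, _ in variants[p]] for p in sites]
        for choice in product(*options):
            residues = list(segment)
            for p, res in zip(sites, choice):
                residues[p - start] = res
            results.add("".join(residues))
    return sorted(results)


consensus = "LQRLVADR"  # segment of the consensus from the figure
member = "LKRLVVDR"     # same segment of the member (K for Q, V for A)
variants = merge_member(consensus, member, "member_1", {})

correlated = peptide_variants(consensus, variants, 0, len(consensus))
uncorrelated = peptide_variants(consensus, variants, 0, len(consensus),
                                respect_correlation=False)
```

With the segment from this figure, the correlated mode yields only LQRLVADR and LKRLVVDR, while the uncorrelated mode additionally yields the mixed peptides LQRLVVDR and LKRLVADR.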
Alves et al. BMC Genomics 2008 9:505 doi:10.1186/1471-2164-9-505 |