Table 2

A text mining approach using an entropy-based scoring function rediscovers the molecular function of proteins sharing PROSITE motifs

Motif              # of proteins  # of documents  Terms

EF_HAND            36             183             ef-hand; calcium-bind; calcium; ca 2+; calcium-bind protein; ca; 2+ bind; 2+; ef-hand motif; calmodulin

TRYPSIN_SER        11             108             serin proteinas; proteinas; chymotrypsin; serin; serin proteas; elastase; ser-195; his-57; proteinas especially; proteolyt

PROTEIN_KINASE_ST  15             107             protein kinas; catalyt domain; phosphoryl; substrat; autophosphoryl; phosphoryl site; kinas; threonin; catalyt; constitutively active

The method extracts text from the abstracts of references annotated in each protein's Swiss-Prot record, pre-processes the text (tokenization into terms, removal of non-content words, and basic stemming to normalize word forms), and scores terms based on their distribution across proteins and their relative significance in the entire corpus of Swiss-Prot referenced documents. With no additional normalization, concept and word redundancy may be observed. Although still very preliminary, the method is able to capture the molecular function for each cluster of proteins shown: "ef-hand" and "calcium binding" for EF_HAND; "serine proteinase", "proteolysis", and the active site residues "ser-195" and "his-57" for TRYPSIN_SER; and "protein kinase", "phosphorylation", "catalytic domain" and the substrate residue "threonine" for PROTEIN_KINASE_ST.
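The pipeline described in this caption can be illustrated with a minimal sketch. The caption does not give the scoring formula, so everything below is an assumption made for illustration: the stop-word list, the suffix-stripping `crude_stem` helper, and the score itself, which multiplies the normalized entropy of a term's distribution across the proteins of a cluster by its log enrichment over a background corpus. It is not the authors' scoring function.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "by", "with", "for"}


def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ation", "ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9][a-z0-9+\-]*", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def score_terms(cluster_docs, background_docs):
    """Score the terms of one protein cluster.

    cluster_docs: dict mapping protein id -> list of abstract strings.
    background_docs: list of abstract strings from the whole corpus.
    Assumed score = (normalized entropy across proteins) * (log enrichment).
    """
    per_protein = {
        protein: Counter(t for doc in docs for t in preprocess(doc))
        for protein, docs in cluster_docs.items()
    }
    cluster_total = Counter()
    for counts in per_protein.values():
        cluster_total.update(counts)
    background = Counter(t for doc in background_docs for t in preprocess(doc))

    n_cluster = sum(cluster_total.values())
    n_background = sum(background.values())
    max_entropy = math.log(len(per_protein)) if len(per_protein) > 1 else 1.0

    scores = {}
    for term, count in cluster_total.items():
        # How evenly the term is spread over the proteins in the cluster.
        probs = [c[term] / count for c in per_protein.values() if c[term]]
        spread = -sum(p * math.log(p) for p in probs) / max_entropy
        # How much more frequent the term is here than in the background
        # (add-one smoothing avoids division by zero).
        cluster_freq = count / n_cluster
        background_freq = (background[term] + 1) / (n_background + 1)
        scores[term] = spread * math.log(cluster_freq / background_freq)
    return scores


# Toy input, invented for illustration only.
toy_cluster = {
    "P1": ["Calcium binding by the EF-hand motif"],
    "P2": ["The EF-hand binds calcium"],
}
toy_background = [
    "Kinase phosphorylation of the substrate",
    "Proteolysis by the serine proteinase",
]
toy_scores = score_terms(toy_cluster, toy_background)
top_terms = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy run, terms shared by both proteins and absent from the background ("calcium", "ef-hand", "bind") rank above "motif", which occurs for only one protein. The truncated forms in the table ("kinas", "phosphoryl", "substrat") are the kind of output such stemming produces.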

Halperin et al. BMC Genomics 2008 9(Suppl 2):S2   doi:10.1186/1471-2164-9-S2-S2
