|
Figure 4.
Steps involved in constructing the catalog of protein references. Terms are shown enclosed in rectangular boxes. Terms may occur in the context of a sentence
(shown on a horizontal line, left) or of an article (right). Step 1: Articles are
split into sentences, and sentences are split into tokens. Tokens roughly correspond
to words (see text for details). Tokens with high frequency that are not eliminated
by the exclusion lists (see Figure 1) are grouped into n-grams. In the figure, APE1/ref-1
is an n-gram that consists of two tokens, APE1 and ref-1, and can be recognized if
the two terms co-occur frequently in sequence in a full-length article. When the terms
are recognized, each occurrence of a term in sentences of the article is identified.
Step 2: Machine learning features are calculated from the context of the term in the
article (see text for details) and the support vector machine (SVM) model classifies
the context of the term. This yields a score for each context of a term. In our experimental
setup, smaller scores suggest that the context provides little evidence that the term
refers to a protein, while larger scores (in absolute value) indicate more support.
Step 3: We calculate the combined score (Sc) as the sum of the scores for each occurrence
of a given term in a given article. The final catalog consists of a table with one
row per term and article. Each row has three columns: PubMedID, term, and Sc.
Shi and Campagne BMC Bioinformatics 2005 6:88 doi:10.1186/1471-2105-6-88 |
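
Step 3 of the caption can be illustrated with a minimal Python sketch. It assumes the per-occurrence SVM scores are already available as (PubMedID, term, score) tuples and simply sums them per term and article, as the caption describes. The function name build_catalog, the input layout, and the placeholder identifiers and scores in the usage note are assumptions made for illustration; they are not taken from the article.

```python
from collections import defaultdict


def build_catalog(scored_contexts):
    """Sum per-occurrence scores into one combined score (Sc) per (PubMedID, term).

    scored_contexts: iterable of (pubmed_id, term, score) tuples, one tuple
    per occurrence of a term in an article (hypothetical input layout).
    Returns a list of (pubmed_id, term, sc) rows sorted by PubMedID and term.
    """
    totals = defaultdict(float)
    for pubmed_id, term, score in scored_contexts:
        # Sc is the plain sum of the scores of all occurrences of the term
        # in the given article.
        totals[(pubmed_id, term)] += score
    return [(pmid, term, sc) for (pmid, term), sc in sorted(totals.items())]
```

For example, with invented placeholder values, build_catalog([("A", "APE1/ref-1", 1.5), ("A", "APE1/ref-1", 0.5), ("B", "APE1/ref-1", -0.25)]) produces one row for article "A" with Sc = 2.0 and one row for article "B" with Sc = -0.25, matching the one-row-per-term-and-article layout of the catalog.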