Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation
* Corresponding author: Paul W Sternberg pws@caltech.edu
1 Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA
2 Howard Hughes Medical Institute and Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA
3 California Department of Transportation, San Bernardino, California 92401, USA
BMC Bioinformatics 2009, 10:228 doi:10.1186/1471-2105-10-228
Published: 21 July 2009
Additional files
Additional file 1:
Training set corpus and true positive sentences for Cellular Component category development. This file contains the sentences selected as true positives (the gold-standard set) for Cellular Component curation category development. Sentences are listed by the unique PubMed identifier of the publication and by the sentence number assigned during the Textpresso PDF-to-text conversion.
Format: TXT Size: 310KB
Additional file 2:
Category terms. This file lists terms included in the first draft of the three new Textpresso categories used for GO Cellular Component curation: Cellular Components, Assay Terms, and Verbs.
Format: TXT Size: 3KB
Additional file 3:
Annotation test corpus. This file contains a list of PubMed identifiers for papers included in the test set used to evaluate the new Textpresso categories.
Format: TXT Size: 1KB
Additional file 4:
Curation efficiency test corpus. This file contains a list of PubMed identifiers for papers included in the curation efficiency test.
Format: TXT Size: 1KB
Additional file 5:
WormBase Cellular Component curation form. This file shows a screenshot of the web-based curation form used for Textpresso-based GO curation at WormBase. The identified C. elegans protein is listed in the left-most box, and the component term(s) from the sentence are listed in the middle box. Suggested GO annotations, based upon previous curation, are listed in the right-most box. The sentence from which the protein and component are derived is shown on the right, along with the actions available to a curator: curating the information (and adding it to GO), marking the information as already curated, or marking the returned sentence as 'scrambled', as a false positive, or as not GO curatable for other reasons, e.g. because the sentence describes localization in a mutant background.
Format: PPT Size: 54KB
This file can be viewed with: Microsoft PowerPoint Viewer
