BMC Bioinformatics

Open Access, Highly Accessed, Methodology article

Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

Kimberly Van Auken1, Joshua Jaffery1,3, Juancarlos Chan1, Hans-Michael Müller1 and Paul W Sternberg1,2*

Author Affiliations

1 Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA

2 Howard Hughes Medical Institute and Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA

3 California Department of Transportation, San Bernardino, California 92401, USA

BMC Bioinformatics 2009, 10:228 doi:10.1186/1471-2105-10-228

Published: 21 July 2009

Additional files

Additional file 1:

Training set corpus and true positive sentences for Cellular Component category development. This file contains the sentences selected as true positives (the gold-standard set) for Cellular Component curation category development. Sentences are listed by the unique PubMed identifier of the publication and by the sentence number assigned during the Textpresso PDF-to-text conversion.

Format: TXT; Size: 310KB

Open Data

Additional file 2:

Category terms. This file lists terms included in the first draft of the three new Textpresso categories used for GO Cellular Component curation: Cellular Components, Assay Terms, and Verbs.

Format: TXT; Size: 3KB

Open Data

Additional file 3:

Annotation test corpus. This file contains a list of PubMed identifiers for papers included in the test set used to evaluate the new Textpresso categories.

Format: TXT; Size: 1KB

Open Data

Additional file 4:

Curation efficiency test corpus. This file contains a list of PubMed identifiers for papers included in the curation efficiency test.

Format: TXT; Size: 1KB

Open Data

Additional file 5:

WormBase Cellular Component curation form. This file shows a screenshot of the web-based curation form used for Textpresso-based GO curation at WormBase. The identified C. elegans protein is listed in the left-most box, and the component term(s) from the sentence are listed in the middle box. Suggested GO annotations, based upon previous curation, are listed in the right-most box. The sentence from which the protein and component are derived is shown on the right, along with the actions a curator can take: curating the information (and adding it to GO), marking the information as already curated, or marking the returned sentence as 'scrambled', false positive, or not GO curatable for additional reasons, e.g. because the sentence describes localization in a mutant background.

Format: PPT; Size: 54KB

Open Data