Table 3

Concept annotation attributes of corpora
corpus/corpora | total # words/tokens | # & type of documents | domain(s) | annotation concept schema(s) | total # concept annotations
CRAFT Corpus (full/initial release) | ~790,000/~560,000 | 97/67 articles | sources of MGI annotations of mouse genes/gene products | Open Biomedical Ontologies (CL, ChEBI, SO, PRO, GO BP/CC/MF, NCBITaxon), Entrez Gene | ~140,000/~100,000
ABGene |  | 4,265 sentences |  | n/a | ~8,200
BioInfer | ~34,000/~30,000[f] | 1,100 sentences | protein-protein interactions | ~100 entity classes, ~100 relationships | ~6,300 named entities, ~2,700 relationships[g]
CALBC corpus | ~16,000,000 | 150,000 abstracts | immunology | UniProt, NCBITaxon, UMLS[h] | ~2,700,000
CLEF Corpus |  | various[i] | clinical/cancer data | 6 concept types | 
FetchProt Corpus |  | 200 articles | protein tyrosine kinase activity | 10 concept types, UniProt | ~3,800
4th i2b2/VA Challenge Corpus |  | ~750 discharge summaries | clinical data | 3 concept types | ~2,000
GENETAG | ~548,000 | 20,000 sentences |  | n/a | ~25,000 genes/proteins, ~19,000 alternative lexical forms
GENIA 3.0 | ~440,000 | 2,000 abstracts | human blood-cell transcription factors | 35 entity classes, 34 process classes | ~93,000 entities, ~36,000 events
GREC |  | 240 abstracts | E. coli gene regulation | 433 classes | ~5,000
ITI TXM PPI/TE Corpora | ~2,000,000/~1,900,000 | 217/238 articles | protein-protein interactions/tissue expression | 9/13 concept types, Entrez Gene, RefSeq[j], ChEBI, MeSH, NCBITaxon[k] | ~160,000/~164,000
MedPost | ~156,000 |  |  |  | 
OntoNotes 2.0 | ~500,000 | 1,000 newswire documents | English & Chinese news | 1000s of WordNet senses, 50 concept types[l] | ~58,000 verbs[m]
PennBioIE Oncology/CYP v1.0 Corpora | ~381,000 (~327,000)/~313,000 (~274,000) | 1,414/1,100 abstracts | medical genetics of oncology/inhibition of cytochrome P450 enzymes | n/a | 
Yapex Corpus |  | 200 abstracts | protein-protein interactions | n/a | ~3,700

[f] BioInfer has ~34,000 tokens total, and ~30,000 excluding punctuation.

[g] BioInfer has ~6,300 named-entity annotations and ~2,700 annotations of what are termed relationships but that might more properly be conceptualized as process or state classes and thus are included here, totaling ~9,000 concept annotations.

[h] In the CALBC corpus, NCBI Taxonomy and UMLS concepts were respectively used to mark up species and disease mentions.

[i] The CLEF Corpus is composed of many types of medical documents: 2 entire patient records (themselves composed of 9 narratives, 1 imaging report, 7 histopathology reports, and associated data) and 50 each of clinical narratives, histopathology reports, and imaging reports.

[j] The annotators of the ITI TXM Corpora attempted to assign Entrez Gene IDs to gene annotations and RefSeq IDs to annotations of proteins, mRNAs, and cDNAs (although it is admitted that this assignment was very time-consuming and thus was not performed on the training subset of the PPI Corpus).

[k] The annotators of the ITI TXM Corpora used ChEBI, MeSH, and NCBI Taxonomy concepts for drug, tissue, and sequence mentions.

[l] In OntoNotes, the 700 most frequent polysemous verbs and 1,100 most frequent polysemous nouns have been annotated with the appropriate senses of WordNet 2.0, so the size of the schema (i.e., the total number of senses of these 1,800 words) likely numbers in the thousands. However, the OntoNotes creators note that this differs from their ontological annotation, for which only approximately 50 concept types are being used to subsume the annotated word senses.

[m] In addition to ~58,000 annotated verbs, OntoNotes has an unstated but presumably large count of annotated nouns.

A summary of counts of words/tokens, of counts and types of component documents, of domains, and of counts of concept annotations for the CRAFT Corpus and related corpora.

Bada et al. BMC Bioinformatics 2012 13:161   doi:10.1186/1471-2105-13-161