|Concept annotation attributes of corpora|
|corpus/corpora||total # words/tokens||# & type of documents||domain(s)||annotation concept schema(s)||total # concept annotations|
|CRAFT Corpus (full/initial release)||~790,000/~560,000||97/67 articles||sources of MGI annotations of mouse genes/gene products||Open Biomedical Ontologies (CL, ChEBI, SO, PRO, GO BP/CC/MF, NCBITaxon), Entrez Gene||~140,000/~100,000|
|BioInfer||~34,000/~30,000^f||1,100 sentences||protein-protein interactions||~100 entity classes, ~100 relationships||~6,300 named entities, ~2,700 relationships^g|
|CALBC corpus||~16,000,000||150,000 abstracts||immunology||UniProt, NCBITaxon, UMLS^h||~2,700,000|
|CLEF Corpus||various^i||clinical/cancer data||6 concept types|
|FetchProt Corpus||200 articles||protein tyrosine kinase activity||10 concept types, UniProt||~3,800|
|4th i2b2/VA Challenge Corpus||~750 discharge summaries||clinical data||3 concept types||~2,000|
|GENETAG||~548,000||20,000 sentences||n/a||~25,000 genes/proteins, ~19,000 alternative lexical forms|
|GENIA 3.0||~440,000||2,000 abstracts||human blood-cell transcription factors||35 entity classes, 34 process classes||~93,000 entities, ~36,000 events|
|GREC||240 abstracts||E. coli gene regulation||433 classes||~5,000|
|ITI TXM PPI/TE Corpora||~2,000,000/~1,900,000||217/238 articles||protein-protein interactions/tissue expression||9/13 concept types, Entrez Gene, RefSeq^j, ChEBI, MeSH, NCBITaxon^k||~160,000/~164,000|
|OntoNotes 2.0||~500,000||1,000 newswire documents||English & Chinese news||1000s of WordNet senses, 50 concept types^l||~58,000 verbs^m|
|PennBioIE Oncology/CYP v1.0 Corpora||~381,000 (~327,000)/~313,000 (~274,000)||1,414/1,100 abstracts||medical genetics of oncology/inhibition of cytochrome P450 enzymes||n/a|
|Yapex Corpus||200 abstracts||protein-protein interactions||n/a||~3,700|
^f BioInfer has ~34,000 tokens total, and ~30,000 excluding punctuation.
^g BioInfer has ~6,300 named-entity annotations and ~2,700 annotations of what are termed relationships but that might more properly be conceptualized as process or state classes and thus are included here, totaling ~9,000 concept annotations.
^h In the CALBC corpus, NCBI Taxonomy and UMLS concepts were respectively used to mark up species and disease mentions.
^i The CLEF Corpus is composed of many types of medical documents: 2 entire patient records (themselves composed of 9 narratives, 1 imaging report, 7 histopathology reports, and associated data) and 50 each of clinical narratives, histopathology reports, and imaging reports.
^j The annotators of the ITI TXM Corpora attempted to assign Entrez Gene IDs to gene annotations and RefSeq IDs to annotations of proteins, mRNAs, and cDNAs (although it is admitted that this assignment was very time-consuming and thus was not performed on the training subset of the PPI Corpus).
^k The annotators of the ITI TXM Corpora used ChEBI, MeSH, and NCBI Taxonomy concepts for drug, tissue, and sequence mentions.
^l In OntoNotes, the 700 most frequent polysemous verbs and 1,100 most frequent polysemous nouns have been annotated with the appropriate senses of WordNet 2.0, so the size of the schema (i.e., the total number of senses of these 1,800 words) likely numbers in the thousands; however, they note that this is different from their ontological annotation, for which only approximately 50 concept types are being used to subsume the annotated word senses.
^m In addition to ~58,000 annotated verbs, OntoNotes has an unstated but presumably large count of annotated nouns.
A summary of counts of words/tokens, of counts and types of component documents, of domains, and of counts of concept annotations for the CRAFT Corpus and related corpora.
Bada et al. BMC Bioinformatics 2012 13:161 doi:10.1186/1471-2105-13-161