Table 4

Summary of mistakes and curator comments following the task 2 evaluation.



Predicting obsolete GO terms

Strip obsolete GO terms, i.e. children of obsolete molecular function (GO:0008369), obsolete cellular component (GO:0008370) and obsolete biological process (GO:0008371) [25].

Predicting GO terms from Materials and Methods, e.g. a 'pH' value yielded 'PH domain binding' (GO:0042731), and 'CHO cell line' yielded numerous GO terms containing 'acetylcholine'.

Only look in certain sections of the paper for features. See Table 1 for GOA.

Predicting plant GO terms for human proteins, e.g. germination (GO:0009844)

Look at GO Documentation on sensu [24] and strip out unnecessary GO terms.

Highlighting too much text

Set a limit on the evidence text highlight so that it is useful for curators. Limit to <5 lines.

Over-predicting GO terms from one line of text

It is more important to the curator to choose a higher-level term that is correct than to be too specific and incorrect.

Common GO terms predicted out of context, e.g. the text 'mapped to chromosome 3q26' yielded the GO component term 'chromosome' (GO:0005694), although the text indicates the chromosome number, not where the protein functions. Similarly, the text '249 amino acid' yielded multiple GO terms, e.g. 'amino acid activation' (GO:0043038).

Most papers will mention chromosome location and the amino acid length of a sequence.

Do not predict GO terms from text if the words 'chromosome' or 'amino acid' in the evidence text are accompanied by a number.

Choosing first paragraph of paper as supporting text

Although a lot of information can be found in the introduction of a paper, the task was to choose the highlight that supported the GO term.

Whole paragraph highlights do not speed up the curation process. Limit to <5 lines.

Difficulty in interpreting word order e.g. 'RNA binding protein' yielded the incorrect GO prediction 'protein binding'

Difficulty in predicting correct taxonomic origin of protein.

This can also be difficult for a curator, given lack of evidence in text.

Too many low confidence runs

Only submit data with a high confidence level for evaluation. Limit participants to their best run/technique (there was little difference between runs, which led to repeat evaluations).
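Two of the suggestions in the table (stripping obsolete GO terms, and ignoring 'chromosome' or 'amino acid' when accompanied by a number) could be applied as a simple post-filter on predictions. The following is a minimal sketch only, not a method used by any participant. The function names, the `parents` mapping (child GO ID to parent GO IDs) and the GO IDs beginning `GO:99` are hypothetical, and "accompanied by a number" is assumed to mean a digit-bearing token directly before or after the keyword.

```python
import re

# Assumption: a number "accompanies" the keyword when a token starting with a
# digit sits directly before or after it ("249 amino acid", "chromosome 3q26").
_KEYWORD = r"(?:chromosomes?|amino[\s-]acids?)"
_NUMERIC_CONTEXT = re.compile(
    rf"\b\d[\w.]*[\s-]+{_KEYWORD}\b|\b{_KEYWORD}[\s-]+\d[\w.]*",
    re.IGNORECASE,
)

# Root terms for obsolete molecular function, cellular component and
# biological process respectively.
OBSOLETE_ROOTS = {"GO:0008369", "GO:0008370", "GO:0008371"}


def has_numeric_context(evidence_text):
    """True if 'chromosome' or 'amino acid' appears next to a number."""
    return _NUMERIC_CONTEXT.search(evidence_text) is not None


def is_obsolete(go_id, parents):
    """True if go_id is, or descends from, one of the obsolete root terms.

    `parents` is a hypothetical dict mapping each GO ID to its parent IDs.
    """
    seen = set()
    stack = [go_id]
    while stack:
        term = stack.pop()
        if term in OBSOLETE_ROOTS:
            return True
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return False


def keep_prediction(go_id, evidence_text, parents):
    """Keep a prediction only if it passes both filters."""
    return not is_obsolete(go_id, parents) and not has_numeric_context(evidence_text)
```

For example, with a toy `parents` mapping, `keep_prediction("GO:9999998", "mapped to chromosome 3q26", parents)` is rejected because of the numeric context, whichever term was predicted. A real system would load the parent relations from the ontology itself and would probably want a less blunt context test.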

Camon et al. BMC Bioinformatics 2005 6(Suppl 1):S17   doi:10.1186/1471-2105-6-S1-S17
