Schematic representation of the construction of the SWISS-PROT dataset used to assess the performance of automated protein function assignment methods. Test proteins (dark grey boxes) were initially selected by extraction from the entire SWISS-PROT database. The extraction protocol involved the removal of non-bacterial entries, the removal of entries created before 2005, the removal of entries containing the words 'UPF' or 'uncharacterized', the selection of entries that were added directly to SWISS-PROT or had been revised since their storage in TrEMBL, the removal of similarly annotated entries and the removal of entries showing sequence similarity to each other. The construction of the sequence similarity search results (light grey boxes) for functional inference included a sequence comparison against UniProt with BLAST, the removal of BLAST hits to sequences whose creation date was the same as or later than the annotation date of the query sequence, and the restoration of the annotations of the remaining BLAST hits to their status just before that annotation date. Barrels show the number of entries and BLAST hits that passed each filtering step, and the intensity of the red colour indicates the corresponding fractions. White boxes in the crossing area show the annotation (DE) and the annotation date (DT) for two test sequences. Red crosses indicate BLAST hits that were removed.
Kankainen et al. BMC Bioinformatics 2012 13:33 doi:10.1186/1471-2105-13-33
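The date-based removal of BLAST hits described in the caption can be illustrated with a minimal sketch. It is not the authors' implementation: the `BlastHit` class, the `filter_hits_by_date` function, the accessions and the dates are all hypothetical, and the sketch covers only the date filter, not the restoration of annotations to their earlier status.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BlastHit:
    """Hypothetical record for one BLAST hit (not a real library type)."""
    accession: str  # made-up identifier for illustration
    created: date   # creation date of the hit sequence


def filter_hits_by_date(hits, query_annotation_date):
    """Keep only hits created strictly before the query's annotation date.

    Hits created on or after that date are dropped, mirroring the
    'same as or later than' rule stated in the caption.
    """
    return [h for h in hits if h.created < query_annotation_date]


# Illustrative data only; these values do not come from the paper.
hits = [
    BlastHit("HIT_A", date(2004, 6, 1)),
    BlastHit("HIT_B", date(2006, 3, 15)),  # same day as the query annotation
    BlastHit("HIT_C", date(2007, 1, 1)),
]
kept = filter_hits_by_date(hits, date(2006, 3, 15))
print([h.accession for h in kept])  # ['HIT_A']
```

The strict comparison matters here: a hit created on the same day as the query annotation is treated as unavailable, so that function is inferred only from information that predates the annotation being tested.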