Open Access Highly Accessed Software

The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases

Richard G Côté, Philip Jones, Lennart Martens, Samuel Kerrien, Florian Reisinger, Quan Lin, Rasko Leinonen, Rolf Apweiler and Henning Hermjakob*

BMC Bioinformatics 2007, 8:401  doi:10.1186/1471-2105-8-401

Re: The Protein Identifier Cross-Reference (PICR) service: reconciling protein identifiers across multiple source databases

Eric Jain (2007-10-30 12:23), Swiss Institute of Bioinformatics

"Redundant databases may even assign multiple identifiers to the same sequence."

Keep in mind that some databases, such as UniProtKB/Swiss-Prot, are "redundant" on purpose, i.e. sequences are considered specific to organisms. At the same time, a single identifier may be used to describe several splice variants, etc. If the goal is to create a true "general purpose" mapping service, you'd have to allow people to map at the conceptual level as well as at the sequence level. PICR looks like it could be really useful for people who need to do database mapping using exact sequence matches, but it should not be assumed that that's what most people want to do!
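To make the distinction concrete, here is a minimal Python sketch with made-up data. The database names, identifiers and sequences are all hypothetical, and grouping by organism is only a stand-in for one aspect of what "conceptual level" mapping might mean; this is not how PICR or any other service actually works.

```python
from collections import defaultdict

# Hypothetical records: (database, identifier, organism, sequence).
RECORDS = [
    ("DB_A", "A001", "Homo sapiens", "MKTAYIAKQR"),
    ("DB_A", "A002", "Pan troglodytes", "MKTAYIAKQR"),
    ("DB_B", "B100", "Homo sapiens", "MKTAYIAKQR"),
    ("DB_B", "B101", "Homo sapiens", "MKTAYIAKQRQISFVK"),
]


def map_by_sequence(records):
    """Group identifiers whose sequences match exactly (case-insensitive)."""
    groups = defaultdict(list)
    for db, ident, _organism, seq in records:
        groups[seq.upper()].append((db, ident))
    return dict(groups)


def map_by_sequence_and_organism(records):
    """Group identifiers only when both sequence and organism match."""
    groups = defaultdict(list)
    for db, ident, organism, seq in records:
        groups[(seq.upper(), organism)].append((db, ident))
    return dict(groups)
```

With this toy data, a pure sequence match puts the human and chimpanzee entries in the same group, while keying on organism as well keeps them apart. Neither grouping links B101 to the shorter entries, even if it were a splice variant of the same gene.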

"Unified identifier schemes have been proposed in the past, such as Life Science Identifiers (LSID) and Sequence Globally Unique Identifiers (SEGUID), but their adoption remains limited."

Identifier schemes address issues such as how to avoid collisions and how to resolve (or not) identifiers, but they do not address the mapping issue! The LSID scheme, for example, has no mechanism to prevent several organizations from assigning different identifiers to the same sequence.
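For contrast, a content-derived identifier is computed from the sequence itself, so two organizations holding the same sequence arrive at the same value without coordinating. The sketch below follows the SEGUID definition as I understand it (base64-encoded SHA-1 digest of the upper-cased sequence, with trailing "=" padding removed); treat it as an illustration rather than a reference implementation.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a SEGUID-style checksum for a protein sequence.

    Assumption: SHA-1 over the upper-cased sequence, base64-encoded,
    with the trailing '=' padding stripped.
    """
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Note that this only settles identity at the exact-sequence level: two entries that differ by a single residue get unrelated checksums, so the mapping question between related entries remains open.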

"The ID Mapping service offered by Protein Information Resource (PIR) has limited functionality in that it can only map between two sources per request, meaning that if the user wishes to map proteins from SGD, IPI and Genbank to UniProt, three requests must be made"

PIR's mapping service does support mapping from multiple sources (though the mapping is always *to* a single source, and I'm not sure the web form supports this).

"Also, not all mappings are available. For example, it is possible to map from SGD to UniProt [..] but not from SGD to Genbank."

This is supported, but since the mapping is provided by UniProtKB (in collaboration with SGD), it may not be complete. Note, however, that a pure sequence-based mapping is likely to miss mappings as well, unless of course a pure sequence-based mapping really is what you want.

It may also be worth pointing out that while PICR lists 21 databases, PIR's mapping service supports more than 100! (See the interface at http://beta.uniprot.org/mapping/.)

One shortcoming of PIR's mapping services is performance, especially when mapping large sets of several thousand identifiers. Here it would be interesting to see some benchmark numbers!

"We are in communication with the NCBI to obtain daily up-to-date gi number to UniProtKB accession number mapping files, which will be incorporated into the UniParc data warehouse and made available via PICR."

Haven't GI numbers been in UniParc for a while now?

Competing interests

None declared
