Open Access Highly Accessed Correspondence

Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics

Barry R Zeeberg, Joseph Riss, David W Kane, Kimberly J Bussey, Edward Uchio, W Marston Linehan, J Carl Barrett and John N Weinstein*

BMC Bioinformatics 2004, 5:80  doi:10.1186/1471-2105-5-80

MS should pick this up

Richard Jackson   (2011-05-12 10:09)  Independent email

I believe a large part of bioinformatics is about providing a conduit between experts in different fields, as well as novel discovery. Often, people have their own preferences for data manipulation packages, and scientists with less technical expertise frequently tend towards Excel. Moving data back and forth between individuals in this way gives ample opportunity for errors like this to arise.

Hence, I think the situation is ubiquitous and serious enough to warrant intervention by Microsoft. I don't know if they've picked up on this article yet. Sadly, they don't seem to have any kind of suggestion box on their website (I spent an hour looking!).

Competing interests

None declared

And the lesson is...

Neil Saunders   (2008-04-11 17:39)  University of Queensland

And that's why bioinformaticians don't use Excel for this purpose. Or more generally, don't use spreadsheets as "databases".

Competing interests

None declared

19 probe sets in Affymetrix's human U133Plus2.0

Chao Lu   (2004-07-28 14:55)  Hospital for Sick Children, Toronto

A good point. Many people did not pay attention to this 'small' error.

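Here is a list of 19 probe sets whose gene symbols are altered when the Affymetrix annotation file (June 23, 2004 release) is opened in Excel. Each entry shows the probe set ID followed by the date that Excel displays in place of the gene symbol: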

1570394_at ===> 1-Sep

200902_at ===> 15-Sep

208999_at ===> 8-Sep

209000_s_at ===> 8-Sep

212413_at ===> 6-Sep

212414_s_at ===> 6-Sep

212415_at ===> 6-Sep

212698_s_at ===> 10-Sep

213666_at ===> 6-Sep

214298_x_at ===> 6-Sep

214720_x_at ===> 10-Sep

220781_at ===> 1-Dec

221129_at ===> 2-Apr

223362_s_at ===> 3-Sep

225814_at ===> 1-Sep

226627_at ===> 8-Sep

227034_at ===> 10-Sep

227552_at ===> 1-Sep

233632_s_at ===> 1-Sep
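Damaged values of the kind listed above can be found after the fact with a simple pattern check. The following is a minimal sketch, assuming the annotation has already been loaded as (probe set ID, gene symbol) pairs; the function name `find_date_like` and the intact control row are hypothetical and not part of the Affymetrix file.

```python
import re

# Matches values such as "1-Sep" or "15-Sep":
# a day number, a hyphen, then a three-letter month abbreviation.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def find_date_like(rows):
    """Return the (probe_set_id, symbol) pairs whose symbol looks like a date."""
    return [(probe, symbol) for probe, symbol in rows if DATE_LIKE.match(symbol)]


rows = [
    ("220781_at", "1-Dec"),   # taken from the list above
    ("221129_at", "2-Apr"),   # taken from the list above
    ("0000000_at", "TP53"),   # hypothetical intact row, for contrast
]

suspect = find_date_like(rows)
print(suspect)
```

A check like this only flags values that have already been converted; it cannot recover the original symbol on its own.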

Competing interests

None declared

Good point.

Carol Bult   (2004-07-27 16:18)  The Jackson Laboratory

The article raises a very good point. I've experienced similar behavior in Excel for other data types. I would add that it is always a good idea to carry a unique numeric database ID along with gene names/symbols. Database accession IDs may be less likely to be munged by Excel (unless the IDs are alphanumeric!), and since they are usually unique and permanent, they can be used to restore and/or update lists of gene names/symbols (which change all of the time).
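The restore step described here can be sketched in a few lines. This is only an illustration under stated assumptions: the numeric IDs are hypothetical, the reference mapping stands in for whatever trusted database export is available, and the damaged symbols follow the DEC1 and SEPT2 conversions discussed for the article.

```python
def restore_symbols(rows, reference):
    """Replace each row's symbol with the trusted symbol for its database ID.

    rows: list of (database_id, symbol) pairs, possibly damaged by a spreadsheet.
    reference: dict mapping database_id -> trusted symbol.
    Rows whose ID is missing from the reference are returned unchanged.
    """
    return [(db_id, reference.get(db_id, symbol)) for db_id, symbol in rows]


# Hypothetical numeric IDs; only the symbols come from the discussion.
reference = {1001: "DEC1", 1002: "SEPT2"}
damaged = [(1001, "1-Dec"), (1002, "2-Sep"), (1003, "TP53")]

restored = restore_symbols(damaged, reference)
print(restored)
```

Because the lookup is keyed on the ID rather than the symbol, the same function also updates symbols that have simply been renamed in the reference.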

Competing interests

No competing interests

Special Interest group on spreadsheet risks

Patrick O'Beirne   (2004-07-26 18:36)  Eusprig email

The European Spreadsheet Risk Interest Group (EUSPRIG) discusses the prevention and detection of spreadsheet errors. You can read about the emergence of the discipline of Spreadsheet Engineering and other related information at our website, http://www.eusprig.org. We have just completed our fifth international conference and now have a corpus of approximately 100 peer-reviewed papers in our subject domain.

For more reports of spreadsheet errors, see our stories page: http://www.eusprig.org/stories.htm

We are not specifically a group for discussing Excel bugs and workarounds; the Excel-L list (http://peach.ease.lsoft.com/archives/excel-l.html) is a very busy source of information on these, as is, of course, the MS Knowledgebase.

We are very interested in hearing from users about how they mitigate spreadsheet risks, what good practices they adopt, and so on. We are working with the ECDL Foundation on a syllabus of good practice for end users.

Patrick O'Beirne, chair, Eusprig

Competing interests

none

Well spotted

Andrew Clegg   (2004-07-21 17:21)  Birkbeck

One to pin up on lab walls everywhere. I shudder to think how many pieces of work this might have affected.

Competing interests

None declared

Not only Excel

Heikki Lehvaslaiho   (2004-06-30 15:18)  European Bioinformatics Institute email

I quickly tested a few common open source spreadsheet programs (OpenOffice.org Calc, Gnumeric and KSpread) for this automatic symbol mutation ability.

The following crude text table indicates whether the conversion happens by default in these programs. "date" means that a DEC1-type string gets converted to a date; "float" means that RIKEN identifiers of the type "2310009E13" get converted to floating-point numbers.

Program     "date"   "float"
Calc        yes      yes
Gnumeric    no       yes
KSpread     no       yes
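The two conversion types in the table can be screened for before a file ever reaches a spreadsheet. The sketch below is a rough heuristic, not the actual parsing rules of Excel, Calc, Gnumeric or KSpread; the function name `conversion_risk` and both patterns are assumptions made for illustration.

```python
import re

# Heuristic: a month abbreviation followed by a day number, e.g. "DEC1".
MONTH_LIKE = re.compile(
    r"^(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEPT?|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)

# Heuristic: digits, the letter E, then digits, e.g. "2310009E13".
FLOAT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def conversion_risk(identifier):
    """Return "date", "float", or None for a single identifier string."""
    if MONTH_LIKE.match(identifier):
        return "date"
    if FLOAT_LIKE.match(identifier):
        return "float"
    return None


for name in ["DEC1", "2310009E13", "TP53"]:
    print(name, conversion_risk(name))

# The RIKEN-style identifier is also valid scientific notation,
# which is why a number parser accepts it without complaint.
print(float("2310009E13"))
```

Identifiers flagged this way can then be imported as text columns or kept out of spreadsheets altogether.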

Be careful out there!

Competing interests

None declared
