Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics
* Corresponding author: John N Weinstein weinstein@dtpvx2.ncifcrf.gov
- Equal contributors
BMC Bioinformatics 2004, 5:80 doi:10.1186/1471-2105-5-80
- MS should pick this up
- And the lesson is...
- 19 probe sets in Affymetrix's human U133Plus2.0
- Good point.
- Special Interest group on spreadsheet risks
- Well spotted
- not only excel
And the lesson is...
Neil Saunders (2008-04-11 17:39) University of Queensland
And that's why bioinformaticians don't use Excel for this purpose. Or more generally, don't use spreadsheets as "databases".
Competing interests
None declared
19 probe sets in Affymetrix's human U133Plus2.0
Chao Lu (2004-07-28 14:55) Hospital for Sick Children, Toronto
A good point. Many people did not pay attention to this 'small' error.
Here is a list of 19 probe sets with errors in their gene symbol (June 23, 04 annotation, Affymetrix) when opened in Excel:
1570394_at ===> 1-Sep
200902_at ===> 15-Sep
208999_at ===> 8-Sep
209000_s_at ===> 8-Sep
212413_at ===> 6-Sep
212414_s_at ===> 6-Sep
212415_at ===> 6-Sep
212698_s_at ===> 10-Sep
213666_at ===> 6-Sep
214298_x_at ===> 6-Sep
214720_x_at ===> 10-Sep
220781_at ===> 1-Dec
221129_at ===> 2-Apr
223362_s_at ===> 3-Sep
225814_at ===> 1-Sep
226627_at ===> 8-Sep
227034_at ===> 10-Sep
227552_at ===> 1-Sep
233632_s_at ===> 1-Sep
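A check along the following lines could flag such entries before they spread into downstream analyses. This is a minimal sketch, not part of the original comment: it assumes the annotation has already been read into (probe set ID, gene symbol) pairs, and the function name `find_date_like_symbols` and the unaffected example row are hypothetical. It only detects date-like symbols; it cannot tell which gene symbol was originally there.

```python
import re

# Matches strings such as "1-Sep" or "15-Sep": a day number, a hyphen,
# and a three-letter English month abbreviation.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_date_like_symbols(rows):
    """Return the (probe_set_id, symbol) pairs whose symbol looks like a date."""
    return [
        (probe, symbol)
        for probe, symbol in rows
        if DATE_LIKE.match(symbol.strip())
    ]


# Three rows taken from the list above, plus one hypothetical unaffected row.
rows = [
    ("220781_at", "1-Dec"),
    ("221129_at", "2-Apr"),
    ("200902_at", "15-Sep"),
    ("example_at", "TP53"),  # hypothetical row, not from the list
]

for probe, symbol in find_date_like_symbols(rows):
    print(probe, symbol)
```

Running the check on every file that has passed through a spreadsheet is cheap, and any hit is a signal to go back to the original annotation file.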
Competing interests
None declared
Good point.
Carol Bult (2004-07-27 16:18) The Jackson Laboratory
The article raises a very good point. I've experienced similar behavior in Excel for other data types. I would add that it is always a good idea to carry a unique numeric database ID alongside gene names/symbols. Database accession IDs may be less likely to be munged by Excel (unless the IDs are alphanumeric!), and since they are usually unique and permanent, they can be used to restore and/or update lists of gene names/symbols (which change all of the time).
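The restore step suggested here might look like the sketch below. It is an illustration under stated assumptions rather than anyone's actual pipeline: the function name `restore_symbols`, the numeric IDs, and the symbols `GENE_B` and `GENE_C` are all hypothetical, and the reference mapping is assumed to come from a trusted copy of the database that never passed through a spreadsheet.

```python
def restore_symbols(records, id_to_symbol):
    """Replace each record's symbol with the one registered for its database ID.

    records: list of (database_id, symbol) pairs, possibly with mangled symbols.
    id_to_symbol: trusted mapping from database ID to the current symbol.
    Records whose ID is missing from the mapping are left unchanged.
    """
    return [(db_id, id_to_symbol.get(db_id, symbol)) for db_id, symbol in records]


# Hypothetical IDs and symbols, for illustration only.
reference = {1001: "DEC1", 1002: "GENE_B"}
from_spreadsheet = [(1001, "1-Dec"), (1002, "GENE_B"), (1003, "GENE_C")]

print(restore_symbols(from_spreadsheet, reference))
```

Because the lookup is keyed on the ID rather than the symbol, the same step also picks up symbols that have been renamed since the list was first compiled.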
Competing interests
No competing interests
Special Interest group on spreadsheet risks
Patrick O'Beirne (2004-07-26 18:36) Eusprig
The European Spreadsheet Risk Interest Group (EUSPRIG) discusses the prevention and detection of spreadsheet errors. You can read about the emergence of the discipline of Spreadsheet Engineering and other related information at our website, http://www.eusprig.org. We have just completed our fifth international conference and now have a corpus of approximately 100 peer-reviewed papers in our subject domain.
For more reports of spreadsheet errors, see our stories at http://www.eusprig.org/stories.htm.
We're not specifically a group to discuss Excel bugs and workarounds; the Excel-L list (http://peach.ease.lsoft.com/archives/excel-l.html) is a very busy source of information on these, as is, of course, the MS Knowledgebase.
We are very interested in hearing from users about how you mitigate spreadsheet risks, what good practices you adopt, and so on. We are working with the ECDL Foundation on a syllabus of good practice for end users.
Patrick O'Beirne, chair, Eusprig
Competing interests
none
Well spotted
Andrew Clegg (2004-07-21 17:21) Birkbeck
One to pin up on lab walls everywhere. I shudder to think how many pieces of work this might have affected.
Competing interests
None declared
not only excel
Heikki Lehvaslaiho (2004-06-30 15:18) European Bioinformatics Institute
I quickly tested a few common open source spreadsheet programs, OpenOffice.org Calc, Gnumeric and KSpread, for this automatic symbol mutation ability.
The following crude text table indicates whether the conversion happens by default in these programs. "date" means that DEC1-type strings get converted; "float" means that RIKEN identifiers of the type "2310009E13" get converted.
program     "date"    "float"
calc        yes       yes
gnumeric    no        yes
kspread     no        yes
Be careful out there!
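The "float" case arises because an identifier such as "2310009E13" is also valid scientific notation. The sketch below is not from the original comment: the helper name `looks_like_float` and its pattern are assumptions, and it covers only the digits-E-digits shape mentioned above. It shows one way to spot such identifiers before a file is opened in a spreadsheet.

```python
import re

# Digits, a single "E", then more digits: the shape of scientific notation.
FLOAT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def looks_like_float(identifier):
    """Return True if the identifier has the digits-E-digits shape of a number."""
    return bool(FLOAT_LIKE.match(identifier))


# The RIKEN-style identifier quoted above is flagged; a DEC1-type symbol is not.
print(looks_like_float("2310009E13"))
print(looks_like_float("DEC1"))

# Python's own float parser accepts the identifier too, which is the same
# ambiguity the spreadsheets resolve in favor of a number.
print(float("2310009E13") == 2310009 * 10**13)
```

Identifiers flagged this way are the ones to import explicitly as text, or to keep out of spreadsheets altogether.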
Competing interests
None declared
MS should pick this up
Richard Jackson (2011-05-12 10:09) Independent
I believe a large part of bioinformatics is about providing a conduit between experts in different fields, as well as novel discovery. Often, people have their own preferences for data manipulation packages, and frequently scientists with less technical expertise tend towards Excel. Moving data back and forth between individuals in such ways gives ample opportunity for errors like this to arise.
Hence, I think the situation is ubiquitous and serious enough to warrant intervention by Microsoft. I don't know if they've picked up on this article yet. Sadly, they don't seem to have anything in terms of a suggestion box on their website (I spent an hour looking!).
Competing interests
None declared