A thousand genomes and then some
29 Apr 2014
The 1000 Genomes Project was established as a way of cataloging human genetic variation, but it seems to have inadvertently cataloged something else as well: Mycoplasma genomes. This is according to a study published in the open access journal BioData Mining.
Raw data from the 1000 Genomes Project were analyzed by the study's single author, William Langdon, a computer scientist from University College London. He downloaded 50 billion DNA measurements from the 1000 Genomes Project and found that some of the data did not match human genomes. Further analysis, including BLAST searches, confirmed that these genomes were Mycoplasma and that they occur in at least 7% of the samples.
Mycoplasma contamination is a common problem in labs, usually caused by infection of the medium that cells are grown in. Once contamination has been detected in the data, it is quite easy to take it into account and clean up subsequent scans.
Stephan Beck, professor of medical genomics at University College London, says: "When scientists download these raw data from The 1000 Genomes website, or any similar project, they should be aware of the caveat that this data is exactly that - raw. It is not surprising that contamination was found, but this should act as a warning to the community that they need to be more vigilant in filtering out this contamination."
The 1000 Genomes Project was launched in 2008 as a way of cataloging the different variants that result in genetic diversity in humans. It hopes to achieve its goal by sequencing the genomes of more than 1000 people. These data are then uploaded to the Internet, where other researchers are free to use them. The data are used in one of two ways. Firstly, scientists check how common a variation is in a certain condition across a population instead of comparing to one human reference genome. Secondly, it is possible to check the variation within different ethnicities to study human population history, as well as what impact ethnic variation may have upon disease.
This unwanted appearance of Mycoplasma genomes could also be seen as an opportunity. Once these genomes are removed, they could be analyzed by specialists in the field of Mycoplasma research. This could in turn give a better understanding of how Mycoplasma acts in growth medium.
William Langdon says: "As scientists use publicly available datasets rather than collecting their own samples, there is a risk of people using data and taking it as gospel. Mycoplasma contamination is a common problem, but it's just a case of catching it and annotating data."
T: +44 (0)20 3192 2243
M: +44 (0)78 2598 4543
Notes to Editor
1. Mycoplasma contamination in the 1000 Genomes Project
William B Langdon
Article available at the journal website.
Please name the journal in any story you write. If you are writing for the web, please link to the article. All articles are available free of charge, according to BioMed Central's open access policy.
2. BioData Mining is an open access, peer-reviewed, online journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data.
3. BioMed Central is an STM (Science, Technology and Medicine) publisher which has pioneered the open access publishing model. All peer-reviewed research articles published by BioMed Central are made immediately and freely accessible online, and are licensed to allow redistribution and reuse. BioMed Central is part of Springer Science+Business Media, a leading global publisher in the STM sector.