A 1000 human genomes…and some mycoplasma too

Posted by Biome on 14th May 2014


The revolution in rapid, cost-effective, high-throughput sequencing technologies has set a new trend in large-scale biomedical research. With such vast amounts of data being produced, controlling basic sequence quality downstream presents several challenges, and sequence contamination is one key area of concern. Mycoplasma are among the most common contaminants of cell cultures. These minute bacteria lack cell walls and are particularly problematic because they are difficult to detect, even under a light microscope. But to what extent is contamination by mycoplasma corrupting downstream sequence databases? William Langdon, from the Department of Computer Science at University College London, UK, sought to address this question in his study in BioData Mining.

Langdon analysed the 1000 Genomes Project database – a large, highly respected, international study that aims to produce a detailed catalogue of human DNA variation and has made its data publicly available to researchers worldwide. By downloading and scanning a random sample of more than 50 billion DNA sequences from diverse data sources, and mapping them against published genomes, he found tens of thousands of sequences that may have come from mycoplasma contamination. While many matches were of low quality, NCBI BLAST searches confirmed that some high quality, low entropy sequences matched mycoplasma strains. Overall, these results suggested that at least seven percent of the public data provided by the 1000 Genomes Project may be contaminated with mycoplasma.
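For readers unfamiliar with the term, the entropy of a sequence is a rough measure of how varied its bases are. The sketch below is only an illustration, not the method used in the study: it computes the Shannon entropy of base composition, one common complexity measure, and the function name and example reads are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per base, of the base composition of seq.

    Illustrative only: the study may have measured entropy differently.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    total = len(seq)
    counts = Counter(seq)
    # Sum p * log2(1/p) over each distinct base observed in the read.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical reads, invented for this example.
reads = {
    "homopolymer": "AAAAAAAAAAAAAAAA",
    "two_bases": "ACACACACACACACAC",
    "mixed": "ACGTTGCAAGCTTCGA",
}

for name, read in reads.items():
    print(f"{name}: {shannon_entropy(read):.2f} bits per base")
```

A read made of a single repeated base scores zero, while a read with all four bases in equal proportion scores the maximum of two bits per base. Scores like this are often used alongside alignment quality when judging whether a database match is informative.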

The results probably come as little surprise to those who already know how troublesome mycoplasma can be in molecular biology laboratories. While it is a cause for concern, cross-species contamination in single-species databases such as the 1000 Genomes Project is relatively easy to screen for compared with contamination from other members of the same species. However, as ever-increasing amounts of genomic data become available in the public domain and in silico research in biology grows, Langdon's study highlights the need for further independent studies into sequence contamination of large databases.