An analysis of the Sargasso Sea resource and the consequences for database composition

Michael L Tress, Domenico Cozzetto, Anna Tramontano and Alfonso Valencia

BMC Bioinformatics 2006, 7:213 doi:10.1186/1471-2105-7-213

Good to see criticism of this type of data

Neil Saunders (20 April 2006), University of Queensland

It's good to see someone take a critical look at environmental sequence data. One thing the authors don't mention is the questionable validity of many protein sequences annotated as "hypothetical". The Sargasso sequences in GenBank do not appear to have been annotated for 23S rRNA genes, and many of the short hypothetical ORFs are in fact just translated 23S regions. You can see this for yourself: BLAST a 23S sequence (e.g. from E. coli) against the env_nt dataset, note the sequence coordinates of the 23S hit, then visit the GenBank entry for that hit (e.g. gi 44249358). In many cases the so-called hypothetical ORFs lie within a 23S rDNA gene.
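The coordinate comparison described above can be sketched in a few lines of Python. This is only an illustration, not a definitive pipeline: the function names, the 90% coverage threshold, and every coordinate in the example are hypothetical and are not taken from any GenBank entry. It assumes 1-based inclusive coordinates and that a minus-strand BLAST hit may be reported with its start greater than its end.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def normalise(start: int, end: int) -> Interval:
    """Return an Interval with start <= end, whatever order was reported."""
    return Interval(min(start, end), max(start, end))


def overlap_fraction(orf: Interval, hit: Interval) -> float:
    """Fraction of the ORF's length that is covered by the hit."""
    shared = min(orf.end, hit.end) - max(orf.start, hit.start) + 1
    return max(shared, 0) / (orf.end - orf.start + 1)


def orfs_inside_hit(orfs: Dict[str, Interval], hit: Interval,
                    min_fraction: float = 0.9) -> List[str]:
    """Names of ORFs mostly covered by the hit (threshold is an assumption)."""
    return [name for name, orf in orfs.items()
            if overlap_fraction(orf, hit) >= min_fraction]


# Made-up coordinates, for illustration only.
hit = normalise(1450, 120)  # a 23S hit reported on the minus strand
orfs = {
    "orf_a": Interval(200, 520),    # entirely inside the hit
    "orf_b": Interval(1400, 1900),  # overlaps only the edge of the hit
    "orf_c": Interval(2000, 2300),  # outside the hit
}
suspect = orfs_inside_hit(orfs, hit)
print(suspect)
```

With these invented values only `orf_a` is flagged. In practice the hit coordinates would come from the BLAST result and the ORF coordinates from the CDS features of the corresponding GenBank entry.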

Perhaps NCBI and the other databases should consider segregating environmental data from the bulk of the nr dataset, to avoid contaminating it with junk.

Competing interests

None declared
