Open Access, Highly Accessed, Methodology article

Quantitative measures for the management and comparison of annotated genomes

Karen Eilbeck, Barry Moore, Carson Holt and Mark Yandell

Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah and School of Medicine, Salt Lake City, Utah, USA

BMC Bioinformatics 2009, 10:67 doi:10.1186/1471-2105-10-67

Published: 23 February 2009

Additional files

Additional file 1:

Release Dates, Gene and Transcript Counts. Columns: release name, release date, gene count, transcript count and number of genes annotated with multiple transcripts. Data are shown for each release analyzed in this study for H. sapiens, M. musculus, D. melanogaster, A. gambiae and C. elegans. The number in the 'Genes' column represents the number of records tagged as a gene in either the GenBank or GFF3 files for that organism and release. For the GFF3 files this is limited to protein-coding genes, as variability in early GFF3 formats precluded the inclusion of non-coding RNA genes. The number in parentheses is the number of genes used in our analyses. There are a variety of reasons for the differences between the raw gene count and the number of genes that we analyzed. In general, if there were annotations to support a valid gene model, or if we could infer one from the contents of the GFF3 or GenBank file, with at least one transcript and exon, then we analyzed the record. GenBank records for human and mouse annotate some, but not all, pseudogenes with transcripts, and we have included those genes with transcripts in our analyses. Finally, there are some records for which, due to incomplete or corrupt annotations, a valid gene model cannot be inferred; we have excluded them from our analyses. The numbers in the 'Transcripts' column represent a count of records in GenBank files that have a transcript_id tag and, in fly and worm GFF3 files, records that have a type field with mRNA. The values in the 'Alt. Spliced Genes' column represent the number of genes included in our analyses that had more than one transcript associated with them.
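
The GFF3 counting rules described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `count_gff3_features` and the toy records are hypothetical, and the sketch assumes standard nine-column GFF3 in which each mRNA record names its gene through a `Parent` attribute. It does not reproduce the additional filtering the authors applied (for example, excluding records without a valid gene model).

```python
from collections import Counter


def count_gff3_features(lines):
    """Return (genes, transcripts, multi_transcript_genes) for GFF3 lines.

    Hypothetical helper: counts records whose type column is 'gene' or
    'mRNA', and tallies mRNA records per Parent to find genes (parents)
    with more than one transcript.
    """
    gene_ids = set()
    transcripts_per_gene = Counter()
    transcript_count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line (e.g. embedded FASTA)
        feature_type = cols[2]
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if feature_type == "gene" and "ID" in attrs:
            gene_ids.add(attrs["ID"])
        elif feature_type == "mRNA":
            transcript_count += 1
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    transcripts_per_gene[parent] += 1
    multi = sum(1 for n in transcripts_per_gene.values() if n > 1)
    return len(gene_ids), transcript_count, multi


# Toy input (invented for illustration): two genes, one with two transcripts.
example = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\t.\tmRNA\t1\t100\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\t.\tmRNA\t1\t90\t.\t+\t.\tID=t2;Parent=g1",
    "chr1\t.\tgene\t200\t300\t.\t-\t.\tID=g2",
    "chr1\t.\tmRNA\t200\t300\t.\t-\t.\tID=t3;Parent=g2",
]
genes, transcripts, alt_spliced = count_gff3_features(example)
print(genes, transcripts, alt_spliced)
```

For GenBank files the analogous transcript count would tally records carrying a transcript_id tag; that case is not shown here.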

Format: PDF. Size: 55 KB.

This file can be viewed with: Adobe Acrobat Reader

Additional file 2:

Genes with annotations that may need review. Top ten problematic genes from the most recent release for each genome in our dataset. Genes were prioritized first on the basis of having SO-classifications indicative of problems, and second on Splice Complexity. These criteria identified only seven genes in D. melanogaster.

Format: PDF. Size: 317 KB.

Additional file 3:

Number of version pairs with assembly-induced coordinate changes. The number of genes for each release pair that were excluded from Annotation Edit Distance calculations due to sequence changes within the gene region.

Format: PDF. Size: 50 KB.


© 1999-2009 BioMed Central Ltd unless otherwise stated. Part of Springer Science+Business Media.