Evaluating AED as a metric for annotation quality control. Annotation Edit Distance (AED) measures how well an annotation agrees with overlapping aligned EST, mRNA-seq and protein homology data. AED values range from 0 to 1, with 0 denoting perfect agreement of the annotation with the aligned evidence, and 1 denoting no evidence support for the annotation. We evaluated the use of AED as a quality control metric by comparing MAKER2-produced AED scores for release 30 (2003) of the M. musculus genome to the AEDs for release 37.1 (2007). These data show how AED can be used to quantify improvements to the annotations between releases. (A) The Pfam domain content of M. musculus release 30 for genes found in each quartile of the MAKER2 AED distribution. Note that genes with low AEDs are highly enriched for Pfam domains. (B) The fraction of M. musculus genes from release 30 that were retained in or removed from the subsequent release 37.1, for each quartile of the MAKER2 AED distribution. These data show how AED mirrors the independent curation decisions made by the mouse research community between 2003 and 2007. (C) The cumulative AED distributions of M. musculus releases 30 and 37.1 demonstrate how AED quantifies improvements made between releases. The subset of genes with NM prefixes assigned by RefSeq (which indicates the highest level of annotation quality) is plotted separately to show that these independently identified 'gold-standard' gene annotations tend to have lower AED values than the genome as a whole.
Holt and Yandell BMC Bioinformatics 2011 12:491 doi:10.1186/1471-2105-12-491
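
The caption does not give the formula behind AED, so the following is only a minimal sketch of how such a score could be computed and summarized. It assumes the nucleotide-level definition AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases overlapped by the annotation and SP is the fraction of annotation bases overlapped by evidence. The function names (`covered_bases`, `aed`, `cumulative_fraction`) and the interval representation are hypothetical and are not part of MAKER2.

```python
def covered_bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation_exons, evidence_alignments):
    """Illustrative AED under the assumed definition 1 - (SN + SP) / 2.

    Both arguments are lists of (start, end) intervals on the same sequence.
    Returns 0.0 for perfect agreement and 1.0 when there is no evidence support.
    """
    ann = covered_bases(annotation_exons)
    ev = covered_bases(evidence_alignments)
    if not ann or not ev:
        return 1.0  # nothing to compare: treated here as no support
    shared = len(ann & ev)
    sensitivity = shared / len(ev)   # evidence bases recovered by the annotation
    specificity = shared / len(ann)  # annotation bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


def cumulative_fraction(aed_values, threshold):
    """Fraction of annotations with AED <= threshold (one point on a panel C style curve)."""
    if not aed_values:
        raise ValueError("aed_values must not be empty")
    return sum(1 for v in aed_values if v <= threshold) / len(aed_values)
```

Under these assumptions, an annotation spanning positions 1 to 100 with evidence aligned to positions 51 to 150 shares half of its bases with the evidence in both directions, so `aed([(1, 100)], [(51, 150)])` evaluates to 0.5. Expanding intervals into sets of positions keeps the sketch short; for whole-genome use, an interval-arithmetic approach would be more memory efficient.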