cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome
1 Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
2 Department of Parasitology, Zhongshan School of Medicine, Sun Yat‑sen University, Guangzhou, Guangdong 510080, PRoC
3 Bioinformatics Unit, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland, USA
BMC Genomics 2007, 8:255 doi:10.1186/1471-2164-8-255Published: 27 July 2007
The completion of the Plasmodium falciparum genome represents a milestone in malaria research. The genome sequence allows for the development of genome-wide approaches such as microarray and proteomics that will greatly facilitate our understanding of the parasite biology and accelerate new drug and vaccine development. Designing and application of these genome-wide assays, however, requires accurate information on gene prediction and genome annotation. Unfortunately, the genes in the parasite genome databases were mostly identified using computer software that could make some erroneous predictions.
We aimed to obtain cDNA sequences to examine the accuracy of gene prediction in silico. We constructed cDNA libraries from mixed blood stages of P. falciparum parasite using the SMART cDNA library construction technique and generated 17332 high-quality expressed sequence tags (EST), including 2198 from primer-walking experiments. Assembly of our sequence tags produced 2548 contigs and 2671 singletons versus 5220 contigs and 5910 singletons when our EST were assembled with EST in public databases. Comparison of all the assembled EST/contigs with predicted CDS and genomic sequences in the PlasmoDB database identified 356 genes with predicted coding sequences fully covered by EST, including 85 genes (23.6%) with introns incorrectly predicted. Careful automatic software and manual alignments found an additional 308 genes that have introns different from those predicted, with 152 new introns discovered and 182 introns with sizes or locations different from those predicted. Alternative spliced and antisense transcripts were also detected. Matching cDNA to predicted genes also revealed silent chromosomal regions, mostly at subtelomere regions.
Our data indicated that approximately 24% of the genes in the current databases were predicted incorrectly, although some of these inaccuracies could represent alternatively spliced transcripts, and that more genes than currently predicted have one or more additional introns. It is therefore necessary to annotate the parasite genome with experimental data, although obtaining complete cDNA sequences from this parasite will be a formidable task due to the high AT nature of the genome. This study provides valuable information for genome annotation that will be critical for functional analyses.