Evaluation of the chicken transcriptome by SAGE of B cells and the DT40 cell line
- Equal contributors
1 Institute of Molecular Radiobiology, GSF, Ingolstädter Landstr. 1, D-85764 Neuherberg, Germany
2 Laboratory of Systems Biology Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warszawa, Poland
3 Research Group of Biomedical Informatics, IMIM/Universidad Pompeu Fabra/Centre de Regulacio Genomica, E08003 Barcelona, Spain
4 Institute of Biomathematics, GSF, Ingolstädter Landstr. 1, D-85764 Neuherberg, Germany
5 Stowers Institute for Medical Research, 1000 E. 50th Street, Kansas City, MO 64110, USA
BMC Genomics 2004, 5:98 doi:10.1186/1471-2164-5-98
Published: 21 December 2004
Understanding whole-genome sequences in higher eukaryotes depends to a large degree on the reliable definition of transcription units, including exon/intron structures, translated open reading frames (ORFs) and flanking untranslated regions. The best currently available chicken transcript catalog is the Ensembl build, which is based on mapping a relatively small number of full-length cDNAs and ESTs to the genome, together with in silico gene predictions derived from the genome sequence.
We use Long Serial Analysis of Gene Expression (LongSAGE) in bursal lymphocytes and the DT40 cell line to verify the quality and completeness of the annotated transcripts. Of the more than 38,000 unique SAGE tags (unitags), 53.6% match full-length bursal cDNAs, the Ensembl transcript build or the genome sequence. The majority of the matching unitags show a single match to the genome but no match to the genome-derived Ensembl transcript build. Nevertheless, most of these tags map close to the 3' boundaries of annotated Ensembl transcripts.
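The tag classification described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes exact string matching, 21-bp tags that begin with the CATG anchoring site, transcripts given as sense-strand sequences, and a genome searched on both strands. All function names, category labels and sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases (upper-case A/C/G/T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_hits(tag, seq):
    """Count exact, possibly overlapping, occurrences of tag on both strands of seq."""
    def count(pattern):
        n = 0
        start = seq.find(pattern)
        while start != -1:
            n += 1
            start = seq.find(pattern, start + 1)
        return n

    rc = reverse_complement(tag)
    # A palindromic tag would otherwise be counted twice at the same position.
    return count(tag) + (count(rc) if rc != tag else 0)


def classify_unitag(tag, transcripts, genome):
    """Assign a unitag to one of four illustrative match categories."""
    if any(tag in t for t in transcripts):  # sense-strand match to a transcript
        return "transcript"
    hits = count_hits(tag, genome)
    if hits == 1:
        return "genome_single"
    if hits > 1:
        return "genome_multiple"
    return "no_match"


# Toy data (hypothetical sequences, not taken from the study).
TAG_A = "CATGAAACCCGGGTTTACGTA"
TAG_B = "CATGTTGACCTAGGATCCAAT"
TAG_C = "CATGGGGGGGGGGGGGGGGGG"
genome = "ACGTACGT" + TAG_A + "TTTTTT" + TAG_B + "ACACAC"
transcripts = ["GGCC" + TAG_A + "AAAA"]

# Tally how many unitags fall into each category.
summary = Counter(classify_unitag(t, transcripts, genome) for t in (TAG_A, TAG_B, TAG_C))
print(summary)
```

In this toy setting, tags in the `genome_single` category correspond to the class highlighted in the text: unitags with one genomic match but no match in the transcript set. A real analysis would also need to handle sequencing errors, repeats and the distance of each hit to the nearest annotated 3' end, none of which is modelled here.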
These results suggest that rather few genes are missing from the current Ensembl chicken transcript build, but that the 3' ends of many transcripts may not have been accurately predicted. The tags with no match in the transcript sequences can now be used to improve gene predictions, pinpoint the genomic location of entirely missed transcripts and optimize the accuracy of gene finder software.