Research article (Open Access)

Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters

Nathalie Pavy1, Charles Paule2, Lee Parsons2, John A Crow2, Marie-Josee Morency3, Janice Cooke1,5, James E Johnson2, Etienne Noumen1, Carine Guillet-Claude1, Yaron Butterfield4, Sarah Barber4, George Yang4, Jerry Liu4, Jeff Stott4, Robert Kirkpatrick4, Asim Siddiqui4, Robert Holt4, Marco Marra4, Armand Seguin3, Ernest Retzel2, Jean Bousquet1 and John MacKay1

1 ARBOREA and Canada Research Chair in Forest Genomics, Pavillon Charles-Eugène-Marchand, Université Laval, Ste. Foy, Québec G1K 7P4, Canada

2 Center for Computational Genomics and Bioinformatics, University of Minnesota, 420 Delaware St. S.E., MMC 43, Minneapolis, MN 55455, USA

3 Laurentian Forestry Center (Canadian Forestry Service), Natural Resources Canada, 1055 rue du PEPS, Québec, Québec, G1V 4C7, Canada

4 Genome Sciences Center, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC, V5Z 1L3, Canada

5 Department of Biological Sciences, University of Alberta, Edmonton, Alberta, T6G 2E9, Canada


BMC Genomics 2005, 6:144 doi:10.1186/1471-2164-6-144

Published: 19 October 2005

Abstract

Background

The sequencing and analysis of ESTs is currently the only practical approach for large-scale gene discovery and annotation in conifers, because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support the manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss).

Results

We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High-quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full-length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched the rice or Arabidopsis genomes. We used several sequence similarity search approaches to assign putative functions, including BLAST searches against general and specialized databases (transcription factors, cell wall-related proteins), Gene Ontology term assignment and Hidden Markov Model searches against Pfam protein families and domains. In total, 70% of the spruce transcripts matched proteins of known or unknown function in the UniRef100 database (blastx e-value < 1e-10). We identified multigene families that appeared larger in spruce than in the Arabidopsis or rice genomes. Detailed analysis of the translationally controlled tumour protein and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data based either on sequence occurrence in the cDNA libraries or on functional annotations.
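
The linking of consensus sequences through shared cDNA clones can be illustrated with a small sketch. The abstract does not describe the procedure actually used, so the following is only one plausible approach under stated assumptions: the input is a hypothetical list of (clone ID, consensus ID) pairs, one per read, and any two consensus sequences that contain reads from the same clone are merged into one transcript group using a union-find structure. The function name and the identifiers are illustrative and do not come from the ArboreaSet or SpruceDB.

```python
from collections import defaultdict


def link_consensus_by_clone(read_assignments):
    """Group consensus sequences that share at least one cDNA clone.

    read_assignments: iterable of (clone_id, consensus_id) pairs, one per
    read. This input format is an assumption made for illustration.
    Returns a list of sets of consensus IDs; each set is one putative
    transcript.
    """
    parent = {}

    def find(x):
        # Return the representative of x's group, with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the groups containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_consensus_for_clone = {}
    for clone_id, consensus_id in read_assignments:
        find(consensus_id)  # register the consensus even if it stays alone
        if clone_id in first_consensus_for_clone:
            # A second read of the same clone (e.g. the 3' read after the
            # 5' read) links its consensus to the one seen earlier.
            union(first_consensus_for_clone[clone_id], consensus_id)
        else:
            first_consensus_for_clone[clone_id] = consensus_id

    groups = defaultdict(set)
    for consensus_id in parent:
        groups[find(consensus_id)].add(consensus_id)
    # Sort for a deterministic order of the returned groups.
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical example: cloneA and cloneB chain C1, C2 and C3 together.
reads = [
    ("cloneA", "C1"),  # 5' read of cloneA
    ("cloneA", "C2"),  # 3' read of cloneA
    ("cloneB", "C2"),
    ("cloneB", "C3"),
    ("cloneC", "C4"),  # clone with a single read
]
transcripts = link_consensus_by_clone(reads)
```

In this toy input, C1, C2 and C3 end up in one group because cloneA and cloneB connect them, while C4 remains a singleton. Linking by transitive closure is a design choice of this sketch; a real pipeline might add checks, for example flagging groups that join many consensus sequences as possible chimeric clones.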

Conclusion

This report illustrates specific approaches for large-scale gene discovery and annotation in an organism that is very distantly related to any of the fully sequenced genomes. The ArboreaSet sequences and cDNA clones represent a valuable resource for investigations ranging from plant comparative genomics to applied conifer genetics.

