Systematic identification of stem-loop containing sequence families in bacterial genomes
- Equal contributors
1 CEINGE Biotecnologie Avanzate scarl, Via Comunale Margherita 482, 80145 Napoli, Italy
2 S.E.M.M. – European School of Molecular Medicine – Naples site, Italy
3 DBBM Dipartimento di Biochimica e Biotecnologie Mediche, Universita' di Napoli FEDERICO II, Via S. Pansini 5, 80131 Napoli, Italy
4 DBPCM Dipartimento di Biologia e Patologia Cellulare e Molecolare, Universita' di Napoli FEDERICO II. Via S. Pansini 5, 80131 Napoli, Italy
BMC Genomics 2008, 9:20 doi:10.1186/1471-2164-9-20
Published: 17 January 2008
Analysis of non-coding sequences in several bacterial genomes has led to the identification of families of repeated sequences able to fold into secondary structures. These sequences have often been claimed to be transcribed and to fulfill a functional role. A previous systematic analysis of a representative set of 40 bacterial genomes produced a large collection of sequences potentially able to fold into stem-loop structures (SLS). Computational analysis of these sequences was carried out by searching for families of repetitive nucleic acid elements sharing a common secondary structure.
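As a rough illustration of what "potentially able to fold as a stem-loop" can mean at the sequence level, the following minimal Python sketch scans a DNA string for perfect inverted repeats separated by a short loop. It is not the procedure used in this work: the function names (`reverse_complement`, `find_stem_loops`) and the parameters (`stem_len`, `min_loop`, `max_loop`) are hypothetical, and a realistic search would allow mismatches and G-U pairs and would score candidates by folding free energy.

```python
# Minimal sketch: find perfect inverted repeats (candidate stem-loops).
# All names and default parameters are illustrative assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_stem_loops(seq, stem_len=6, min_loop=3, max_loop=8):
    """Return (start, stem_len, loop_len) for each perfect stem-loop.

    A hit is a left arm of length stem_len at position start, a loop of
    min_loop..max_loop bases, and a right arm equal to the reverse
    complement of the left arm.
    """
    hits = []
    n = len(seq)
    for start in range(n - 2 * stem_len - min_loop + 1):
        left = seq[start:start + stem_len]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            right_start = start + stem_len + loop
            right = seq[right_start:right_start + stem_len]
            if len(right) < stem_len:
                break  # right arm would run past the end of the sequence
            if right == target:
                hits.append((start, stem_len, loop))
    return hits


# Toy example: 6-bp stem (GCGCAT / ATGCGC) around a 4-base loop.
example = "TT" + "GCGCAT" + "AAAA" + "ATGCGC" + "TT"
stem_loops = find_stem_loops(example)
```

In the toy sequence the only candidate starts at position 2, with a 6-bp stem and a 4-base loop.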
The initial clustering procedure identified clusters of similar sequences in 29 genomes, corresponding to about 1% of the whole population. Sequences selected in this way have a substantially higher propensity to fold into a stable secondary structure than the initial set. Removal of redundancies and regrouping of the selected sequences resulted in a final set of 92 families, defined by HMM analysis. Twenty-five of them include all the well-known SLS-containing repeats, together with others reported in the literature but not analyzed in detail. The remaining 67 families have not been previously described. Two thirds of the families share a common predicted secondary structure and are located within intergenic regions.
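The idea of grouping similar sequences into families can be sketched with a simple single-linkage clustering. The example below is a stand-in, not the clustering or HMM procedure used in this work: it scores similarity with the ratio from Python's standard `difflib.SequenceMatcher` rather than a sequence alignment, and the function names and the `threshold` value are assumptions chosen for illustration.

```python
# Minimal sketch: single-linkage clustering of sequences by similarity.
# The similarity measure and threshold are illustrative assumptions.
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score in [0, 1] for two strings."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_sequences(seqs, threshold=0.8):
    """Group sequence indices; pairs scoring >= threshold share a group."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input: two near-identical stem-loop sequences and one unrelated one.
toy = ["GCGCATAAAAATGCGC", "GCGCATAAATATGCGC", "TTTTTTTTTTTTTTTT"]
families = cluster_sequences(toy)
```

Here the first two sequences differ by a single base and fall into one group, while the third stays alone. The all-against-all comparison is quadratic in the number of sequences, so a genome-scale collection would need an indexed similarity search instead.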
Systematic analysis of 40 bacterial genomes revealed a large number of repeated sequence families, including known and novel ones. Their predicted structure and genomic location suggest that, even in compact bacterial genomes, a relatively large fraction of the genome consists of non-protein-coding sequences, possibly functioning at the RNA level.