Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities
1 Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
2 Department of Computer Science and Bioinformatics Research Center, University of North Caroline at Charlotte, Charlotte, NC, USA
BMC Bioinformatics 2008, 9:61 doi:10.1186/1471-2105-9-61Published: 28 January 2008
Identification of RNA homologs within genomic stretches is difficult when pairwise sequence identity is low or unalignable flanking residues are present. In both cases structure-sequence or profile/family-sequence alignment programs become difficult to apply because of unreliable RNA structures or family alignments. As such, local sequence-sequence alignment programs are frequently used instead. We have recently demonstrated that maximal expected accuracy alignments using partition function match probabilities (implemented in Probalign) are significantly better than contemporary methods on heterogeneous length protein sequence datasets, thus suggesting an affinity for local alignment.
We create a pairwise RNA-genome alignment benchmark from RFAM families with average pairwise sequence identity up to 60%. Each dataset contains a query RNA aligned to a target RNA (of the same family) embedded in a genomic sequence at least 5K nucleotides long. To simulate common conditions when exact ends of an ncRNA are unknown, each query RNA has 5' and 3' genomic flanks of size 50, 100, and 150 nucleotides. We subsequently compare the error of the Probalign program (adjusted for local alignment) to the commonly used local alignment programs HMMER, SSEARCH, and BLAST, and the popular ClustalW program with zero end-gap penalties. Parameters were optimized for each program on a small subset of the benchmark. Probalign has overall highest accuracies on the full benchmark. It leads by 10% accuracy over SSEARCH (the next best method) on 5 out of 22 families. On datasets restricted to maximum of 30% sequence identity, Probalign's overall median error is 71.2% vs. 83.4% for SSEARCH (P-value < 0.05). Furthermore, on these datasets Probalign leads SSEARCH by at least 10% on five families; SSEARCH leads Probalign by the same margin on two of the fourteen families. We also demonstrate that the Probalign mean posterior probability, compared to the normalized SSEARCH Z-score, is a better discriminator of alignment quality. All datasets and software are available online.
We demonstrate, for the first time, that partition function match probabilities used for expected accuracy alignment, as done in Probalign, provide statistically significant improvement over current approaches for identifying distantly related RNA sequences in larger genomic segments.