Department of Bioengineering, University of California, Berkeley, CA 94720-1762, USA
Abstract
Background
Pairwise stochastic context-free grammars (Pair SCFGs) are powerful tools for evolutionary analysis of RNA, including simultaneous RNA sequence alignment and secondary structure prediction, but the associated algorithms are intensive in both CPU and memory usage. The same problem is faced by other RNA alignment-and-folding algorithms based on Sankoff's 1985 algorithm. It is therefore desirable to constrain such algorithms, by pre-processing the sequences and using this first pass to limit the range of structures and/or alignments that can be considered.
Results
We demonstrate how flexible classes of constraint can be imposed, greatly reducing the computational costs while maintaining a high quality of structural homology prediction. Any score-attributed context-free grammar (e.g. energy-based scoring schemes, or conditionally normalized Pair SCFGs) is amenable to this treatment. It is now possible to combine independent structural and alignment constraints of unprecedented general flexibility in Pair SCFG alignment algorithms. We outline several applications to the bioinformatics of RNA sequence and structure, including Waterman-Eggert N-best alignments and progressive multiple alignment. We evaluate the performance of the algorithm on test examples from the RFAM database.
Conclusion
A program, Stemloc, that implements these algorithms for efficient RNA sequence alignment and structure prediction is available under the GNU General Public License.
Background
As our acquaintance with RNA's diverse functional repertoire develops
Many programs for comparative analysis of RNA require the sequences to be prealigned
A powerful, general dynamic programming algorithm for simultaneously aligning and predicting the structure of multiple RNA sequences was developed by David Sankoff
The purpose of this paper is to report our progress on general pairwise constrained versions of Sankoff's algorithm (or, more precisely, constrained versions of some related dynamic programming algorithms for SCFGs). The overall aim is the simultaneous alignment and structure prediction of two RNA sequences,
Our system of constraints is quite general. Previous constrained versions of Sankoff-like algorithms, such as the programs DYNALIGN
The algorithms described here can reproduce nearly all such banding constraints and, further, can take advantage of more flexible sequence-tailored constraints. Specifically, the
The stemloc program also implements various familiar extensions to pairwise alignment, including local alignment
Results
To investigate the comparative resource usage of the various different kinds of constraint that can be applied using fold and alignment envelopes, stemloc was tested on 22 pairwise alignments taken from version 6.1 of RFAM
The EMBL accession numbers and coordinates of all sequences are listed in Table
The alignment envelope containing the
A parse tree for the grammar of Table 1. Each internal node is labeled with a nonterminal (Stem or Loop); additionally, the subsequences (
Parsing a pair of sequences (
Bifurcation rules allow a subsequencepair (
These fold envelopes (triangular grids) limit the maximum length of subsequences (black dots), while the alignment envelope (rectangular grid) limits the maximum deviation of cutpoints (short diagonal lines) from the main diagonal.
These fold envelopes (triangular grids) and alignment envelope (rectangular grid) limit the subsequences (black dots) and cutpoints (short diagonal lines) to those consistent with a given alignment and consensus secondary structure (shown). The alignment path is also shown on the alignment envelope as a solid black line, broken by cutpoints.
Fold envelope size is highly correlated with
Alignment envelope size is highly correlated with
Alignment sensitivity as a function of envelope size parameter
Alignment specificity as a function of envelope size parameter
Fold sensitivity as a function of envelope size parameter
Fold specificity as a function of envelope size parameter
Total running time of stemloc (including envelope generation phases) as a function of envelope size parameter
Peak memory usage of stemloc (i.e. the size of the principal CYK matrix) as a function of envelope size parameter
The unconstrained alignment envelope, with the fold envelopes containing the
The alignment envelope containing the 100 best primary sequence alignments, with the fold envelopes containing the
In the first two tests,
A range of different values for the parameter
These performance indicators are averaged over all 22 pairwise alignments and plotted for the three test regimes in Figure
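The four performance indicators (alignment and fold sensitivity and specificity) can be illustrated with a short sketch. It assumes the convention common in the RNA literature: sensitivity is the fraction of reference pairs that are recovered, and specificity is the fraction of predicted pairs that appear in the reference. For alignments the "pairs" are aligned residue pairs; for folds they are basepairs. The function name, the example coordinates and the handling of empty sets are illustrative assumptions, not taken from stemloc.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) for two collections of pairs.

    Assumed convention: sensitivity = TP / |reference|,
    specificity = TP / |predicted| (i.e. positive predictive value).
    Empty sets score 1.0 by convention (an arbitrary choice).
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    sens = true_positives / len(reference) if reference else 1.0
    spec = true_positives / len(predicted) if predicted else 1.0
    return sens, spec


# Hypothetical basepair sets, given as (5' position, 3' position).
ref_pairs = {(1, 20), (2, 19), (3, 18), (4, 17)}
pred_pairs = {(1, 20), (2, 19), (3, 18), (5, 16), (6, 15)}
sens, spec = sensitivity_specificity(pred_pairs, ref_pairs)
print(sens, spec)
```

Three of the four reference basepairs are recovered, and three of the five predictions are correct.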
Three main conclusions can be drawn from these data. First, allowing the search to consider more than a single alignment greatly improves structure prediction (the red curve). Second, constraining the alignment search while exhaustively scanning fold space (the red curve) outperforms constraining the fold search while exhaustively scanning alignment space (the green curve). Third, the hybrid strategy (the blue curve), which partially constrains both searches, approaches the alignment-constrained, fold-unconstrained strategy (the red curve) in performance, with a significant saving in CPU and memory resources. Memory is the limiting factor in pairwise RNA alignment, and the primary motivation for constraints. For example, without constraints, alignment of two 16S ribosomal subunits using the stemloc grammar would require approximately 500 terabytes of memory. (Using fold envelope constraints with structures fully specified, it can be done in under 5 gigabytes.)
Based on the results of these tests, the default envelope options for stemloc were chosen to be the 100-best alignment envelope and the 1000-best fold envelope. The performance of stemloc with these envelopes on each of the pairwise test alignments is given in Table
Discussion
The algorithms presented here include constrained versions of Pair SCFG dynamic programming algorithms that run in significantly reduced space and time. The primary advance over previous work is the simultaneous imposition of fold and alignment constraints, including alignment constraints that are more general than others previously described. These constraints lead to significant reductions in processor and memory requirements, which will increase the length of RNA sequences that can be analyzed on mainstream computer hardware.
These algorithms have been used to implement stemloc, a fast, efficient software tool for multiple RNA sequence alignment implementing numerous extra features such as local alignment, Waterman-Eggert
The results given here should be regarded as preliminary. For example, we have only tested the pairwise alignment functionality; full evaluation/optimisation of the multiple alignment algorithm remains. Rather than using the CYK algorithm, one could use the Inside-Outside algorithm with a decision-theoretic dynamic programming step to maximize expected performance
Conclusion
RNA sequence analysis has generated considerable interest over recent years, as many new roles for RNA in the cell have come to light. RNA genes and regulatory elements are components of many molecular systems and comparative genomics is a powerful way to probe this function, perhaps even more so for RNA than for protein (due to the "well-behaved" statistical correlations found in RNAs with conserved secondary structure). Furthermore, statistical modeling of RNA evolution continues to play a fundamental role in the phylogenetic classification of new forms of life.
These biological motives have driven a demand for RNA sequence analysis tools that are faster, slimmer and more scalable. It is hoped that the algorithms and approaches described here, together with development and analysis of RNA evolutionary models
Methods
We begin our description of the envelope method with an explanatory note regarding our decision to present these constraints in terms of SCFGs, rather than other scoring schemes such as those based solely on energies
The reason for our choice of SCFGs is simple: stochastic grammars are, in our opinion, the most theoretically well-developed of the scoring schemes used for RNA. They come with well-documented algorithms for sequence alignment, structure prediction, parameterization by supervised learning from various kinds of training data, and calculation of posterior probabilities
Despite these arguments, many people continue to find calories preferable to bits as a unit of score. For such readers, we note that the system of constraints described here is entirely applicable to the general score-attributed grammar. This includes energy-based and heuristic scoring schemes as well as (for example) grammars whose rule "probabilities" actually represent log-odds ratios, or which are conditionally normalized with respect to one sequence.
Notation
To implement SCFG dynamic programming algorithms efficiently for RNA, it is convenient to define a simplified (but universal) template for grammars, similar in principle to "Chomsky normal form"
A pairwise stochastic context-free grammar
Let
Terminations: rules of the form
Transitions: rules of the form
Bifurcations: rules of the form
Emissions: rules of the form
The particular RNA normal form described in this section is chosen for ease of presentation. The implementation in the dart library uses the slightly more restrictive form for Pair SCFGs defined in an earlier paper
For presentational purposes, we will generally omit all-gap columns from the pairwise alignment and the grammar. For example, an emission rule having the form
Table
A stochastic context-free grammar for generating pairwise alignments of RNA structures.
→
R
Stem
→
stemExtend (1 - stemGap) basepairSubstitution [

stemExtend (stemGap/2) basepairIndel [

stemExtend (stemGap/2) basepairIndel [

(1 - stemExtend)(1 - bifurcate) baseSubstitution [

Stem Stem
(1 - stemExtend) bifurcate
Loop
→
loopExtend (1 - loopGap) baseSubstitution [

loopExtend (loopGap/2) baseIndel [

loopExtend (loopGap/2) baseIndel [

1 - loopExtend
The parse tree and the sequence likelihood
The grammar
The
Dynamic programming algorithms for Pair SCFGs
The following section describes the constrained and unconstrained dynamic programming (DP) algorithms used for Pair SCFGs.
The Inside algorithm
The Inside algorithm
An asymptotically faster step involves summing contributions from matching emission rules of the form
(Recall that
The intermediate probabilities of the Inside algorithm can then be expressed as
Termination of the recursion is provided by matching end rules,
The sequence likelihood is obtained as
In pseudocode, the Inside algorithm is
• Inputs:
• For
• For
• For
• For
• For each nonterminal
• Set
• For
• For
• Calculate
• Calculate
• Return
The timelimiting step of the Inside algorithm (computing the
In RNA normal form, the emission rules (
Imposing constraints
The high time and memory cost of the Inside and related algorithms motivates the development of slimmer, faster versions. To begin with, we impose constraints that narrow the search space. For example, we might want to pre-parse the sequences individually (using a single-sequence SCFG, or other
We can combine these various strategies into a generalized constraint on basepairs, alignmentcolumns or both. We stipulate that
Here
If equality holds in all three cases, then we recover the unconstrained Inside algorithm.
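The effect of a fold envelope can be made concrete with a minimal sketch. For brevity it uses a deliberately simplified single-sequence grammar rather than a Pair SCFG; the toy grammar (S → x S y | x S | ε), its probabilities, and the names `inside` and `full_envelope` are illustrative assumptions and are not part of stemloc or dart. Subsequences outside the envelope are treated as having zero Inside probability, and the full envelope recovers the unconstrained likelihood.

```python
# Toy single-sequence SCFG (an assumption for illustration only):
#   S -> x S y   with probability P_PAIR * pair_prob(x, y)
#   S -> x S     with probability P_LEFT * base_prob(x)
#   S -> (end)   with probability P_END
P_PAIR, P_LEFT, P_END = 0.5, 0.25, 0.25
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def pair_prob(x, y):
    """Emission probability of a basepair (uniform over four canonical pairs)."""
    return 0.25 if (x, y) in CANONICAL else 0.0


def base_prob(x):
    """Emission probability of an unpaired base (uniform)."""
    return 0.25


def full_envelope(n):
    """All subsequences (start, end), end exclusive, of a length-n sequence."""
    return {(i, j) for i in range(n + 1) for j in range(i, n + 1)}


def inside(seq, envelope):
    """Inside likelihood of seq, summing only over subsequences in the envelope."""
    n = len(seq)
    table = {}
    for i in range(n, -1, -1):          # start coordinate, descending
        for j in range(i, n + 1):       # end coordinate, ascending
            if (i, j) not in envelope:
                continue                # excluded cells count as probability zero
            if i == j:
                table[i, j] = P_END
                continue
            total = P_LEFT * base_prob(seq[i]) * table.get((i + 1, j), 0.0)
            if j - i >= 2:
                total += (P_PAIR * pair_prob(seq[i], seq[j - 1])
                          * table.get((i + 1, j - 1), 0.0))
            table[i, j] = total
    return table.get((0, n), 0.0)


unconstrained = inside("AU", full_envelope(2))
constrained = inside("AU", full_envelope(2) - {(1, 1)})
print(unconstrained, constrained)
```

Removing the empty subsequence (1, 1) from the envelope eliminates the parse in which A and U are basepaired, so the constrained likelihood is smaller than the unconstrained one. Note that this sketch still loops over every coordinate pair and merely skips excluded cells, so it narrows the search space without saving time.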
Note that the coordinates (
There are (
As an alternative to the unconstrained Inside algorithm, we can partially initialize the envelopes to limit the maximum subsequence length and/or the maximum deviation of the alignment from the main diagonal (Figure
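A minimal sketch of such banded envelopes follows. The representation (sets of coordinate pairs with exclusive end coordinates), the function names, and the use of |i - k| as the measure of deviation from the main diagonal are assumptions made for illustration; stemloc's internal representation may differ.

```python
def banded_fold_envelope(seq_len, max_len):
    """Subsequences (start, end), end exclusive, of length at most max_len."""
    return {(i, j)
            for i in range(seq_len + 1)
            for j in range(i, min(seq_len, i + max_len) + 1)}


def banded_alignment_envelope(len_x, len_y, max_dev):
    """Cutpoints (i, k) lying within max_dev of the main diagonal i == k."""
    return {(i, k)
            for i in range(len_x + 1)
            for k in range(len_y + 1)
            if abs(i - k) <= max_dev}


fold_band = banded_fold_envelope(10, 3)
fold_full = banded_fold_envelope(10, 10)
align_band = banded_alignment_envelope(10, 10, 2)
align_full = banded_alignment_envelope(10, 10, 10)
print(len(fold_band), len(fold_full), len(align_band), len(align_full))
```

Setting the band width to the sequence length recovers the unconstrained envelopes. For sequences of unequal length, a band on |i - k| must be at least as wide as the length difference if the final cutpoint is to remain inside the envelope.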
Further possible constraints
The constraints given here allow the independent imposition of alignment or fold constraints. One can imagine further, even more general constraints. For example, one could exclude subsequencepairs (
Accelerating the iteration
Simply setting some intermediate probabilities to zero is not sufficient to accelerate the Inside algorithm. We also need to redesign the iteration to avoid visiting zero-probability subsequence-pairs (
• Inputs:
• For
• For each
• For each
• For each
• If (
• For each nonterminal
• Set
• For each
• For each
• If (
• Calculate
• Calculate
• Return
(†) These ordered subsets of
Alternative designs for the algorithm are possible, and indeed different circumstances may affect the choice of optimal design (e.g. depending on which envelopes are most constrained).
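One way to visit only the subsequences that are actually in a fold envelope, while still guaranteeing that every subsequence is processed after all the subsequences it contains, is to sort the envelope by descending start coordinate and then ascending end coordinate. The sketch below illustrates this ordering for a single fold envelope; the function names and the example envelope are assumptions for illustration and do not reproduce the dart iterator.

```python
def inside_order(fold_envelope):
    """Order subsequences (start, end): start descending, then end ascending."""
    return sorted(fold_envelope, key=lambda s: (-s[0], s[1]))


def strictly_contains(outer, inner):
    """True if inner lies within outer and the two differ."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


# A sparse, hypothetical fold envelope for a sequence of length 4.
envelope = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4),
            (1, 2), (2, 3), (1, 3), (0, 4)}
order = inside_order(envelope)
for subseq in order:
    print(subseq)
```

The cost of the iteration is then proportional to the size of the envelope rather than to the square of the sequence length.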
Slimming the container
Memory is the most prohibitively expensive resource demand of the Inside algorithm. In its simplest form, the algorithm stores the intermediate probabilities
This design decision involves a close tradeoff between CPU and memory usage. Initially, we tested various combinations of generic containers with
For fold envelope
Our DP matrix then uses an inner two-dimensional array nested inside an outer two-dimensional array.
The outer array has dimensions (
This particular configuration is efficient when the alignment envelope is densely populated and the fold envelopes are sparsely populated. As with the redesigned iterator, there may be alternative designs that are resource-optimal under various different circumstances, depending on the nature of the envelopes.
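The following sketch shows one nested layout that is consistent with this description. It is an assumption, not a transcription of the dart container: the outer array is taken to be indexed densely by the start coordinates of the two subsequences, and each inner array is indexed by the rank of the end coordinate among the envelope subsequences sharing that start. A single number is stored per cell, whereas a real implementation would store one value per nonterminal.

```python
class NestedMatrix:
    """Dense outer array over start coordinates, sparse inner arrays over ends."""

    def __init__(self, len_x, len_y, env_x, env_y):
        self.rank_x = self._ranks(len_x, env_x)
        self.rank_y = self._ranks(len_y, env_y)
        # cells[i][k] is an inner 2-D array with one row per envelope
        # subsequence of X starting at i and one column per envelope
        # subsequence of Y starting at k.
        self.cells = [[[[0.0] * len(ry) for _ in rx] for ry in self.rank_y]
                      for rx in self.rank_x]

    @staticmethod
    def _ranks(length, envelope):
        """For each start, map each end coordinate in the envelope to a rank."""
        ranks = [dict() for _ in range(length + 1)]
        for start, end in sorted(envelope):
            ranks[start][end] = len(ranks[start])
        return ranks

    def get(self, i, j, k, l):
        """Value for subsequence-pair ((i, j), (k, l)); zero outside the envelopes."""
        row, col = self.rank_x[i].get(j), self.rank_y[k].get(l)
        if row is None or col is None:
            return 0.0
        return self.cells[i][k][row][col]

    def set(self, i, j, k, l, value):
        """Store a value; raises KeyError for pairs outside the envelopes."""
        self.cells[i][k][self.rank_x[i][j]][self.rank_y[k][l]] = value

    def size(self):
        """Number of allocated cells."""
        return sum(len(row) for outer in self.cells for inner in outer
                   for row in inner)


env_x = {(0, 0), (0, 2), (1, 1), (1, 2), (2, 2)}   # hypothetical envelope, L = 2
env_y = {(0, 0), (0, 1), (1, 1)}                   # hypothetical envelope, M = 1
matrix = NestedMatrix(2, 1, env_x, env_y)
matrix.set(0, 2, 0, 1, 0.5)
print(matrix.size(), matrix.get(0, 2, 0, 1), matrix.get(0, 1, 0, 1))
```

In this layout the number of allocated cells equals the product of the two fold envelope sizes, so sparse fold envelopes translate directly into memory savings, while lookup remains a pair of rank lookups followed by array indexing.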
The CYK algorithm
The Cocke-Younger-Kasami (CYK) dynamic programming algorithm
The Outside and KYC algorithms
The Outside and KYC algorithms widen the applications of probabilistic inference with SCFGs. The Outside algorithm, together with the Inside, can be used to recover posterior probabilities of given basepairs/columns, which can be used as alignment reliability indicators or as update counts in Expectation Maximization parameter training
These algorithms use dynamic programming recursions that are related to Inside and CYK. The Outside algorithm calculates intermediate probabilities of the form
representing the sumoverprobabilities of all partial parse trees rooted at
As CYK is related to Inside, the KYC algorithm is related to the Outside algorithm: the intermediate probabilities
As with the Inside algorithm, we sum contributions to
Next we consider emission rules,
The intermediate Outside probabilities are thus
Note that the Inside probabilities
In terms of the underlying iteration, the key difference between the Inside and Outside algorithms is as follows. Suppose subsequencepair INNER = (
• Inputs:
• Initialize
• For
• For each
• For each
• For each
• If (
• For each nonterminal
• Set
• For each
• For each
• If (
• Calculate
• For each
• For each
• If (
• Calculate
• Calculate
(†) These ordered subsets of
The reduced-space dynamic programming matrix that was developed above for the constrained Inside algorithm can be reused for the constrained Outside algorithm.
Implementation
The above-described algorithms were implemented in the C++ dart library. One dart program in particular, stemloc, is an efficient general-purpose RNA multiple-sequence alignment program that can be flexibly controlled by the user from the Unix command line, including re-estimation of parameters from training data as well as a broad range of alignment functions.
The dart libraries provide Inside, Outside, CYK, KYC, traceback and training algorithms for any pairwise SCFG in RNA normal form, whose rule probabilities can be expressed as algebraic expressions of some set of probability parameters (with associated normalization constraints). The operator-overloading features of C++ are utilized in full, so that the syntax of initializing a grammar object involves very few function calls and is essentially declarative.
dart source code releases can be downloaded under the terms of the GNU General Public License, from the following URL (which also gives access to the latest development code in the CVS repository)
The grammars and algorithms described in this paper specifically refer to release 0.2 of the dart package, dated October 2004 (although the algorithms are also implemented in release 0.1, dated October 2003).
Selecting appropriate fold and alignment envelopes
This section offers a non-exhaustive list of possible strategies for choosing appropriate
• Choose some appropriately simplified grammar, such as a
• As above, choose some appropriate
• As above, choose some appropriate
The latter two strategies have been implemented in the stemloc package described below. Empirically, the stochastic strategy appears to be less reliable than the deterministic strategies (although in theory the stochastic strategy will eventually find the globally optimal alignment given sufficiently many random repetitions, which may be a useful property).
Multiple sequence alignment
A heuristic algorithm for performing multiple alignment-and-folding of RNA sequences with a pairwise SCFG by progressive single-linkage clustering runs as follows
• Start by making pairwise alignments (with predicted secondary structures) for all pairs of input sequences.
• Mark the highest-scoring pair, and extract the two marked sequences
• While some sequences remain unmarked:
  • For each newly-marked sequence:
    • Align the marked sequence,
  • Select the highest-scoring of the pairwise (marked-to-unmarked) alignments. Use this alignment to merge the unmarked sequence into the seed alignment, and mark this sequence as newly aligned.
• Return the seed alignment.
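The order in which sequences are merged by the steps above can be sketched as follows. The `score` callback stands in for the score of a pairwise alignment and is a hypothetical placeholder, as are the function name and the example scores; the sketch records only which pair drives each merge and does not construct the alignment itself. For simplicity it looks up every marked-to-unmarked score in each round, whereas the procedure above only needs to compute new alignments for newly-marked sequences.

```python
def progressive_order(names, score):
    """Return the list of (marked, unmarked) pairs in the order they are merged."""
    pairs = [(a, b) for idx, a in enumerate(names) for b in names[idx + 1:]]
    seed = max(pairs, key=lambda p: score(*p))      # highest-scoring pair
    marked = list(seed)
    unmarked = [n for n in names if n not in marked]
    merges = [seed]
    while unmarked:
        # Best alignment between any marked and any unmarked sequence.
        best = max(((m, u) for m in marked for u in unmarked),
                   key=lambda p: score(*p))
        merges.append(best)
        marked.append(best[1])
        unmarked.remove(best[1])
    return merges


# Hypothetical symmetric pairwise alignment scores.
scores = {frozenset("ab"): 10, frozenset("ac"): 4, frozenset("ad"): 1,
          frozenset("bc"): 7, frozenset("bd"): 2, frozenset("cd"): 9}
merges = progressive_order(["a", "b", "c", "d"],
                           lambda x, y: scores[frozenset((x, y))])
print(merges)
```

With these scores, the seed pair is (a, b); c is then attached through its alignment to b, and d through its alignment to c.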
The above algorithms have been implemented in stemloc. The multiple alignments produced by this algorithm lack well-defined probabilistic scores unless the pair SCFG is conditionally normalized. It is also straightforward to retrieve the
A grammar for pairwise RNA alignment and structure prediction
After some empirical experimentation, we developed the grammar of Tables
The stemloc grammar, part 1 of 3: stem and loop structures.
Rule | Probability
Start → Stem | startInStem
Start → LBulge | (1 - startInStem) postStem[2] / (2
Start → RBulge | (1 - startInStem) postStem[2] / (2
Start → LRBulge | (1 - startInStem) postStem[3] / (
Start → Multi | (1 - startInStem) postStem[4] / (
Stem → ^{xy}StemMatch | 1 - stemGapOpen
Stem → ^{y}StemIns | stemGapOpen/2
Stem → ^{x}StemDel | stemGapOpen/2
StemMatch → ^{xy}StemMatch | (1 - stemGapOpen) stemExtend
StemMatch → ^{y}StemIns | stemGapOpen/2
StemMatch → ^{x}StemDel | stemGapOpen/2
StemMatch → StemExit | (1 - stemGapOpen)(1 - stemExtend)
StemIns → ^{xy}StemMatch | (1 - stemGapExtend)(1 - stemGapSwap) stemExtend
StemIns → ^{y}StemIns | stemGapExtend
StemIns → ^{x}StemDel | (1 - stemGapExtend) stemGapSwap
StemIns → StemExit | (1 - stemGapExtend)(1 - stemGapSwap)(1 - stemExtend)
StemDel → ^{xy}StemMatch | (1 - stemGapExtend)(1 - stemGapSwap) stemExtend
StemDel → ^{x}StemDel | stemGapExtend
StemDel → ^{y}StemIns | (1 - stemGapExtend) stemGapSwap
StemDel → StemExit | (1 - stemGapExtend)(1 - stemGapSwap)(1 - stemExtend)
StemExit → Loop | postStem[1]
StemExit → LBulge | postStem[2]/2
StemExit → RBulge | postStem[2]/2
StemExit → LRBulge | postStem[3]
StemExit → Multi | postStem[4]
Multi → LMulti RMulti | 1
LMulti → LBulge | multiBulgeOpen
LMulti → Stem | (1 - multiBulgeOpen)
RMulti → Multi | multiExtend
RMulti → Stem | (1 - multiExtend)(1 - multiBulgeOpen)^{2}
RMulti → LBulge | (1 - multiExtend)(1 - multiBulgeOpen) multiBulgeOpen
RMulti → RBulge | (1 - multiExtend)(1 - multiBulgeOpen) multiBulgeOpen
RMulti → LRBulge | (1 - multiExtend) multiBulgeOpen^{2}
Loop → ^{xy}LoopMatch | (1 - loopGapOpen)
Loop → ^{y}LoopIns | loopGapOpen/2
Loop → ^{x}LoopDel | loopGapOpen/2
LoopMatch → ^{xy}LoopMatch | (1 - loopGapOpen) loopExtend
LoopMatch → ^{y}LoopIns | loopGapOpen/2
LoopMatch → ^{x}LoopDel | loopGapOpen/2
LoopMatch → | (1 - loopGapOpen)(1 - loopExtend)
LoopIns → ^{xy}LoopMatch | (1 - loopGapExtend)(1 - loopGapSwap) loopExtend
LoopIns → ^{y}LoopIns | loopGapExtend
LoopIns → ^{x}LoopDel | (1 - loopGapExtend) loopGapSwap
LoopIns → | (1 - loopGapExtend)(1 - loopGapSwap)(1 - loopExtend)
LoopDel → ^{xy}LoopMatch | (1 - loopGapExtend)(1 - loopGapSwap) loopExtend
LoopDel → ^{x}LoopDel | loopGapExtend
LoopDel → ^{y}LoopIns | (1 - loopGapExtend) loopGapSwap
LoopDel → | (1 - loopGapExtend)(1 - loopGapSwap)(1 - loopExtend)
The stemloc grammar, part 2 of 3: bulges.
→
R
LBulge
→
^{xy}LBulgeMatch
(1  loopGapOpen)

^{y}LBulgeIns
loopGapOpen/2

^{x}LBulgeDel
loopGapOpen/2
LBulgeMatch
→
^{xy}LBulgeMatch
(1  loopGapOpen) loopExtend

^{y}LBulgeIns
loopGapOpen/2

^{x}LBulgeDel
loopGapOpen/2

Stem
(1  loopGapOpen)(1  loopExtend)
LBulgeIns
→
^{xy}LBulgeMatch
(1  loopGapExtend)(1  loopGapSwap) loopExtend

^{y}LBulgeIns
loopGapExtend

^{x}LBulgeDel
(1  loopGapExtend) loopGapSwap

Stem
(1  loopGapExtend)(1  loopGapSwap) (1  loopExtend)
LBulgeDel
→
^{xy}LBulgeMatch
(1  loopGapExtend)(1  loopGapSwap) loopExtend

^{x}LBulgeDel
loopGapExtend

^{y}LBulgeIns
(1  loopGapExtend) loopGapSwap

Stem
(1  loopGapExtend)(1  loopGapSwap) (1  loopExtend)
RBulge
→
^{xy}RBulgeMatch
(1  loopGapOpen)

^{y}RBulgeIns
loopGapOpen/2

^{x}RBulgeDel
loopGapOpen/2
RBulgeMatch
→
^{xy}RBulgeMatch
(1  loopGapOpen) loopExtend

^{y}RBulgeIns
loopGapOpen/2

^{x}RBulgeDel
loopGapOpen/2

Stem
(1  loopGapOpen) (1  loopExtend)
RBulgeIns
→
^{xy}RBulgeMatch
(1  loopGapExtend)(1  loopGapSwap) loopExtend

^{y}RBulgeIns
loopGapExtend

^{x}RBulgeDel
(1  loopGapExtend) loopGapSwap

Stem
(1  loopGapExtend)(1  loopGapSwap) (1  loopExtend)
RBulgeDel
→
^{xy}RBulgeMatch
(1  loopGapExtend)(1  loopGapSwap) loopExtend

^{x}RBulgeDel
loopGapExtend

^{y}RBulgeIns
(1  loopGapExtend) loopGapSwap

Stem
(1  loopGapExtend)(1  loopGapSwap) (1  loopExtend)
LRBulge
→
^{xy}LRBulgeMatch
(1  loopGapOpen)

^{y}LRBulgeIns
loopGapOpen/2

^{x}LRBulgeDel
loopGapOpen/2
LRBulgeMatch
→
^{xy}LRBulgeMatch
(1  loopGapOpen) loopExtend

^{y}LRBulgeIns
loopGapOpen/2

^{x}LRBulgeDel
loopGapOpen/2

RBulge
(1  loopGapOpen) (1  loopExtend)
LRBulgeIns
→
^{xy}LRBulgeMatch
(1  loopGapExtend)(1  loopGapSwap) loopExtend

^{y}LRBulgeIns
loopGapExtend

^{x}LRBulgeDel
(1  loopGapExtend) loopGapSwap

RBulge
(1  loopGapExtend)(1  loopGapSwap) (1  loopExtend)
LRBulgeDel
→
^{xy}LRBulgeMatch
(1  loopGapExtend)(1  loopGapSwap) loopExtend

^{x}LRBulgeDel
loopGapExtend

^{y}LRBulgeIns
(1  loopGapExtend) loopGapSwap

RBulge
(1  loopGapExtend)(1  loopGapSwap) (1  loopExtend)
The stemloc grammar, part 3 of 3: emissions.
→
R
^{xy}LoopMatch
→
baseSubstitution [
^{y}LoopIns
→
baseIndel [
^{x}LoopDel
→
baseIndel [
^{xy}LBulgeMatch
→
baseSubstitution [
^{y}LBulgeIns
→
baseIndel [
^{x}LBulgeDel
→
baseIndel [
^{xy}RBulgeMatch
→
baseSubstitution [
^{y}RBulgeIns
→
baseIndel [
^{x}RBulgeDel
→
baseIndel [
^{xy}LRBulgeMatch
→
baseSubstitution [
^{y}LRBulgeIns
→
baseIndel [
^{x}LRBulgeDel
→
baseIndel [
^{xy}StemMatch
→
basepairSubstitution [
^{y}StemIns
→
basepairIndel [
^{x}StemDel
→
basepairIndel [
The subset of RFAM used to test the constrained SCFG algorithms.
RFAM family | Sequence 1 (EMBL ID / startpoint-endpoint) | Sequence 2 (EMBL ID / startpoint-endpoint) | Alignment sens. | Alignment spec. | Basepair sens. | Basepair spec.
S15 | AE004150.1/71237243 | AE004888.1/27852659 | 0.65 | 0.752 | 0.462 | 0.353
S15 | AE005545.1/37973683 | AE004888.1/27852659 | 0.652 | 0.701 | 0.615 | 0.4
U3 | U27297.1/2180 | AF277396.1/3126 | 0.252 | 0.248 | 0.0833 | 0.087
glmS | AL935254.1/9444994600 | AE010557.1/24169 | 0.603 | 0.599 | 0.667 | 0.595
glmS | AE010557.1/24169 | AE013165.1/26162459 | 0.532 | 0.587 | 0.545 | 0.667
glmS | AL596166.1/5073450929 | AE013165.1/26162459 | 0.873 | 0.873 | 0.757 | 0.7
glmS | AC078934.3/3262132405 | AE010557.1/24169 | 0.869 | 0.863 | 0.756 | 0.689
glmS | AL935254.1/9444994600 | AE013165.1/26162459 | 0.715 | 0.715 | 0.811 | 0.769
Purine | AE007775.1/35583459 | AL591981.1/205922205823 | 0.869 | 0.869 | 0.773 | 0.81
Purine | AL591981.1/205922205823 | AP004595.1/160373160472 | 0.838 | 0.838 | 0.591 | 0.5
Purine | AE007775.1/35583459 | AE010606.1/46804581 | 0.67 | 0.699 | 0.636 | 0.875
Purine | AP003194.2/163700163601 | AE016809.1/202496202595 | 0.84 | 0.866 | 0.857 | 0.75
U5 | M16510.1/245451 | AF095839.1/890777 | 0.584 | 0.579 | 0.667 | 0.8
U5 | X63789.1/22362349 | AF095839.1/890777 | 0.716 | 0.69 | 0.8 | 0.8
IRE | AY112742.1/1241 | S57280.1/391417 | 0.667 | 0.667 | 0.6 | 0.75
IRE | AF266195.1/1443 | X01060.1/39503976 | 0.963 | 0.963 | 0.9 | 0.9
IRE | S57280.1/391417 | X13753.1/14341460 | 1 | 1 | 0.6 | 0.6
IRE | AY112742.1/1241 | X13753.1/14341460 | 0.778 | 0.778 | 0.8 | 0.727
IRE | AF266195.1/1443 | AF171078.1/14161442 | 0.963 | 0.963 | 0.7 | 0.7
IRE | AF171078.1/14161442 | X01060.1/39503976 | 1 | 1 | 0.7 | 0.875
6S | Y00334.1/77254 | AL627277.1/108623108805 | 0.869 | 0.869 | 0.811 | 0.754
6S | AE004317.1/56265807 | AL627277.1/108623108805 | 0.777 | 0.777 | 0.736 | 0.709
The starting nonterminal is Start. The nonterminals representing higher-level units of RNA structure are Loop, Stem, LBulge, RBulge and LRBulge. Each of these has associated Match, Ins and Del states (e.g. StemMatch, StemIns and StemDel) and each of these states has an associated emission state, prefixed with
Tables
To summarize, the grammar models homologous stems, loops, multiloops and bulges in pairwise RNA alignments, with covariant substitution scores and affine gap penalties (geometric indel length distributions). It has the property that any combined alignment and structure prediction for two RNA sequences has a single, unambiguous parse tree. In our investigations, this unambiguity appeared to improve the accuracy of alignment and structure prediction substantially; see also writings on this topic by Giegerich
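As a consistency check on a grammar of this kind, the probabilities of all rules sharing a left-hand nonterminal should sum to one. The sketch below verifies this numerically for three of the nonterminals of the stemloc grammar (StemMatch, StemIns and RMulti), using the rule probabilities as tabulated. The parameter values are arbitrary numbers chosen for illustration and are not the trained stemloc parameters.

```python
import math

# Arbitrary illustrative parameter values (NOT the trained stemloc values).
params = {
    "stemGapOpen": 0.1, "stemExtend": 0.8,
    "stemGapExtend": 0.3, "stemGapSwap": 0.05,
    "multiExtend": 0.4, "multiBulgeOpen": 0.2,
}


def rule_probabilities(p):
    """Rule probabilities for selected nonterminals, in table order."""
    g_open, s_ext = p["stemGapOpen"], p["stemExtend"]
    g_ext, g_swap = p["stemGapExtend"], p["stemGapSwap"]
    m_ext, m_bulge = p["multiExtend"], p["multiBulgeOpen"]
    return {
        "StemMatch": [(1 - g_open) * s_ext, g_open / 2, g_open / 2,
                      (1 - g_open) * (1 - s_ext)],
        "StemIns": [(1 - g_ext) * (1 - g_swap) * s_ext, g_ext,
                    (1 - g_ext) * g_swap,
                    (1 - g_ext) * (1 - g_swap) * (1 - s_ext)],
        "RMulti": [m_ext, (1 - m_ext) * (1 - m_bulge) ** 2,
                   (1 - m_ext) * (1 - m_bulge) * m_bulge,
                   (1 - m_ext) * (1 - m_bulge) * m_bulge,
                   (1 - m_ext) * m_bulge ** 2],
    }


totals = {lhs: sum(probs) for lhs, probs in rule_probabilities(params).items()}
for lhs, total in totals.items():
    print(lhs, total, math.isclose(total, 1.0))
```

The same check holds algebraically for any parameter values between zero and one, since each set of probabilities is a product of complementary terms.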
The stemloc grammar does not model basepair stacking effects due to
Parameterization
Under the SCFG framework, the probability parameters for the grammar can be estimated directly from data using the Inside-Outside algorithm with appropriate constraints, which are easy to supply (e.g. to sum over all parses consistent with a given alignment, one simply uses an appropriate alignment envelope). The parameters were trained from 56376 (non-independent) pairwise alignments from RFAM
stemloc allows the user to re-estimate all parameters from their own personal training set of trusted alignments. This may be a useful feature, since the training procedure described above is probably biased. Since training was performed using all kinds of sequence available in RFAM, including RNA sequences with computationally predicted secondary structure as well as those for which structures were experimentally confirmed, it is possible that the stemloc parameters may be skewed by the parameters of the computational methods used by the RFAM curators to predict structure. These include the homology modeling program INFERNAL
Authors' contributions
IH designed, programmed, tested and documented the algorithms.
Acknowledgements
The author thanks Sean Eddy for inspiring discussions and three anonymous reviewers for their helpful suggestions. The work was conceived during an NIH-funded workshop on RNA informatics organised by Elena Rivas and Eric Westhof in Benasque, Spain, 2003.