Abstract
Background
With the advancement of nextgeneration sequencing and transcriptomics technologies, regulatory effects involving RNA, in particular RNA structural changes are being detected. These results often rely on RNA secondary structure predictions. However, current approaches to RNA secondary structure modelling produce predictions with a high variance in predictive accuracy, and we have little quantifiable knowledge about the reasons for these variances.
Results
In this paper we explore a number of factors which can contribute to poor RNA secondary structure prediction quality. We establish a quantified relationship between alignment quality and loss of accuracy. Furthermore, we define two new measures to quantify uncertainty in alignmentbased structure predictions. One of the measures improves on the “reliability score” reported by PPfold, and considers alignment uncertainty as well as basepair probabilities. The other measure considers the information entropy for SCFGs over a space of input alignments.
Conclusions
Our predictive accuracy improves on the PPfold reliability score. We can successfully characterize many of the underlying reasons for and variances in poor prediction. However, there is still variability unaccounted for, which we therefore suggest comes from the RNA secondary structure predictive model itself.
Background
RNA secondary structure prediction is still an important problem in computational biology. With the advent of next generation sequencing and RNAseq technologies, many RNA structural changes are being found to play important roles in regulating gene expression [1,2]. Gene regulation studies can now be done on a genomewide scale. In some cases RNA secondary structures can be experimentally determined on a genomewide level [3], but these methods require RNA isolation and many not preserve in vivo structures. RNA secondary structure prediction programs are still often used to predict structures across the genome [4]. The predicted secondary structures, and predicted structural changes, are being used to find relationships and suggest mechanisms in gene regulatory networks.
Some methods for RNA secondary structure prediction only consider a single sequence as input. However, prediction quality can be improved by using multiple sequences, assuming that RNA secondary structure is conserved through evolution. Even without a complex evolutionary model, these additional structural constraints provide valuable information on folding. Comparative methods for RNA secondary structure prediction are based on this observation, and use evolutionary information from multiple alignments to improve prediction quality.
Methods for RNA secondary structure prediction generally fall into two categories. Thermodynamic models make use of freeenergy functions, which take experimentally determined energy parameters for individual structural elements. Dynamic programming is then used to find the secondary structure with the minimum free energy, which is reported as the predicted structure. This has been successfully implemented in programs such as RNAfold [5] and UNAfold [6]. Thermodynamic methods typically deal with the singlesequence prediction problem, but extensions such as RNAalifold [7] and PETfold [8] allow for comparative prediction.
Stochastic contextfree grammars (SCFGs), on the other hand, define a probability distribution over the space of RNA secondary structures. Posterior decoding techniques are typically used to determine, for example, the maximum expected accuracy structure [9]. SCFGs have been used for the singlesequence prediction problem [10], but their advantage comes through coupling with a molecular evolution model. The first comparative SCFGbased approach was developed in Pfold [11,12], where alignment column probabilities were determined through single and paired column evolution models, calculated via the Felsenstein pruning algorithm [13]. For a more complete review on RNA secondary structure prediction, see [14].
In genomewide predictions of RNA secondary structure, the accuracy of the secondary structure prediction program is rarely factored into analysis. Typically only the mean accuracy is reported in RNA secondary structure prediction benchmarks [14], with the variance in the accuracy given little thought. Variance in accuracy is particularly problematic in the case of singlesequence prediction. Figure 1 shows the cumulative density function of predictive accuracy for two singlesequence applications of RNA secondary structure prediction, RNAfold and PPfold (a recent implementation of Pfold, [15]), on 443 sequences taken from the RNASTRAND database [16]. Additionally, a uniform (0,1) cumulative density function is shown for comparison. The figure illustrates that for sequences in this data set, the predictive accuracy of RNAfold and PPfold is not very much different from a random number between 0 and 1. When genomewide RNA secondary structure prediction is done on a large number of single sequences, many predictions will be poor ones.
Figure 1. Cumulative density of secondary structure prediction accuracy on 433 RNASTRAND structures. Performance accuracy on 433 sequences from RNASTRAND with known secondary structure by RNAfold and PPfold.
Consequently, it is important to understand more about the variability of RNA secondary structure prediction programs. In comparative prediction, there are many sources of variability: alignment quality, the number of sequences chosen and which alignment samples have been taken, the evolutionary relationships between sequences, as well as the illconditioned nature of the folding model itself. Understanding and quantifying these variances is key for biological applications that rely on these folding programs. Additionally, other bioinformatics software that utilize these folding programs– for example inverse RNA folding algorithms [17]– may often experience a fundamental limitation in performance due to variance in structure prediction quality.
Sequence alignment is a fundamental step in most comparative sequence analysis pipelines. The typical approach is to create a single, trusted multiple alignment of the sequences using methods based on an artificial scoring scheme and heuristics to find a highly scoring alignment [18,19]. Although this methodology is successful when the alignment is well resolved, it has been shown in the context of downstream analyses that the end result can be highly sensitive to the choice of alignment [2022]. RNA secondary structure prediction methods take a variety of approaches with respect to possible errors in RNA alignments. Some methods (e.g. [23]) invoke a foldandalign approach directly, where alignment is done simultaneously with structure prediction. Pfold, instead, takes a fixed alignment as input, but allocates a nite probability to a nucleotide being any other nucleotide; this makes the model more robust to poor alignment quality. Most modern methods (e.g. [24]) still consider prediction from a single, fixed, alignment. Recently, alignmentfree methods have also been proposed [25]. However, even after considering poor alignment quality, there are many additional variances associated with poor comparative RNA secondary structure prediction.
Sequence selection is another important variable. Alignment methods may produce poor alignments due to poor individual sequences, which will in turn produce poor structure predictions. There have been methods developed to select homologous sequences, particularly [26], which is based on evolutionary models and structural constraints. This is implemented in [27], which shows strong results in RNA secondary structure prediction can be found by selecting useful sequences.
In this paper we consider the problem of variances in comparative RNA secondary structure prediction. We present statistical analyses of different variances including the relationship between structure prediction quality and chosen alignment distance from the reference alignment, and a predictive algorithm for accuracy is provided. Factors like the number of sequences in the alignment and the evolutionary distance of the sequences are considered. Finally, a novel method is presented which extends information entropy for stochastic contextfree grammars [28] to consider variation over alignments.
Statistical alignment
Statistical alignment [29] provides a solution to many of the issues encountered with the traditional approach of sequence alignment. It models sequence evolution as a stochastic process involving sequence insertions, deletions and character substitutions, which defines a probability distribution over the alignments of the sequences. Using techniques such as Expectation Maximisation or Markov Chain Monte Carlo (MCMC), it is possible to either maximize the likelihood of the alignment, or generate a representative set of putative alignments by sampling from the alignment distribution.
Several statistical alignment implementations have emerged in past years [3033], some of which allow cosampling other entities such as the evolutionary tree or the locations of cisregulatory motifs. Such methods can highlight homoplasy and alignment uncertainty with high accuracy or can be used to decrease alignment uncertainty effects in downstream analyses, for instance protein secondary structure prediction [34].
StatAlign is a statistical alignment software package [32] that allows joint Bayesian analysis of multiple alignments, phylogenetic trees and evolutionary parameters. The background model for insertions and deletions is a modified version of the TKF92 model [35] as described in [34]. The indel model can be coupled with an arbitrary substitution model (many of which are distributed with the software, both for protein and nucleotide sequence data). The Bayesian analysis is based on MCMC, the transition kernels are improved versions of those in [34]. StatAlign generates random samples from the joint posterior distribution of sequence alignments, evolutionary trees and model parameters. This highdimensional joint distribution can be analysed in several ways, ranging from the simple statistics of marginalised single dimensions (e.g. the posterior distribution of a single rate parameter) to the application of other tools to the alignment samples.
Alignment and RNA secondary structure accuracy metrics
To analyse variances of RNA secondary structure as alignment quality varies, we calculate a similarity score that measures how close a sample alignment is to the reference alignment. We use an alignment metric, taken from [36], which is generalised to an alignment method in [37].
Let be an alignment of a sequence s_{1} of length n to a sequence s_{2} of length m. Each column of can be expressed as pairs of the form , , and . We define
sets which represent ‘homology’, ‘insertion’ and ‘deletion’ respectively. Given these sets, we define the distance between two alignments and of two sequences s_{1} and s_{2} to be
For example, consider the case Then we have as there are no deletions, and as there are no insertions. This gives the distance between the alignments as zero, as would be expected.
Equation 1 can be generalised to sequence alignments with more than two sequences. Assuming now that a^{k} and are alignments of m sequences s_{1}, …, s_{m} of lengths n_{1}, …, n_{m}, we have
that is, summing all the pairwise alignment distances from Equation 1. This alignment metric if is then normalized and subtracted from 1 to produce a similarity score
The denominator of the fraction, , is the normalizing constant, the maximum that the alignment distance can be. The similarity score is bounded by 0 and 1, with 1 indicating that the sample alignment is identical to the reference alignment.
RNA secondary structure metrics
There are a wealth of available metrics on RNA secondary structure [14,38]. Here we use sensitivity, positive predictive value (PPV), and Fscore (the harmonic mean of sensitivity and PPV). Defining true positives (TP) as the number of basepairs correctly predicted, false positives (FP) as the number of true basepairs not predicted, and false negatives (FN) as the number of basepairs predicted which are incorrect, we have
The strength of these RNA secondary structure accuracy metrics is that they are easy to interpret, and make it straightforward to compare methods across different datasets. An Fscore of 1 would represent a structure prediction that was completely correct and an Fscore of 0 a structure prediction that only predicted incorrect basepairs.
Information entropy
As we later develop calculations for information entropy for a set of alignments, here we outline the computation of information entropy for a single alignment. The information entropy H of a probability distribution P with a set of events is defined as:
Information entropy is a measure for the “spread” of the probability distribution, and has welldefined lower and upper bounds. The minimum entropy of 0 occurs when there is only one outcome with probability 1, and the maximum entropy of log_{2} (n) occurs when there are n possible outcomes, each with probability 1/n, that is the uniform distribution. When the base of the logarithm is 2, the entropy is measured in bits. For a probability distribution, an entropy of k bits indicates that the expected value of the information content of observing a single outcome is k bits. In the context of secondary structure prediction, a low entropy therefore indicates that few secondary structures dominate the probability space, whereas a high entropy indicates a more even probability distribution over possible secondary structures. Thus, information entropy is a useful single quantity to characterize the underlying probability distribution of secondary structures.
The information entropy of the probability distribution over the possible secondary structures generated by a phyloSCFG can be obtained using expected rule frequencies, which can be computed using the insideoutside algorithm [28]. This is outlined here.
Let the set of all derivations for the input alignment be Φ. Since the probability of a derivation d can be written as the product of the SCFG production rule probabilities and the phylogenetic probabilities, we can write the total probability T of the grammar producing the input string as
where P_{G}[d] denotes the prior probabilities obtained from the SCFG part of the model, and P_{T} [d] are the likelihood factors obtained from the phylogenetic model. Conditioning on producing the input string, the normalized probability of a derivation d is . Consequently, we have that the information entropy of the input alignment under the phyloSCFG model is
which can be written using Equation 5 as
that is, separating out the SCFG contribution and the phylogenetic contribution. To calculate the entropy in practice, firstly we use a simplified form of the SCFG contribution. For a SCFG with set of production rules R, we can write the SCFG contribution in terms of the expected production rule frequency,
Secondly, we can simplify the phylogenetic contribution. Let r_{a} ∈ R be a SCFG rule which produces base pairs, and r_{b} ∈ R a SCFG rule which produces unpaired bases. Define 1^{d} (i, j), the indicator function for whether the column pair (i, j) is emitted from a rule r_{a} (i.e. position i and j form a pair), and 1^{s} (i), the indicator function for whether column i is emitted from a rule r_{b} (i.e. position i is unpaired). Finally, define f_{d} (r_{a}) as the frequency that rule r_{a} is used in derivation d. Then
As is the total probability under the model that positions i and j are emitted from a rule r_{a}, and is just the total probability under the model that position i is emitted from a rule r_{b}, the quantity can be computed using the expected rule frequencies obtained through the insideoutside algorithm [39].
Methods
Data
StatAlign Dataset
To test factors relating to alignment quality and secondary structure prediction quality, a large number of alignment samples from trusted reference alignments with known secondary structures are needed. We have created a curated RNA dataset based on the Rfam database [40] for the purposes of evaluating the framework. Alignments of homologous RNA sequences with known consensus secondary structure were extracted from Rfam seed alignments. From these, 50 RNA families with at least 50 sequences were randomly selected (see Additional file 1) in the section “StatAlign Dataset”. From each family, in a prefiltering step we removed divergent sequences with long indels, as follows. We defined insertion as consecutive nongap characters of a sequence in the reference alignment which appear in columns where over 80% of the sequences have gaps. Deletions were defined analogously. Columns with fraction of gaps between 20% and 80% were regarded ambiguous and ignored. To overpenalize long indels, we applied the super linear score function l × log_{2} (l + 1) for indels of length l, indels being defined as above. Then, a sequence was removed from a family if its total indel score was beyond 20 and the difference between its indel score and the mean indel score in the family was beyond 3.7 times the standard deviation of the indel scores in family, i.e. if the sequence had significantly more and/or longer indels than what is representative of the rest of the family. Then, 50 sequences were selected at random, and further random selection was done to get pairs, triplets etc. up to 15 sequences in alignments. From these samples of known reference alignment, we could produce many different alignment samples using StatAlign [32]. For each RNA alignment, 200 alignment samples were taken, and the reference alignments were also kept to for comparison. We refer to this dataset throughout as the StatAlign dataset.
Additional file 1. Information on the 50 RNA families randomly selected from Rfam.
Format: TXT Size: 4KB Download file
Random alignment data
We also wanted to measure the effect of alignment accuracy on secondary structure independently of alignment method. Therefore we created a dataset based on the RNA families of the StatAlign dataset, where alignments were sampled uniformly at various fixed distances from the reference alignment. Using the alignment distance measure in Eq. 2, we created a Metropoliscoupled MCMC framework that runs several parallel MCMC chains to take alignment samples from the target distribution
where a^{r} is the reference alignment, d is the target distance to get samples from and t is the temperature of the chain. To improve the mixing properties of the chains we allowed each chain to explore alignments that do not exactly match the specified target distance (with an exponentially decreasing probability, as described by Eq. 9) but then rejected nonexact matches when taking samples from the cold chain (t = 1).
The state space of the Markov chains is the set of all possible multiple alignments of the input sequences. Alignments that represent the same set of homology statements, and only differ by the order of the alignment columns, are treated as different (e.g. alignments and of the sequences A and B). The following basic alignment rearrangement moves are iterated:
1. breaking an alignment column into two columns by moving one of its characters into a new column, placing gaps in every other row:
2. the exact reverse of the previous, i.e. joining two compatible adjacent columns – one having gaps in all but one row, the other having a gap in that row – to form a single column (see above)
3. relocating one character of a column to a gap position of an adjoining column:
As the space of alignments is vast and the moves are very local, to get a good approximation of uniform sampling, the number of steps that have to be done is on the order of millions, even for small inputs when a single chain is run. To this end, we created a special alignment representation where the above rearrangements can be carried out very efficiently (essentially constant time, i.e. in time proportional to column size but independent of the alignment length, even when the cost of randomly selecting the column to alter is included). For moderate input sizes (5–10 sequences of length 300–500) this representation allowed us to take 1.5 million rearrangement steps in a second (using a Java 6 implementation on one core of a 2.4 GHz Intel i3 processor).
A single chain is sufficient for small alignments, but very slow mixing becomes an issue for practical alignment sizes. We have found that standard parallel tempering techniques effectively speed up mixing if the chain temperatures are chosen correctly. Because the optimal temperatures vary significantly depending on reference alignment characteristics and target distribution, a simple acceptance optimization routine was added that tunes chain temperatures to achieve chain swap acceptance ratios between neighboring chains around a preset value. We found ratios 0.7–0.8 to be the most effective. With this framework, running 8–10 parallel chains was best to maximize the speed of convergence to the uniform distribution as compared to a single chain with sampling times 8–10fold.
The above described framework was utilized to create a dataset consisting of the RNA families of StatAlign dataset, where for each family and each selection of 5 representative sequences from the family, 10 samples were taken at a distance corresponding to a similarity of 0.98 to the reference alignment (see Eq. 3), then 10 samples at a similarity of 0.96 etc., down to a similarity ratio of 0.6. We refer to this dataset as the Random Alignment dataset.
Extending information entropy to alignment space
The information entropy defined for a single alignment contains the length of the alignment as a parameter. Attempting to extend the measure to the probability mass over RNA secondary structure space, variable alignment length is a concern. For example, if we have two alignments and corresponding secondary structures
we would not want to suggest that these alignments give two different secondary structures. Consequently, we use a projection method to give alignment column pairing probability matrices the same dimension, so that the matrices can be averaged.
For a given set of input sequences, the sequence containing the greatest number of nongap characters was chosen as the reference sequence. Each pairing probability matrix is projected by deleting columns and rows of the matrix corresponding to gap positions in the reference sequence, thus ensuring each matrix corresponding to an alignment sample has the same dimensions (a square matrix, with dimensions equal to the number of nongap characters in the reference sequence). For example, we might start with an (n + 1) (n + 1) matrix, delete row i and column i due to a gap in position i in the reference sequence, then end up with an n × n matrix as required.
To calculate information entropy over alignments, we need to be able to calculate the probability of each alignment. However, we cannot calculate the information entropy explicitly, since the probability of a given secondary structure ss is
and there is no known efficient way to recurse over all possible alignments. Instead, we create an information entropy measure based on samples from the alignment space, and show that, in the samplesize limit, the alignmentsample information entropy tends to the true information entropy.
Consider alignment samples a_{1}, …, a_{n} from the space of all alignments of m sequences, sampled according to their probability. If we are using statistical alignment, as in StatAlign, we will be sampling alignments in this fashion. Then we have for a column c, once alignments have been projected to the same length, the probability of being unpaired
with an analogous result holding for paired columns. We now define a sample phylogenetic probability P_{S} as the average of the sample phylogenetic probabilities:
To show this sample probability converges to the true probability as sample size tends to infinity, we first note that, rearranging Equation 12:
with 1 being the indicator function. Hence
Taking the limit n → ∞, by the weak law of large numbers
as required.
Now, we have from above that the entropy H_{Φ} (P) of grammar derivations Φ of a grammar G is
Since tree probabilities are the product of unpaired and paired column probabilities in the derivation, the tree probabilities can be recalculated from the sample of alignments. These can then be substituted into the above equation to get an approximation to the information entropy over the space of sampled alignments as well. We refer to this entropy of more than one alignment as the alignment consensus entropy.
Results and discussion
Alignment quality and predictive accuracy
A common question in comparative RNA secondary structure prediction is how many sequences are required to get a good structure prediction. This is briefly addressed in [11], but the sample size is quite small, and only total accuracy is considered. With 15 sequences in the alignment, we assume that no more evolutionary information can be gained by adding further sequences, but when fewer sequences are present, lack of information might yield a poorer structure prediction.
Instead of considering total accuracy, we wanted to quantify relatively how much accuracy is lost when fewer sequences are present. For example, if an alignment with 15 sequences is predicted with an average Fscore of 0.5, and an alignment of the same family with 3 sequences is predicted with an average Fscore of 0.4, then 80% of the accuracy will have been retained, that is the alignment with 3 sequences has a relative Fscore of 0.8.
To investigate how many sequences are needed for a good structure prediction, we took the StatAlign dataset and considered the relative Fscore for each family. Almost 100% of the possible Fscore was achieved when the alignment contained 5 sequences, for both PPfold and RNAalifold. Interestingly, the accuracy of RNAalifold decreased slightly as the number of sequences was increased. This is due to the increased number of noncanonical basepairs in the alignments, which the thermodynamic method could not predict. PPfold, on the other hand, has a small probability for noncanonical basepairs, so is not affected by these in the same way. Overall these results suggest that 5 sequences are sufficient for approaching maximal predictive accuracy.
To consider the effect alignment quality has on RNA secondary structure prediction, we took the StatAlign and Random Alignment datasets and measured their similarity to the reference alignment using the similarity score above. Again, percentage of accuracy retained was calculated by normalizing against the accuracy achieved on the reference alignment. Logscale heatmaps showing the accuracy and percentage of accuracy retained for the StatAlign dataset (A) and Random Alignment dataset (B) for PPfold and RNAalifold can be seen in Figure 2. As expected, decreasing alignment quality decreases the accuracy of structure predictions. However, other patterns also emerge from these graphs.
Figure 2. Logscale heatmaps for mean accuracy, and for percentage of accuracy retained, when alignment quality varies. Performance accuracy on the StatAlign dataset (A) and Random Alignment dataset (B) for PPfold and RNAalifold when alignment quality is varied. Percentage of accuracy is determined by, for each family, normalising by the average accuracy on the reference alignment. The R^{2} correlations given are determined by a linear regression model.
Firstly, in the StatAlign dataset (Figure 2A), we observe a weaker correlation between alignment distance and accuracy than for the Random Alignment dataset (Figure 2B). This suggests that predictions are better for the StatAlign dataset on alignments far from the reference alignment. StatAlign looks to produce alignments with a high likelihood under an evolutionary model, which the random alignments do not consider, and so StatAlign's alignments could be considered more realistic. This con rms the expectation that StatAlign's alignments are more useful for RNA secondary structure prediction than random alignments.
Secondly, for the StatAlign dataset, we see a much higher correlation with the Fscore than with the relative Fscore. Some families of RNA consistently produce the same alignment, which skews the graphs. For example, an alignment which consistently has similarity to the reference alignment of 0.9, and Fscore 0.5 would give a relative Fscore of 1 every time, and would support the correlations seen on each graph. Because we can control the spread of distances in the random alignment data set, we don't see this behaviour. As expected, variation comes more with families that produce more variable alignments. In the StatAlign dataset, this is obscured by those families which produce consistent alignments.
Lastly, for the Random Alignment dataset (Figure 2B), we see many more zero quality predictions in the case of RNAalifold than in the case of PPfold. This is most easily seen by the larger intensity of red in the main body of the heatmap for PPfold. This suggests that PPfold functions better than RNAalifold when given a lowquality alignment, likely due to its more complete model for molecular evolution.
Evolutionary distance
We also consider the effects of evolutionary distance on RNA secondary structure prediction quality. One might expect that there is a “sweet spot” for evolutionary distance– sequences too close to each other do not display enough covariation to benefit the evolutionary model, but the evolutionary signal might be lost if the distance is too large. To investigate this, we measured evolutionary distance in the phylogenetic trees predicted by PPfold using four different measures:
Measure 1– Mean of all the evolutionary distances,
Measure 2– Standard deviation of all the evolutionary distances,
Measure 3– Maximum evolutionary distance,
Measure 4– Maximum difference between evolutionary distances.
All four measures would be expected to be correlated with sequence length, which is well known to correlate with predictive accuracy. To account for this, we considered relative evolutionary distance, similar to the relative predictive accuracy above. A measure was normalized by the average for that family and number of sequences, so that it could be seen whether an alignment had greater or less evolutionary distance than might be expected. We then looked at the correlation between relative evolutionary distance and predictive accuracy. All methods correlated extremely poorly with predictive accuracy and relative predictive accuracy, measures 1 to 4 having R^{2} correlations with relative predictive accuracy of 0.0259, 0.0373, 0.0303, and 0.0265 respectively (data not shown). This suggests that evolutionary distance is not an underlying factor for variation in RNA secondary structure prediction, and those other factors, such as those seen in [26], play a more important role in determining predictive accuracy.
Alignment distances and maximum posterior decoding
Given the results concerning accuracy lost as alignment quality decreases, it would be desirable to be able to predict alignment quality, with the hope of predicting structure prediction quality. This has previously been attempted in [36]. First, the sequences were aligned with ClustalW [18]. The sequences were then realigned using 4 other programs (Alignm [41], MUSCLE [42], ProbCons [43], and TCoffee [44]) and the similarity between the alignment generated using ClustalW to each of the 4 other alignments was measured. The maximum of the 4 similarities, max (g), was chosen as a predictor of alignment quality. The authors of [36] detected a strong correlation between the true similarity (the similarity between the ClustalW alignment and a reference alignment) and max (g).
We implemented a modified version of this method. For a given set of input sequences we aligned with both AMAP [36] and with StatAlign, obtaining the maximum posterior decoding alignment (MPD alignment) from StatAlign. The similarity between the AMAP alignment and the MPD alignment was used as our predicted similarity measure. This produced a strong correlation between our predicted similarity and true similarity, with an adjusted Rsquared value of 0.6524.
We also implemented another method, which calculates an estimate of the expected similarity score using posterior probabilities from the MPD alignment. For each column, we might expect that a posterior probability close to 1 would contribute a score of close to 1 to the similarity measure. So our predicted alignment distance is just the average of the column posterior probabilities. Figure 3 shows an example of the correlation between predicted similarity and true similarity, here giving an adjusted Rsquared value of 0.8403. Our new predicted similarity can be calculated efficiently, and is a strong predictor of true similarity to the reference alignment.
Figure 3. Predicted similarity with the reference alignment. Similarity with the reference alignment against predicted similarity with the reference alignment for 50 Rfam families with 5 sequences in. Predicted similarity was calculated by averaging the posterior probabilities given in the MPD alignment. The adjusted Rsquared value for a linear fit is 0.8403.
Extending information entropy to alignment space
To test the information entropy extension developed above, we calculated the alignment consensus entropy for samples of alignments from the StatAlign dataset. Figure 4 gives information entropy for 3 different representative RNA families from the StatAlign dataset. On each graph, the information entropy for each of 1000 statistical alignment samples is given, as well as the alignment consensus entropy. The leftmost figure is one where the alignment samples were very similar, the rightmost figure where the alignment samples were very different, and the middle figure closer to the median value of alignment sample similarity. For family 3 (where sample alignments had low diversity), we see that the alignment consensus entropy is comparative to the mean entropy of the individual samples. This is expected, as it indicates there is little uncertainty in the alignment. On the other hand, the highdiversity family has much higher alignment consensus entropy than for each individual sample. This is again expected, as the difference in entropies indicates a high uncertainty in alignment. In this way, we can incorporate alignment uncertainty into our understanding of comparative RNA secondary structure prediction.
Figure 4. Information entropy with alignment uncertainty. Information entropy calculated for each alignment sample (yellow) and the alignment consensus entropy (blue). The leftmost figure is one where the alignment samples were very similar, the rightmost figure where the alignment samples were very different, and the middle figure closer to the median value of alignment sample similarity. See Methods section for details on entropy calculation.
Predicting secondary structure accuracy
Given strong correlations between the alignment quality and the structure prediction quality, we might expect that we could find a predictor of structure prediction accuracy. By “integrating out” alignment uncertainty, we may find a reliability score which is more reliable than the one currently reported by the PPfold. To test this, we predicted accuracy for one of the fivesequence alignments of each family and then tested the predicted accuracy against the true accuracy. The PPfold reliability score produced an adjusted R^{2} score of 0.252 when considering correlation with the true Fscores.
For our new reliability score, we adjusted the PPfold reliability score to consider only basepairs, as the Fscore considers only basepairs (i.e. ignored unpaired nucleotide probabilities). We then performed linear regression with the average of the MPD column probabilities, the information entropy of the alignment space, and this pairsonly reliability score against the known Fscore measure. This multiple regression improved the reliability score significantly. Figure 5 shows the predicted Fscore against the true Fscore, for a randomly chosen fivesequence alignments from each family of the StatAlign dataset. The adjusted R^{2} value with the new reliability measure improved to 0.496. These results seem to indicate that while alignment quality does affect structure prediction quality, the actual structure prediction model still plays a great role in the overall prediction accuracy. Consequently, improving these models, possibly by incorporating other kinds of information (such as experimental probing data), is an area where new research e orts are still needed in RNA secondary structure prediction.
Figure 5. Predicted RNA secondary structure accuracy. Predicted Fscore of PPfold on MPD alignment against actual Fscore of PPfold on MPD alignment. The adjusted Rsquared value a linear fit is 0.496.
Conclusions
In this paper we have explored a number of factors which can contribute to poor RNA secondary structure prediction quality. We established a relationship between alignment quality and expected loss of accuracy. Furthermore, we provided a method to predict alignment quality based only on statistical alignment samples. While our predictor of accuracy improves on the PPfold reliability score, there is still a large amount of variability unaccounted for, which we therefore suggest comes from the predictive model itself. To consider this further, we extended the information entropy measure for SCFGs to consider uncertainty in alignments.
The fact that our accuracy predictor did not account for all the variances associated with RNA secondary structure prediction, despite good predictors being found for alignment quality and a strong correlation between alignment quality and predictive accuracy, suggests that whilst alignment quality is an important factor, the predictive model itself determines plays a large part in the quality of prediction. Given what is shown in Figure 1 for single sequence predictions, that the accuracy of PPfold and RNAfold is very variable, it is unsurprising that variances remain. Clearly then, further efforts should be put into creating stronger singlesequence models, and then the advantages of evolutionary modelling and additional structural constraints will benefit further. The use of experimental data from new probing experiments as well as more biologically realistic constraints, such as kinetic or cotranscriptional folding, may improve the results of RNA secondary structure prediction.
Competing interests
The authors declare they have no competing interests.
Authors’ contributions
JWJA, AN, and ZS formulated the ideas presented here, carried out some of the analyses, and wrote the manuscript. AN generated the StatAlign and Random Alignment datasets, and MG, PA, and IE generated some of the results for the reliability scores, alignment distances, and predictive algorithm. JWJA and ZS developed the theoretical framework for extending information entropy. All authors read and approved the final manuscript.
Acknowledgements
In part, this work was carried out as part of the Oxford Summer School in Computational Biology, 2012, in conjunction with the Department of Plant Sciences and the Department of Zoology. Funding was provided by the EU COGANGS Grant. ÁN would like to thank BBSRC for continued funding through OCISB. JWJA would like to thank the EPSRC for funding.
References

Chu D, Barnes DJ, von der Haar T: The role of tRNA and ribosome competition in coupling the expression of different mRNAs in Saccharomyces cerevisiae.
Nucleic Acids Res 2011, 39(15):67056714. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gahura O, Hammann C, Valentova A, Puta F, Folk P: Secondary structure is required for 3' splice site recognition in yeast.
Nucleic Acids Res 2011, 39(22):97599767. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kertesz M, Wan Y, Mazor E, Rinn JL, Nutter RC, Chang HY, Segal E: Genomewide measurement of RNA secondary structure in yeast.
Nature 2010, 467(7311):103107. PubMed Abstract  Publisher Full Text

Washietl S, Hofacker IL, Lukasser M, Huttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome.
Nat Biotech 2005, 23(11):13831390. Publisher Full Text

Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures.
Chem Monthly 1994, 125(2):167188. Publisher Full Text

Markham NR, Zuker M, Keith JM, Walker JM: UNAFold. Totowa, NJ: Humana Press; 2008:331. [Bioinformatics]
[Methods in Molecular Biology; SP: 3]

Bernhart S, Hofacker I, Will S, Gruber A, Stadler P: RNAalifold: improved consensus structure prediction for RNA alignments.
BMC Bioinformatics 2008, 9:474. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Seemann SE, Gorodkin J, Backofen R: Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments.
Nucleic Acids Res 2008, 36(20):63556362.
http://nar.oxfordjournals.org/content/36/20/6355.abstract webcite
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Anderson JWJ, Tataru P, Staines J, Hein J, Lyngso R: Evolving stochastic contextfree grammars for RNA secondary structure prediction.
BMC Bioinformatics 2012, 13:78. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Dowell R, Eddy S: Evaluation of several lightweight stochastic contextfree grammars for RNA secondary structure prediction.
BMC Bioinformatics 2004, 5:71. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Knudsen B, Hein J: RNA secondary structure prediction using stochastic contextfree grammars and evolutionary history.
Bioinformatics 1999, 15(6):446454. PubMed Abstract  Publisher Full Text

Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic contextfree grammars.
Nucleic Acids Res 2003, 31(13):34233428. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Felsenstein J: Evolutionary trees from DNA sequences: A maximum likelihood approach.
J Mol Evol 1981, 17(6):368376. PubMed Abstract  Publisher Full Text

Gardner P, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches.
BMC Bioinformatics 2004, 5:140. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Sukosd Z, Knudsen B, Kjems J, Pedersen CNS: PPfold 3.0: fast RNA secondary structure prediction using phylogeny and auxiliary data.
Bioinformatics 2012, 28(20):26912692. PubMed Abstract  Publisher Full Text

Andronescu M, Bereg V, Hoos H, Condon A: RNA STRAND: The RNA Secondary Structure and Statistical Analysis Database.
BMC Bioinformatics 2008, 9:340. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lyngsoe R, Anderson J, Sizikova E, Badugu A, Hyland T, Hein J: Frnakenstein: multiple target inverse RNA folding.
BMC Bioinformatics 2012, 13:260. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs.
Nucleic Acids Res 2003, 31(13):34973500. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment.
Nucleic Acids Res 2005, 33(2):511518. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang LS, LeebensMack J, Wall PK, Beckmann K, de Pamphilis CW, Warnow T: The Impact of Multiple Protein Sequence Alignment on Phylogenetic Estimation.
IEEE/ACM Trans Comput Biol Bioinform 2011, 8(4):11081119. PubMed Abstract  Publisher Full Text

Chivian D, Baker D: Homology modeling using parametric alignment ensemble generation with consensus and energybased model selection.
Nucleic Acids Res 2006, 34(17):e112e112. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gardner PP, Wilm A, Washietl S: A benchmark of multiple sequence alignment programs upon structural RNAs.
Nucleic Acids Res 2005, 33(8):24332439. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bradley RK, Pachter L, Holmes I: Speci c alignment of structured RNA: stochastic grammars and sequence annealing.
Bioinformatics 2008, 24(23):26772683. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Doose G, Metzler D: Bayesian sampling of evolutionarily conserved RNA secondary structures with pseudoknots.
Bioinformatics 2012, 28(17):22422248. PubMed Abstract  Publisher Full Text

Harmanci A, Sharma G, Mathews D: TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences.
BMC Bioinformatics 2011, 12:108. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Engelen S, Tahi F: Predicting RNA secondary structure by the comparative approach: how to select the homologous sequences.
BMC Bioinformatics 2007, 8:464. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Engelen S, Tahi F: Tfold: efficient in silico prediction of noncoding RNA secondary structures.
Nucleic Acids Res 2010, 38(7):24532466. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sukosd Z, Knudsen B, Anderson JWJ, Novak A, Kjems J, Pedersen C: Characterising RNA secondary structure space using information entropy.
BMC Bioinformatics 2013, 14:S22. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Hein J, Wiuf C, Knudsen B, Moller MB, Wibling G: Statistical alignment: computational properties, homology testing and goodnessof fit.
J Mol Biol 2000, 302:265279. PubMed Abstract  Publisher Full Text

Lunter G, Miklos I, Drummond A, Jensen J, Hein J: Bayesian coestimation of phylogeny and sequence alignment.
BMC Bioinformatics 2005, 6:83. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Suchard MA, Redelings BD: BAliPhy: simultaneous Bayesian inference of alignment and phylogeny.
Bioinformatics 2006, 22(16):20472048. PubMed Abstract  Publisher Full Text

Novak A, Miklos I, Lyngso R, Hein J: StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees.
Bioinformatics 2008, 24(20):24032404. PubMed Abstract  Publisher Full Text

Satija R, Novak A, Miklos I, Lyngsø R, Hein J: BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC.
BMC Evol Biol 2009, 9:217. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Miklos I, Novak A, Dombai B, Hein J: How reliably can we predict the reliability of protein structure predictions?
BMC Bioinformatics 2008, 9:137. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Thorne JL, Kishino H, Felsenstein J: Inching toward reality: An improved likelihood model of sequence evolution.
J Mol Evol 1992, 34:316. PubMed Abstract  Publisher Full Text

Schwartz AS, Myers EW, Pachter L: Alignment Metric Accuracy. http://arxiv.org/abs/qbio/0510052 webcite

Schwartz AS, Pachter L: Multiple alignment by sequence annealing.
Bioinformatics 2007, 23(2):e24e29. PubMed Abstract  Publisher Full Text

Freyhult E, Gardner P, Moulton V: A comparison of RNA folding measures.
BMC Bioinformatics 2005, 6:241. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lari K, Young SJ: The estimation of stochastic contextfree grammars using the InsideOutside algorithm.
Comput Speech Language 1990, 4:3556. Publisher Full Text

GriffithsJones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating noncoding RNAs in complete genomes.
Nucleic Acids Res 2005, 33(suppl 1):D121D124. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Walle IV, Lasters I, Wyns L: Alignm a new algorithm for multiple alignments of highly divergent sequences.
Bioinformatics 2004, 20(9):14281435. PubMed Abstract  Publisher Full Text

Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res 2004, 32(5):17921797. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: ProbCons: Probabilistic consistencybased multiple sequence alignment.
Genome Res 2005, 15(2):330340. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cedric Notredame DGH, Heringa J: TCoffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment.
J Mol Biol 2000, 302:205217. PubMed Abstract  Publisher Full Text