Skip to main content

Dinucleotide controlled null models for comparative RNA gene prediction

Abstract

Background

Comparative prediction of RNA structures can be used to identify functional noncoding RNAs in genomic screens. It was shown recently by Babak et al. [BMC Bioinformatics. 8:33] that RNA gene prediction programs can be biased by the genomic dinucleotide content, in particular those programs using a thermodynamic folding model including stacking energies. As a consequence, there is need for dinucleotide-preserving control strategies to assess the significance of such predictions. While there have been randomization algorithms for single sequences for many years, the problem has remained challenging for multiple alignments and there is currently no algorithm available.

Results

We present a program called SISSIz that simulates multiple alignments of a given average dinucleotide content. Meeting additional requirements of an accurate null model, the randomized alignments are on average of the same sequence diversity and preserve local conservation and gap patterns. We make use of a phylogenetic substitution model that includes overlapping dependencies and site-specific rates. Using fast heuristics and a distance based approach, a tree is estimated under this model which is used to guide the simulations. The new algorithm is tested on vertebrate genomic alignments and the effect on RNA structure predictions is studied. In addition, we directly combined the new null model with the RNAalifold consensus folding algorithm giving a new variant of a thermodynamic structure based RNA gene finding program that is not biased by the dinucleotide content.

Conclusion

SISSIz implements an efficient algorithm to randomize multiple alignments preserving dinucleotide content. It can be used to get more accurate estimates of false positive rates of existing programs, to produce negative controls for the training of machine learning based programs, or as standalone RNA gene finding program. Other applications in comparative genomics that require randomization of multiple alignments can be considered.

Availability

SISSIz is available as open source C code that can be compiled for every major platform and downloaded here: http://sourceforge.net/projects/sissiz.

Background

Comparative genome analysis is currently the most widely used strategy to detect and annotate noncoding RNAs (ncRNAs) [1, 2]. In the past few years a series of different algorithms have been developed that predict functional ncRNAs on the basis of conserved secondary structure [3–10]. Some of these methods have been used to predict novel ncRNAs on a genome wide scale [7, 11–14]. In combination with experimental verification (microarray, RT-PCR, Northern blot) these methods could successfully uncover many examples of novel ncRNAs [15–20]. However, in particular in large vertebrate genomes the signal-to-noise ratio of true predictions and false positives is thought to be relatively low [20]. In a recent paper, Babak and colleagues demonstrated that comparative ncRNA gene finders are strongly biased by the genomic dinucleotide content leading to an excess of false predictions [21]. Especially methods that are based on a thermodynamic folding model are sensitive to this effect: In the so-called nearest neighbour model, energies are not assigned to single base-pairs but rather to neighbouring base-pairs that stack on each other. As a consequence, the folding stability of genomic sequences does not only depend on the monunucleotide content but also the dinucleotide content.

To assess the significance of predicted structures, e.g. to estimate the false discovery rate in a genomic screen for ncRNAs, one should therefore compare the genomic predictions to the results obtained on randomized data with the same dinucleotide content. In the case of single sequences, there are well known and widely used algorithms to generate dinucleotide controlled random sequences either by shuffling or first order Markov chain simulation [22, 23]. However, there is currently no algorithm to randomize multiple sequence alignments preserving the dinucleotide content. Babak and colleagues [21] added the conservation of dinucleotides as an additional constraint to the commonly used (mononucleotide) shuffling algorithm shuffle-aln.pl [5] and applied it to pairwise alignments. Their approach corresponds to a heuristic used in reference 24, that is very inefficient as only a small subspace of the whole permutation space is covered. The heuristic exchanges only positions that have the same neighbours left and right. For the short sequence ACAGCCAA for example not a single permutation can be found that way. However, there are 11 such permutations according to the Altschul & Erikson algorithm [22]. But even a more efficient shuffling algorithm will soon run into difficulties on multiple alignments. Unless two neighbouring columns are 100% conserved, there are several different dinucleotide pairs in these columns. It is therefore impossible to exactly preserve the dinucleotide content as in the single sequence case.

In this paper, we address the problem in a different way. In analogy to a first order Markov model for single sequences, we simulate alignments of a given dinucleotide content. We present a substitution model that captures the neighbour dependencies and all other relevant alignment features. We describe a time efficient way to estimate a tree under this model that we use as a guide to simulate alignments of the desired properties. This new control strategy is tested on genomic alignments and the effect on thermodynamic RNA structure predictions is studied.

Results

Requirements for an accurate null model

An optimal null model preserves all the features of the original data with the exception of the signal under question that needs to be removed efficiently. In our case, the data are multiple alignments of homologous sequences and the signal of interest is an evolved RNA secondary structure. Correlations arising from base-pairing patterns need to be removed. Currently, alignments are usually randomized by shuffling the alignment columns (see ref. 5 for a discussion of this method). Although the shuffling approach has its limitations and considering dinucleotides seems difficult, it is an appealing approach because it is relatively simple, fast, and extremely conservative. Changing the order of the columns does not change the mutational patterns within the columns and thus the underlying phylogenetic tree is exactly preserved.

In this paper we attempt to simulate new alignments from scratch. Even the most sophisticated model cannot capture all evolutionary processes and therefore a simulation approach will inevitably change the original data more than shuffling does. So much care has to be taken to preserve all the relevant characteristics of the data. To qualitatively assess the most important parameters that need to be considered in our model, we performed a series of simulation experiments. Using a simple tree with four taxa we simulated alignments under the HKY evolutionary model [25]. We systematically varied model and tree parameters to study how they affect thermodynamic RNA consensus structure predictions in the alignment. We used RNAalifold [26] to predict consensus secondary structures which is the basis of the AlifoldZ [5] and RNAz [6] gene finders.

Not surprisingly, base composition is one of the parameters affecting the predicted folding energies strongest (Fig. 1A). High G+C content leads to more stable RNA predictions, while high A+T content gives less stable predictions. As mentioned in the introduction and in fact the main motivation of this paper, also dinucleotide content affects folding energies. We used our simulation algorithm that is described below to simulate alignments of the same mononucleotide content but varying dinucleotide content. Fig. 1B shows for example that a three times enriched ApT content lead to more stable predictions. The excess of some other dinucleotides like for example GpT can cause the opposite effect leading to less stable predictions.

Figure 1
figure 1

Parameters effecting thermodynamic consensus RNA structure predictions. As a basic parameter set we used equal base frequencies of 0.25, a transition/transversion rate ratio κ = 1, and the following tree ((A:0.09,B:0.09):0.09,(C:0.09,D:0.09):0.09) One parameter was varied at a time while others were kept constant. If necessary branch lengths were adjusted to keep a mean pairwise sequence identity (MPI) of 0.75 ± 0.01. 1000 alignments of length 80 were simulated under each condition. Cumulative histograms for the RNAalifold consensus folding energies are shown. Please note that we plot negative minimum free energies, i.e. higher values correspond to more stable folds. (A) Base frequencies were varied to get high and low G+C content. (B) Two specific dinucleotide frequencies were elevated 3-fold while the mononucleotide content was kept constant. (C) Branch lengths were equally scaled to produce alignments with lower or higher MPI identity than for the basic tree. (D) The transition/transversion rate ratio was varied. κ = 1 means equal rates, while κ > 1 gives more transition than transversions. (E) The alignment of size 80 was divided into a central block of 40 and two anking regions of 20. We set 100% conservation in the central block and low conservation in the anks (rate "high-low-high") and the other way round ("low-high-low"). The total average MPI was always 0.75. (F) We tested all possible topologies of this 4 taxa tree and adjusted the branch lengths to give a MPI of 0.75. For one given topology, all the branch lengths were of the same length.

Another major parameter that needs to be controlled is the sequence diversity of the alignment. Variation of the branch lengths of the tree gives alignments with different sequence diversity which we usually measure as the mean pairwise sequence identity (MPI, also sometimes refered to as average pairwise sequence identity APSI). High diversity (i.e. low MPI) makes it difficult to predict a consensus structure if there is no selection pressure for it. On the other hand, almost perfectly conserved sequences fold readily in some random structure even if there is no natural RNA structure present. Therefore we observe a strong dependency on the MPI (Fig. 1C).

One well known characteristic of natural mutation processes are the different rates for transitions and transversions [27]. Interestingly, this also affects the consensus structure predictions. A model with equal transition/transversion rates (parameter κ = 1 in the HKY model) gives less stable predictions than a model with more realistic rates (e.g κ = 4, Fig. 1D). This parameter affects the type of column patterns observed in the simulated alignments which in turn affects how well they can form consensus base pairs.

Natural mutation processes are not homogeneous across all sites, in particular in functional genomic regions. It was observed previously that mutation patterns within an alignment can affect structure predictions [5]. For example, an alignment containing a 100% conserved block with low mutation rate that is flanked by highly divergent regions of high mutation rate can have different folding energies compared to an alignment with homogeneous rates but the same overall MPI (Fig. 1E). The same is true for patterns of insertions and deletions which was also already discussed in reference 5 and which we do not show here explicitly again.

We also tested the effect of different tree topologies, but did not find a significant influence of this parameter at least in our four taxa example.

Taken together, an accurate randomization procedure needs to generate alignments that preserve (i) mono- and dinucleotide content, (ii) mean pairwise sequence identity, (iii) transition/transversion rate ratio (iv) site-specific mutation rates, and (v) gap patterns.

In the next section we describe a model that is capable of simulating alignments under these constraints.

Algorithm

Model

Sequence evolution is usually described by a time-continuous Markov process [27, 28]. The most commonly used models assume that all sites of a sequence evolve independently from each other rendering it impossible to model dinucleotide dependencies between neighbouring pairs. Various evolutionary models have been proposed in the past years to overcome this limitation [29–36]. We make use of the recently introduced framework called SISSI (SImulating Site-Specific Interactions). SISSI allows to define site dependencies of arbitrary complexity in the form of a "neighbourhood system" that also may include overlapping dependencies [37]. Given the requirements of our specific problem, we extended and simplified several aspects of SISSI as necessary.

Following the general framework of SISSI, we introduce a site-specific rate matrix Q k for every site k = 1, ..., l in a sequence x = (x1, ..., xl). This matrix defines the substitution process at site k, where the substitution of a given nucleotide x k ∈ A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ = {A,C,G,U} by another one depends on the states xk-1, x k , xk+1(Fig. 2).

Figure 2
figure 2

Site dependencies for overlapping dinucleotides (red-gray): The substitution process of a given nucleotide x k at site k by another one depends on the states xk-1, x k , xk+1, the subsequence s k . Q k has the dimension 64 × 64, where only one mutation is allowed at the current site k. The substitution rate for the whole sequence q(x) is the sum of each rate q(k) = Q k (s k , s k ) multiplied with a site-specific scaling factor f k , with k = 1, ..., l.

Thus, the instantaneous rate matrix Q k has the dimension | A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ |3 × | A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ |3 = 64 × 64. The stationary distribution of Q k determines the equilibrium dinucleotide content of our system (see the next section for how the required trinucleotide frequencies of Q k are calculated from the dinucleotide frequencies).

To be able to control the transition/transversion rate ratio and the site-specific mutation rates, we have to add two additional parameters. Let s k = (xk-1, x k , xk+1) represent the current triplet of sequence x and y = (y1, y2, y3) another triplet in A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ 3. First, we introduce a general parameter r(s k , y) ≥ 0 to incorporate the additional mechanistic rates. Second, we introduce a site-specific scaling factor f k with k = 1, ..., l, such that:

1 l ⋅ ∑ 1 l f k = 1 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaacqWGSbaBaaGccqGHflY1daaeWbqaaiabdAgaMnaaBaaaleaacqWGRbWAaeqaaOGaeyypa0JaeGymaedaleaacqaIXaqmaeaacqWGSbaBa0GaeyyeIuoakiabc6caUaaa@3BC7@
(1)

We impose the usual restriction, that only one substitution per unit time is admissible [38, 39]. Moreover, Q k only allows for substitutions at site k. The diagonal elements of our instantaneous rate matrix Q k are defined by the mathematical requirement that the sum of each row is zero.

The entries of Q k are thus given by

Q k ( s k , y ) = f k â‹… { r ( s k , y ) â‹… Ï€ k ( y ) if  H ( s k , y ) = 1  and  x k ≠ y 2 − ∑ z ∈ A 3 z ≠ s k Q k ( s k , z ) if  H ( s k , y ) = 0 0 otherwise MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyuae1aaSbaaSqaaiabdUgaRbqabaGccqGGOaakcqWHZbWCdaWgaaWcbaGaem4AaSgabeaakiabcYcaSiabhMha5jabcMcaPiabg2da9iabdAgaMnaaBaaaleaacqWGRbWAaeqaaOGaeyyXIC9aaiqaaeaafaqaaeWacaaabaGaemOCai3aaSbaaSqaaiabcIcaOiabhohaZnaaBaaameaacqWHRbWAaeqaaSGaeiilaWIaeCyEaKNaeiykaKcabeaakiabgwSixlabec8aWnaaBaaaleaacqWGRbWAaeqaaOGaeiikaGIaeCyEaKNaeiykaKcabaGaeeyAaKMaeeOzayMaeeiiaaIaemisaGKaeiikaGIaeC4Cam3aaSbaaSqaaiabdUgaRbqabaGccqGGSaalcqWH5bqEcqGGPaqkcqGH9aqpcqaIXaqmcqqGGaaicqqGHbqycqqGUbGBcqqGKbazcqqGGaaicqWG4baEdaWgaaWcbaGaem4AaSgabeaakiabgcMi5kabdMha5naaBaaaleaacqaIYaGmaeqaaaGcbaGaeyOeI0YaaabuaeaacqWGrbqudaWgaaWcbaGaem4AaSgabeaakiabcIcaOiabhohaZnaaBaaaleaacqWGRbWAaeqaaOGaeiilaWIaeCOEaONaeiykaKcalqaabeqaaiabhQha6jabgIGioprtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bnaaCaaameqabaGaeG4mamdaaaWcbaGaeCOEaONaeyiyIKRaeC4Cam3aaSbaaWqaaiabdUgaRbqabaaaaSqab0GaeyyeIuoaaOqaaiabbMgaPjabbAgaMjabbccaGiabdIeaijabcIcaOiabhohaZnaaBaaaleaacqWGRbWAaeqaaOGaeiilaWIaeCyEaKNaeiykaKIaeyypa0JaeGimaadabaGaeGimaadabaGaee4Ba8MaeeiDaqNaeeiAaGMaeeyzauMaeeOCaiNaee4DaCNaeeyAaKMaee4CamNaeeyzaugaaaGaay5Eaaaaaa@A77B@
(2)

where π k (y) is the stationary frequency of y and the Hamming distance H(s k , y) counts the number of differences between the sites of the triplets s k and y.

In principle, we can choose any rate for the parameter r(s k , y). However, based on the requirement that we want to use the counted dinucleotide content as the stationary distribution, we choose r(s k , y) so that the model becomes reversible. Any parameter of the commonly used independent nucleotide substitution models, like HKY [25] or the general time-reversible model GTR [40] can be chosen for r(s k , y). For our application, we use a transition/transversion rate ratio and set r(s k , y)= κ for transitions and r(s k , y) = 1 for transversions.

The restriction that a substitution is only possible at site k leads to sparse rate matrices. Q k has only | A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ |4 non-zero entries. Hence, we can write Q k in the form of 16 submatrices, which describe the admissible substitutions for site k depending on the left y1 and right y3 neighbours,

y 1 A y 3 y 1 C y 3 y 1 G y 3 y 1 U y 3 y 1 A y 3 y 1 C y 3 y 1 G y 3 y 1 U y 3 ( ∗ π y 1 C y 3 κ π y 1 G y 3 π y 1 U y 3 π y 1 A y 3 ∗ π y 1 G y 3 κ π y 1 U y 3 κ π y 1 A y 3 π y 1 C y 3 ∗ π y 1 U y 3 π y 1 A y 3 κ π y 1 C y 3 π y 1 G y 3 ∗ ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaauaabeqafeaaaaqaaaqaaiabdMha5naaBaaaleaacqaIXaqmaeqaaOGaeeyqaeKaemyEaK3aaSbaaSqaaiabiodaZaqabaaakeaacqWG5bqEdaWgaaWcbaGaeGymaedabeaakiabboeadjabdMha5naaBaaaleaacqaIZaWmaeqaaaGcbaGaemyEaK3aaSbaaSqaaiabigdaXaqabaGccqqGhbWrcqWG5bqEdaWgaaWcbaGaeG4mamdabeaaaOqaaiabdMha5naaBaaaleaacqaIXaqmaeqaaOGaeeyvauLaemyEaK3aaSbaaSqaaiabiodaZaqabaaaaaGcbaqbaeqabiqaaaqaauaabeqabqaaaaqaaiadaciP+daadMha5nacaciP+daaBaaaleacaciP+daacWaGasQ=aaaIXaqmaeqcaciP+daaaOGamaiGK6paaeyqaeKamaiGK6paamyEaK3aiaiGK6paaSbaaSqaiaiGK6paaiadaciP+daaiodaZaqajaiGK6paaaaakeaacWafaoyEaK3aiqbGBaaaleacuaOamqbGigdaXaqajqbGaOGamqbGboeadjadua4G5bqEdGafaUbaaSqaiqbGcWafaI4mamdabKafacaakeaacWaPaoyEaK3aiqkGBaaaleacKcOamqkGigdaXaqajqkGaOGamqkGbEeahjadKc4G5bqEdGaPaUbaaSqaiqkGcWaPaI4mamdabKaPacaakeaacWaGaInaaaWG5bqEdGaGaInaaaWgaaWcbGaGaInaaaGamaiGydaaaGymaedabKaGaInaaaaakiadaci2aaaabwfavjadaci2aaaadMha5nacaci2aaaaBaaaleacaci2aaaacWaGaInaaaaIZaWmaeqcaci2aaaaaaaaaOqaamaabmaabaqbaeqabqabaaaaaeaacqGHxiIkaeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGhbWrcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGvbqvcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqGHxiIkaeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGhbWrcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGvbqvcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGbbqqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqGHxiIkaeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGvbqvcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGbbqqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGhbWrcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqGHxiIkaaaacaGLOaGaayzkaaaaaaaaaaa@14F0@
(3)

Finally, we scale Q k such that the number of substitutions d k equals 1:

d k = − ∑ z ∈ A 3 π k ( z ) ⋅ Q k ( z , z ) = 1 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemizaq2aaSbaaSqaaiabdUgaRbqabaGccqGH9aqpcqGHsisldaaeqbqaaiabec8aWnaaBaaaleaacqWGRbWAaeqaaOGaeiikaGIaeCOEaONaeiykaKIaeyyXICTaemyuae1aaSbaaSqaaiabdUgaRbqabaGccqGGOaakcqWH6bGEcqGGSaalcqWH6bGEcqGGPaqkcqGH9aqpcqaIXaqmaSqaaiabhQha6jabgIGioprtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bnaaCaaameqabaGaeG4mamdaaaWcbeqdcqGHris5aOGaeiOla4caaa@56CC@
(4)

and thus the total instantaneous substitution rate for a sequence x can be written as the sum over each rate Q k (s k , s k ) multiplied with the site-specific scaling factor f k , with k = 1, ..., l (Fig. 2),

q ( x ) = − ∑ k = 1 l f k ⋅ Q k ( s k , s k ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyCaeNaeiikaGIaeCiEaGNaeiykaKIaeyypa0JaeyOeI0YaaabCaeaacqWGMbGzdaWgaaWcbaGaem4AaSgabeaakiabgwSixlabdgfarnaaBaaaleaacqWGRbWAaeqaaOGaeiikaGIaeC4Cam3aaSbaaSqaaiabhUgaRbqabaGccqGGSaalcqWHZbWCdaWgaaWcbaGaeC4AaSgabeaakiabcMcaPaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemiBaWganiabggHiLdGccqGGUaGlaaa@4B34@
(5)

Without dependencies on the neighbours, Q k is of dimension 4 × 4 and the model reduces essentially to a HKY model with a specific rate for each site. We use this mononucleotide variant later in this paper for testing and comparison to the dinucleotide model.

Simulation

For the simulation process, we essentially used the same algorithm described previously [37] with some modifications. During the simulation process, we pick a site k with the relative mutability

P ( k ) = | f k â‹… Q k ( s k , s k ) | q ( x ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaaLaeiikaGIaem4AaSMaeiykaKIaeyypa0tcfa4aaSaaaeaadaabdaqaaiabdAgaMnaaBaaabaGaem4AaSgabeaacqGHflY1cqWGrbqudaWgaaqaaiabdUgaRbqabaGaeiikaGIaeC4Cam3aaSbaaeaacqWGRbWAaeqaaiabcYcaSiabhohaZnaaBaaabaGaem4AaSgabeaacqGGPaqkaiaawEa7caGLiWoaaeaacqWGXbqCcqGGOaakcqWH4baEcqGGPaqkaaGaeiilaWcaaa@4ADF@
(6)

and for the chosen site k, the nucleotide x k will replaced by a new nucleotide y2 ∈ A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ from the corresponding triplet y with probability:

P ( x k → y 2 ) = f k ⋅ Q k ( s k , y ) | f k ⋅ Q k ( s k , s k ) | = Q k ( s k , y ) | Q k ( s k , s k ) | MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaaLaeiikaGIaemiEaG3aaSbaaSqaaiabdUgaRbqabaGccqGHsgIRcqWG5bqEdaWgaaWcbaGaeGOmaidabeaakiabcMcaPiabg2da9KqbaoaalaaabaGaemOzay2aaSbaaeaacqWGRbWAaeqaaiabgwSixlabdgfarnaaBaaabaGaem4AaSgabeaacqGGOaakcqWHZbWCdaWgaaqaaiabdUgaRbqabaGaeiilaWIaeCyEaKNaeiykaKcabaWaaqWaaeaacqWGMbGzdaWgaaqaaiabdUgaRbqabaGaeyyXICTaemyuae1aaSbaaeaacqWGRbWAaeqaaiabcIcaOiabhohaZnaaBaaabaGaem4AaSgabeaacqGGSaalcqWHZbWCdaWgaaqaaiabdUgaRbqabaGaeiykaKcacaGLhWUaayjcSdaaaiabg2da9maalaaabaGaemyuae1aaSbaaeaacqWGRbWAaeqaaiabcIcaOiabhohaZnaaBaaabaGaem4AaSgabeaacqGGSaalcqWH5bqEcqGGPaqkaeaadaabdaqaaiabdgfarnaaBaaabaGaem4AaSgabeaacqGGOaakcqWHZbWCdaWgaaqaaiabdUgaRbqabaGaeiilaWIaeC4Cam3aaSbaaeaacqWGRbWAaeqaaiabcMcaPaGaay5bSlaawIa7aaaaaaa@737C@
(7)

In the most general SISSI framework Q k needs to be updated for all k sites every time one nucleotide in x is substituted. However, in our special case we can use the same instantaneous rate matrix Q k for each site with special conditions for r(s k , y). As a consequence, we can fix q(x) and do not need to sum over each rate of the site, which improves the running time of the algorithm.

Parameter estimation

The idea of our randomization procedure is to estimate a tree under the model described in the previous section and simulate sequences along this tree. Ideally, all parameters are estimated simultaneously within a maximum likelihood framework. One problem is the high number of parameters since we want to estimate a specific rate for each site. A more fundamental issue is, however, that our model includes overlapping dependencies which breaks the independence assumption necessary for basic maximum likelihood estimation. Other possible techniques like Markov chain Monte Carlo in a Bayesian framework are not a viable alternative either. Speed is a critical issue as the algorithm is meant to be applied to data on a genome wide scale.

Facing these difficulties, we have developed heuristic approximations to estimate the parameters and use a distance based approach to estimate the tree. The method is fast and yet surprisingly accurate for our application.

Equilibrium frequencies

The stationary frequencies of our model are set in a way that in equilibrium we obtain a dinucleotide frequency that is the same as the dinucleotide content of the alignment to be randomized. To this end, we first count the dinucleotide frequencies as an average of all sequences in the original alignment (see Methods on how gaps are treated). Then, we calculate the corresponding trinucleotide frequencies needed for Q k as a function of the single and dinucleotide frequencies using an approximation based on simple conditional probabilities [30, 32]:

π ( α β γ ) = π ( α β ) π ( β γ ) π ( β ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqiWdaNaeiikaGIaeqySdeMaeqOSdiMaeq4SdCMaeiykaKIaeyypa0tcfa4aaSaaaeaacqaHapaCcqGGOaakcqaHXoqycqaHYoGycqGGPaqkcqaHapaCcqGGOaakcqaHYoGycqaHZoWzcqGGPaqkaeaacqaHapaCcqGGOaakcqaHYoGycqGGPaqkaaaaaa@4893@
(8)

where π(αβγ) are the trinucleotide frequencies, π(αβ) and π(βγ) the counted dinucleotide frequencies and π(β) = Σ α π(αβ) = Σ α π(βα) with α, β, γ ∈ {A, C, G, U}.

Fig. 3A shows an example of the dinucleotide frequency distribution of 1000 simulated alignments. We counted the dinucleotide frequencies of an alignment of 7 5.8 rRNA sequences and set the trinucleotide parameters of our model accordingly. On average, we get the same dinucleotide frequencies in the simulated alignments as in the original one.

Figure 3
figure 3

Key concepts of the algorithm shown on an example alignment of 5.8S rRNA. (A) Distribution of dinucleotide frequencies of 1000 simulated alignments are shown as box-plots (the line in the box indicates the median, the borders of the box the 25th and 75th quartile, and the dotted lines 1.5× the interquartile range). Red circles show the frequencies observed in the original alignment. (B) Relationship between the number of substitutions and observed differences empirically determined by sampling of 25 points. Each point shows the average of 10 simulations. Note that the short distances are sampled more densely. These settings are the default values in our program and used throughout the paper. (C) Distribution of mean pairwise identities for 1000 random samples. The MPI of the original alignment is shown in red. (D) Comparison of site-wise MPIs in the original alignment and the average of the corresponding sites of 1000 random alignments.

Distances and tree construction

To build a distance based tree, we first have to estimate the number of substitutions that have taken place between two sequences. In other words, we have to estimate the genetic or evolutionary distance d from the Hamming distances p under our model. Both distances are different because back mutations have taken place that are not directly visible. To estimate the relationship between d and p, we simulate sequence pairs separated by different branch lengths d and calculate the corresponding Hamming distances p (Fig. 3B). We fit an exponential function to this curve:

p = a ^ ⋅ ( 1 − e b ^ ⋅ d ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiCaaNaeyypa0JafmyyaeMbaKaacqGHflY1cqGGOaakcqaIXaqmcqGHsislcqWGLbqzdaahaaWcbeqaaiqbdkgaIzaajaGaeyyXICTaemizaqgaaOGaeiykaKcaaa@3C48@
(9)

Using this function, all pairwise distances are calculated for the sequences in the original alignment. From this distance matrix a tree is constructed using the BIONJ algorithm [41]. BIONJ is a variant of the well known neighbour joining algorithm and currently one of the most accurate algorithms for distance based tree building.

Given that the distances and the tree are accurately estimated, we observe on average the same mean pairwise identity in the simulated alignment as in the original one. Fig. 3C shows the distribution of MPIs of 1000 simulations of our example rRNA alignment. The average MPI of the simulations is exactly the same as the MPI 0.73 of the original alignment.

Site-specific rates

Setting different mutation rates at different sites gives us the possibility to preserve natural mutation patterns of the original alignment. The problem of finding accurate site-specific rates is illustrated in Fig. 3D. For each site in the alignment, the MPI of this site is plotted against the average MPI observed in the simulated alignments on the same site. If we consider equal rates for all sites, each site will have the same average MPI which is of course equal to the overall MPI of 0.73 of the whole alignment. Ideally, the average MPI for each simulated site is the same as the original MPI at this site. In this case, the points in the plot are on a diagonal indicating that we have found accurate estimates for the rates.

Simple estimates for site-specific rates in combination with distance based trees have been described previously [42]. The method includes fits to a gamma distribution which requires data of at least 1000 nucleotides and 30 sequences to get reasonable results. Here we use a different approach that also gives good results for smaller alignments.

The substitution rate at a site is of course related to the observed sequence diversity at this site. If a site is highly conserved the rate is low, whereas high sequence diversity indicates a high mutation rate. So in a first step, we calculate the average number of pairwise differences ⟨p k ⟩ for each site k in the alignment with n sequences:

〈 p k 〉 = 2 n ( n − 1 ) ∑ i n ∑ j > i n δ i j k ; with  δ i j k = { 1 if nucleotides in sequences  i , j  differ at site  k 0 otherwise MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaaiabgMYiHlabdchaWnaaBaaaleaacqWGRbWAaeqaaOGaeyOkJeVaeyypa0tcfa4aaSaaaeaacqaIYaGmaeaacqWGUbGBcqGGOaakcqWGUbGBcqGHsislcqaIXaqmcqGGPaqkaaGcdaaeWbqaamaaqahabaGaeqiTdq2aa0baaSqaaiabdMgaPjabdQgaQbqaaiabdUgaRbaaaeaacqWGQbGAcqGH+aGpcqWGPbqAaeaacqWGUbGBa0GaeyyeIuoakiabcUda7aWcbaGaemyAaKgabaGaemOBa4ganiabggHiLdaakeaacqqG3bWDcqqGPbqAcqqG0baDcqqGObaAcqqGGaaicqaH0oazdaqhaaWcbaGaemyAaKMaemOAaOgabaGaem4AaSgaaOGaeyypa0ZaaiqaaeaafaqaaeGacaaabaGaeGymaedabaGaeeyAaKMaeeOzayMaeeiiaaIaeeOBa4MaeeyDauNaee4yamMaeeiBaWMaeeyzauMaee4Ba8MaeeiDaqNaeeyAaKMaeeizaqMaeeyzauMaee4CamNaeeiiaaIaeeyAaKMaeeOBa4MaeeiiaaIaee4CamNaeeyzauMaeeyCaeNaeeyDauNaeeyzauMaeeOBa4Maee4yamMaeeyzauMaee4CamNaeeiiaaIaemyAaKMaeiilaWIaemOAaOMaeeiiaaIaeeizaqMaeeyAaKMaeeOzayMaeeOzayMaeeyzauMaeeOCaiNaeeiiaaIaeeyyaeMaeeiDaqNaeeiiaaIaee4CamNaeeyAaKMaeeiDaqNaeeyzauMaeeiiaaIaem4AaSgabaGaeGimaadabaGaee4Ba8MaeeiDaqNaeeiAaGMaeeyzauMaeeOCaiNaee4DaCNaeeyAaKMaee4CamNaeeyzaugaaaGaay5Eaaaaaaaa@A812@
(10)

⟨p k ⟩ are observed differences ignoring multiple substitutions. If we naively choose our rates proportional to ⟨p k ⟩ we would underestimate high rates while overestimating low rates. We therefore use the relationship in equation 9 to correct for this effect and calculate estimates f ^ k MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOzayMbaKaadaWgaaWcbaGaem4AaSgabeaaaaa@2EC5@ for the rates at site k as follows:

f ^ k = 1 b ^ ⋅ ln ( 1 − 〈 p k 〉 a ^ ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOzayMbaKaadaWgaaWcbaGaem4AaSgabeaakiabg2da9KqbaoaalaaabaGaeGymaedabaGafmOyaiMbaKaaaaGccqGHflY1cyGGSbaBcqGGUbGBdaqadaqaaiabigdaXiabgkHiTKqbaoaalaaabaGaeyykJeUaemiCaa3aaSbaaeaacqWGRbWAaeqaaiabgQYiXdqaaiqbdggaHzaajaaaaaGccaGLOaGaayzkaaaaaa@43FD@
(11)

It must be pointed out that the site-specific rates change the relationship between genetic distance and observed differences (Fig. 3B). For correcting the site-specific rates we use the estimates for a ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyyaeMbaKaaaaa@2D30@ and b ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOyaiMbaKaaaaa@2D32@ from our model without site-specific rates. So this is only an approximation and one could think about iteratively refining the estimates. However, we found that this approach already yields accurate rates within one step as can be seen in Fig. 3D. Using the model with site-specific rates, the simulated alignments have on average almost exactly the same site-wise MPI as the original one.

The reader will notice that the first three points deviate from the diagonal. This illustrates a limitation of our method. With our simulation procedure we can on average only reach the level of saturation even if we use very high rates. It is possible, however, that the original data contains sites below the level of saturation. For example in a four way alignment a column can be ACGT, i.e. MPI = 0. However, we cannot simulate on average columns with MPI = 0, since the MPI is bounded below by zero and our simulations will always contain columns with MPI > 0. In practice this does not seem to cause any obvious problems in particular when we have many sequences where it is unlikely to see columns below saturation.

Gaps

Gaps have been ignored completely so far. There are evolutionary models including deletions and insertions [43–47] and, in principle, it would be possible to combine the insertion-deletion dynamics with our model. However, this does not appear practical in our case. Existing algorithms for joint estimation of phylogenies and alignments are not only very time-consuming [47], it also seems difficult to estimate reasonable indel model parameters on relatively short alignment blocks which hold only little information. Moreover, alignment programs produce gap patterns that do not necessarily reflect phylogenetically reasonable insertion/deletion events and thus cannot always be captured by an idealized model that is motivated by evolutionary processes and ignores algorithmic idiosyncrasies of alignment programs.

So we follow here a very pragmatic strategy that has also been used previously [5]: We keep exactly the same gap pattern in our randomized alignments as in the original alignment. To this end, we simply treat gaps as missing data and simulate nucleotide characters for the gapped positions. This is done in a way that the overall characteristics are not changed when they are replaced with gaps again at the end (see Methods for details).

Transition/transversion rate ratio

The transition/transversion rate ratio κ is a parameter in our model that cannot be simply counted as in the case of the dinucleotide frequencies, or empirically determined like the branch lengths. Given that the influence of this parameter is not that critical as for example the branch length or base composition (see Fig. 1), one possibility might be to use a fixed transition/transversion ratio if a reasonable average value is known for the genome at hand. Alternatively, we found that a good estimate can be obtained by using maximum likelihood on an independent mononucleotide model. We used here the HKY model with -distributed rates which is closest to our dinucleotide model.

Putting it together

Fig. 4 gives a short outline over the whole randomization procedure. We start by parametrization our model: we count the dinucleotides and calculate the corresponding stationary trinucleotide frequencies. A transition/transversion rate ratio for the alignment is estimated using maximum likelihood under a HKY+Γ model. Having set these parameters, we empirically estimate the relationship between substitutions and observed differences with equal rates for each site. This first estimate is used to calculate the site-specific rates, which are then used for the second estimation. In the next step, the pairwise distances between all sequences are calculated. For the calculation of the site-specific rates and the pairwise distances gap characters are treated in a special way as missing data (see Methods). From the distance matrix a tree is built using the BIONJ algorithm. An ancestral sequence is sampled from a first order Markov model parametrized according to the dinucleotide frequency in the original alignment. This is used as a starting sequence for the simulation that is guided by the tree. Finally, the gap pattern of the original alignment is introduced into the simulated one. Fig. 5 shows our rRNA example and two randomized versions obtained by this procedure.

Figure 4
figure 4

Overview of the algorithm. Left: The steps of the randomization procedure are shown. Right: In combination with RNAalifold consensus folding the randomization procedure can be used to calculate z-scores and to predict significant RNA structures. See text for details.

Figure 5
figure 5

Example of randomized alignments. Part of the example alignment used in Fig. 3 are shown. The grey bars indicate the level of local conservation. Exactly conserved sites are marked by asterisks.

Implementation

We implemented our method in ANSI C in a program called SISSIz. The source code is available under the GNU Public Licence for download [48].

Some words on running time: One might suspect that the randomization algorithm including two times the sampling procedure to estimate the parameters of equation 9 and the maximum likelihood estimation of the transition/transversion rate ratio is relatively slow. Indeed, it is much slower than for example randomization by shuffling, but still very fast. To build the model for our example of 7 rRNAs of 158 length takes 0.2 seconds on a modern Intel Core 2 Quad CPU at 2.4 GHz. To simulate 1000 alignments using this model takes another 0.6 seconds.

Testing

Randomizing vertebrate genomic alignments

We tested our randomization method on vertebrate genomic alignments. In a setting similar to recent genomic screens in vertebrates [11, 20], we extracted Multiz [49] alignment blocks from human chromosome 1. We randomly selected 1000 alignment blocks between 70 and 120 nt in length and between 4 and 10 sequences without considering annotation information of any sort. These alignments are meant to represent an unbiased "genomic background" that may also contain functional elements like coding exons or structured RNAs depending on their frequency in the genome.

The alignments were randomized using our new simulation procedure with both the dinucleotide and the mononucleotide model. In addition, we shuffled the alignments using shuffle-aln.pl. The global distribution of dinucleotides for the original and randomized data is shown in Fig. 6. As expected, the shuffling approach and the mononucleotide simulation give the same results. The dinucleotide distribution obtained by these methods, however, differs from the distribution in the native alignments. One can see for example the well known under-representation of CpGs in the native genomic data. Using our dinucleotide based model, we obtain simulated alignments which are statistically indistinguishable from the native data in terms of their average dinucleotide content.

Figure 6
figure 6

Dinucleotide frequencies of genomic alignments. 1000 vertebrate genome alignments were randomized using three different methods. The dinucleotide frequency of the native and randomized data is shown as box-plots.

Also the observed sequence diversity of the simulated alignments closely follows the original data as shown in Fig. 7. 98.7% of the simulated alignments are within a range of ± 0.05 mean pairwise identity compared to the original alignments. It must be noted, that the distribution in Fig. 7 has a mean of +0.007 which indicates a subtle bias of the simulations towards higher MPIs. We suspect that this is an indirect result of the way we estimate site-specific rates and related to the issue of sites below saturation discussed before. However, this deviation does not have any practical consequences since it represents a conservative bias in the context of RNA folding controls and, more importantly, seems to be too small to have any noticeable effect at all.

Figure 7
figure 7

Mean pairwise identity in randomized genomic alignments. The distribution of the difference of the mean pairwise identity between the original genomic alignments and the simulated ones (dinucleotide model) is shown.

Influence of randomization procedure on RNA predictions

The main motivation of this paper is to provide dinucleotide based controls for comparative RNA gene predictions. Therefore, we ran RNAalifold and RNAz on the alignments to demonstrate how different randomization procedures affect the results. Fig. 8A shows the distribution of RNAalifold consensus MFEs on the genomic alignments and their different randomizations. One can see that the genomic alignments show the most stable structures. There is a clear difference between the native genomic alignments and the shuffled and mononucleotide simulated ones. However, the folding energies of the dinucleotide simulated alignments are much closer to the native data. This difference between the di- and mononucleotide simulations reflects the bias caused by the genomic dinucleotide content. The difference between the native and the dinucleotide controls indicates the existence of RNA signals in the genome or, alternatively, another as yet unidentified bias.

Figure 8
figure 8

Influence of the randomization procedure on RNA predictions. (A) Cumulative frequency distribution of RNAalifold consensus folding energies for the native and randomized alignments. (B) Cumulative frequency distribution of RNAz scores. The "decision-value" is the result of the support vector machine classification. Positive values indicate a potential functional RNA while negative values indicate no significant fold. The positive tail is magnified.

Clearly, the differences shown here in these cumulative histograms might appear very subtle. The results for the RNAz predictions, however, show that such differences can strongly affect the statistics of RNA gene predictions (Fig. 8B). On this particular test set, RNAz predicts RNA signals in 4.3% of the native alignments. Using the conventional shuffling strategy or mononucleotide based simulation one would estimate a false positive rate of 0.8% or 0.7%, respectively. Using the more conservative dinucleotide based model the estimate would be 2.1%, i.e. three times higher. This is consistent with the results obtained by Babak et al. using their dinucleotide shuffling approach on pairwise alignments.

Calculating z-scores to predict structural RNAs

We can directly assess the significance of a predicted RNA by calculating a z-score. The folding energy of the native data m and the mean μ and standard deviation σ of randomized data is calculated. The significance of the native fold can then be expressed as z = (m - μ)/σ, i.e. the number of standard deviations from the mean. This score has been repeatedly used on single sequences applying mono- or dinucleotide shuffling or simulation using a zero or first order Markov model [23, 24]. Using shuffled alignments as null model, this approach is implemented in the RNA gene finding program AlifoldZ [5]. The same strategy can be used in combination with our new dinucleotide base randomization strategy without any further modifications (Fig. 4).

To test the effectiveness of this approach, we conducted a benchmark similar to those used previously [5, 6] for testing AlifoldZ and RNAz. We used multiple sequence alignments of eight different structural RNA families taken from the Rfam database [50]. The alignments contained three to six sequences and had a mean pairwise identity between 50% and 100% (see Methods for details). For the tests of AlifoldZ and RNAz, shuffled alignments were used as negative controls. For obvious reasons, this is not possible here. So we used genomic alignments from random locations of the human genome (see Methods). Using the "genomic background" as negative controls in this test implies the assumption that the genome does not contain any structural RNAs at all, which is clearly not valid. However, if we assume true structural RNAs to be sparse in the genome this conservative assumption seems to be a sensible choice.

We calculated z-scores with a sample size of 1000 randomizations for both sets of true structured RNAs and the genomic background using three different randomization methods: Shuffling (AlifoldZ), simulation using a mononucleotide model (SISSIz mono) and simulation using the dinucleotide model (SISSIz di). The results are summarized in Tab. 1.

Table 1 z-scores and classification performance

Using mononucleotide based randomization the z-scores of the genomic background are approximately half a standard deviation from zero (-0.44 and -0.58, for shuffling and mononucleotide simulation respectively). This shows the relatively strong "bias" of the genomic background that causes false positive predictions as shown in the previous section and in reference 21. Albeit the signal does not vanish completely, the dinucleotide based z-scores are much closer to zero (-0.15).

The z-scores of the structural RNAs in this test set are on average well below -4 indicating a clear structural signal. Also here, we observe that mononucleotide simulated z-scores are lower than the dinucleotide simulated ones. In this case, a dinucleotide content that favors stable RNA structures is clearly not only a general background effect of the genomic base composition but a feature of structural RNAs. However, this signal is lost if the more conservative dinucleotide based null model is used.

There is also a clear difference between the two monunucleotide randomization procedures: Shuffling leads to more significant z-scores than simulation. The main reason is the fact that simulation results in higher standard deviations than shuffling which in turn lead to more conservative z-scores.

This shows that there are many effects that have to be taken into account. To assess the overall classification performance we generated receiver operating characteristic curves based on the three different z-scores, as well as the support vector machine score from RNAz (Fig. 9). In addition, we calculated the sensitivity at two different levels of specificity (0.01 and 0.05) for all four approaches (Tab. 1).

Figure 9
figure 9

Accuracy of z -score based classification of structured RNAs. As positive examples, alignments from eight different classes of structural RNAs were used. As negative examples, random locations from genome wide vertebrate alignments were chosen. ROC curves are shown in dependence on the null model used. In addition, the results of the RNAz support vector machine are shown. The region of high specificity which is of special interest is magnified.

The ROC curve shows that all the methods perform very well on this test set. The curve further suggests that there is not much difference between them. However, differences become evident when looking at the region of high specificity, the only relevant region for practical applications (see inset Fig. 9). Here, the dinucleotide based approach generally outperforms the mononucleotide based methods. The improvement is small but clearly noticeable: At a false positive rate of 0.01%, dinucleotide based simulation shows the highest sensitivity for 7 of the 8 RNA classes. For example, in the tRNA group the sensitivity is 13% higher than AlifoldZ and RNAz. The latter performs significantly worse than all other methods at this level. At a false positive rate of 0.05%, dinucleotide simulation still performs slightly better than mononucleotide shuffling/simulation but is on the same level as RNAz that performs significantly better here.

Discussion

Any experiment is only as good as its controls. What is true for experimental biology clearly also holds in the field of computational biology. The value of even the most sophisticated algorithm remains unclear if the significance of the results cannot be assessed properly. In this paper we addressed the problem of finding an adequate control strategy for comparative noncoding RNA predictions, which are started to get widely used for genome annotation.

Babak et al. demonstrated that currently used null models based on mononucleotide shuffling lead to an underestimation of the false positive rate in such screens. Although single opinions may be different [51], it is generally accepted that in the context of RNA gene prediction one should consider dinucleotide content as "background" rather than "signal". However, while there have been dinucleotide controlled randomization algorithms for single sequences for more than 20 years, it is a non-trivial problem in the case of multiple sequence alignments.

Here we devised a simulation procedure that produces alignments that have on average a given dinucleotide frequency and sequence diversity (globally and locally). The corresponding model needs to be relatively complex including overlapping dependencies and site-specific rates. Clearly, this model with a high number of parameters would not be a reasonable choice for use in phylogenetic analysis, but it turned out to be a good choice for this specific application.

We have to use heuristics and simplifications to estimate the tree and parameters for this model in reasonable time. The accuracy of our approach is measured in terms of how well the simulations reflect the properties of the original data. In this respect, we found that our strategy performs very well. Again, phylogenetic analysis was not the goal here, but some of the techniques introduced here might be of interest in this context. For example, we found that in the mononucleotide case our estimations for site-specific rates are surprisingly competitive when compared to the currently best maximum likelihood methods (data not shown).

The influence of the null model for genomic RNA predictions was found to be remarkable. Consistent with Babak and colleagues' findings on pairwise alignments, we observed three times more false positives using dinucleotide controls than using mononucleotide controls. This clearly shows that the new approach should be the method of choice to get more sensible estimates of the significance of comparative RNA predictions.

The next obvious step, is to use the new null model to improve current RNA gene prediction algorithms. In analogy to AlifoldZ, we combined our new simulation procedure with the RNAalifold consensus structure prediction algorithm. SISSIz calculates z-scores that are not biased by the genomic dinucleotide content and it is thus the first comparative gene finding program, that explicitely corrects for this effect. However, by using this conservative null model we also loose part of the signal in true structured RNAs. This might be the main reason, why the observed improvements in the overall classification performance were only relatively small.

In general, the support vector machine approach used by RNAz is preferable over the AlifoldZ approach, since it is orders of magnitude faster. However, it turned out to be difficult to create a dinucleotide based version of RNAz mainly for two reasons. Until now, there was no way to produce a dinucleotide controlled negative test set that is necessary for training the two class support vector machine [6]. With the method presented here, we have solved this problem and it is now possible to create test sets with specific dinucleotide properties. However, it remains an unsolved question how to compute dinucleotide based z-scores efficiently without shuffling. RNAz uses a regression approach to solve this problem for mononucleotides, which, unfortunately, does not scale well to the high dimensional dinucleotide case.

A promising alternative to the thermodynamic RNA prediction methods used in this paper, are probabilistic methods. The EvoFold algorithm [7] uses phylogenetic stochastic context-free grammars and, in its core, depends on a null model which is essentially an independent mononucleotide model. Since the folding grammar of EvoFold does not explicitely model stacking interactions there is no need for using a null model with overlapping dinucleotides as we have described here. However, also EvoFold was found to be affected to some degree by the dinucleotide content for reasons that are not immediately obvious [21]. A dinucleotide background model together with an advanced folding grammar that considers stacks can thus be expected to improve performance. However, it would require considerable effort to include such a null model into the sophisticated probabilistic framework of EvoFold.

Finally, we want to add that our randomization algorithm is not only of interest in the context of RNA gene prediction. It can be used for other comparative genomics applications whenever random alignments are needed as control. One could consider other applications in the context of RNA structures (e.g. prediction of conserved miRNA target sites) but also in different context (e.g. conserved sequence motifs). Currently our software implements a mono- and dinucleotide model which should be sufficient for many applications. In principle, however, it is also possible to consider higher order correlations within this framework.

Methods

Treating gaps

Gapped positions are treated as missing data. When counting the dinucleotide content, dinucleotides including a gap (N-, -N, --) are ignored. During simulation, gap positions are filled with nucleotides and gaps are re-introduced afterwards. Note that this way, if two nucleotides N1 and N2 are separated by a gap (e.g. N1----N2) the dinucleotide N1N2 is not in equilibrium. Depending on how gaps are treated in the downstream analysis this might be or might not be of concern. In any case, since not every gap position but only every gap opened is affected, this (potential) error is generally very small for reasonable alignments. So we did not consider correcting for this effect which would require reconstructing the gap history and setting lineage specific neighbourhood systems.

For calculating the site-specific rates, we also treat gaps as missing data and calculate ⟨p k ⟩ in eq. 10 only over non-gap positions. After the simulation, the whole column has on average ⟨p k ⟩ estimated from the non-gap positions that does not change when originally gapped positions are masked again. For calculating the observed differences p between two sequences we set positions that includes gaps to the average ⟨p k ⟩ at this site.

Distances above the level of saturation

When calculating genetic distances between two sequences the problem may occur that the observed number of differences is higher than the level of saturation. We found that this problem becomes severe when considering site-specific rates that generally lead to much lower levels of saturation (cf. Fig. 3B). We use a simple trick to overcome this limitation. We add additional sites during the simulation with site-specific rates that correspond to the average of the whole alignment (i.e. ⟨p k ⟩ is set to 1-MPI in eq. 11 for all these additional sites). They act as "buffer sites" that reduces the number of mutation events that repeatedly hit the same sites of high rate leading to many double substitutions. As a consequence, the overall level of observed differences is higher and we do not run into problems building the distance matrix. In the end, the sites are removed again and since the relative rate ratios between the sites remained unchanged, we get the desired site-specific mutation patterns.

Limiting base composition variation

During the testing of the influence of the randomization procedure on RNA folding, we made an interesting observation. As expected, the variance of the folding energies of randomized data is higher with simulation than with shuffling. However, we also observed that there is difference in the mean. Simulation leads to slightly higher (i.e. less stable) folding energies than shuffling. We observed this behaviour not only on multiple alignments but also on single sequences using shuffling vs. first order Markov simulation. We suspect that extreme deviations in the base composition that can occur in simulated data do not symmetrically lead to the same deviations of the folding energies but preferentially impair the formation of RNA structures. To compensate for this effect, we have introduced an option in our software that only outputs simulated alignments, that are within a specific range of mononucleotide frequencies. We can thus limit our simulations to mononucleotide frequencies that are almost exactly as in the original data. As a distance measure we use the Euclidean distance ∑ α ∈ A , G , C , T ( π α s − π α ) 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaabeaeaadaGcaaqaaiabcIcaOiabec8aWnaaDaaaleaacqaHXoqyaeaacqWGZbWCaaGccqGHsislcqaHapaCdaWgaaWcbaGaeqySdegabeaakiabcMcaPmaaCaaaleqabaGaeGOmaidaaaqabaaabaGaeqySdeMaeyicI4SaemyqaeKaeiilaWIaem4raCKaeiilaWIaem4qamKaeiilaWIaemivaqfabeqdcqGHris5aaaa@4434@ with π α the desired frequency of nucleotide α in the original alignment, and π α s MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqiWda3aa0baaSqaaiabeg7aHbqaaiabdohaZbaaaaa@30CD@ the observed frequency in the simulation. For all the data shown in Figs. 8, 9 and Tab. 1 we used simulations with this cutoff set to 0.05.

Software

For the simulations in Fig. 1 we used seq-gen version 1.3.2 [52, 53]. Monunucleotide shuffling was carried out using shuffle-aln.pl with option "--mode conservative2". Together with alifoldz.pl it is available online [54]. For the tests in Figs. 1 and 8 we used RNAalifold from the Vienna RNA package [55] version 1.6.1. with options "-nc 0 -cv 0") and RNAz [56] version 1.0 with standard parameters. For implementation of our software we used a series of third party C-code that is available as open source: levmar [57] by Manolis Lourakis for least squares fitting, BIONJ [41, 58] by Olivier Gascuel, PHYML [59, 60] by Stéphane Guindon and Olivier Gascuel for maximum likelihood estimation of the transition/transversion rate, Vienna RNA package [55] by Ivo L. Hofacker and others for consensus folding in SISSIz.

Sequence data

For the benchmark we used sequences from the following eight Rfam families: RF00001 (5S rRNA), RF00004 (U2 snRNA), RF00005 (tRNA), RF00008 (Hammerhead ribozyme), RF00012 (U3 snRNA), RF00020 (U5 snRNA), RF00029 (Group II intron), RF00104 (mir-10 precursor). From these sequences, a set of non-redundant alignments between 3 and 6 sequences per alignment and mean pairwise identity between appr. 50% and 100% was created as described [5, 6]. The families were chosen because they represent different structural families and contain enough sequences to create sets of reasonable sample size.

Genomic alignments were extracted from Multiz 17-way vertebrate alignments available at the UCSC genome browser [61, 62]. For creating the set of 1000 alignments used for Figs. 6 and 8, we used the rnazWindow.pl script from the RNAz software package [56, 63] to get typical alignment blocks as used previously in genomic ncRNA screens [20] or [14]. For the benchmark we selected for each positive example of the structural RNA set a negative example from the genomic alignments. Subsets of sequences were chosen to get the same number of sequences and the same mean pairwise identity (± 0.05) as the structural RNA counterpart. Also the alignment length was adjusted accordingly (limited to a maximum length of 150).

References

  1. Griffiths-Jones S: Annotating noncoding RNA genes. Annu Rev Genomics Hum Genet 2007, 8: 279–298. 10.1146/annurev.genom.8.080706.092419

    Article  CAS  PubMed  Google Scholar 

  2. Athanasius F Bompfünewerer Consortium, Backofen R, Bernhart SH, Flamm C, Fried C, Fritzsch G, Hackermüller J, Hertel J, Hofacker IL, Missal K, Mosig A, Prohaska SJ, Rose D, Stadler PF, Tanzer A, Washietl S, Will S: RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zoolog B Mol Dev Evol 2007, 308: 1–25. 10.1002/jez.b.21130

    Article  Google Scholar 

  3. Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2: 8–8. 10.1186/1471-2105-2-8

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Coventry A, Kleitman DJ, Berger B: MSARi: multiple sequence alignments for statistical detection of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101(33):12102–12107. 10.1073/pnas.0404193101

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Washietl S, Hofacker IL: Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J Mol Biol 2004, 342: 19–30. 10.1016/j.jmb.2004.07.018

    Article  CAS  PubMed  Google Scholar 

  6. Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA 2005, 102(7):2454–2459. 10.1073/pnas.0409169102

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  7. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2006., 2(4):

  8. Yao Z, Weinberg Z, Ruzzo WL: CMfinder-a covariance model based RNA motif finding algorithm. Bioinformatics 2006, 22(4):445–452. 10.1093/bioinformatics/btk008

    Article  CAS  PubMed  Google Scholar 

  9. Uzilov AV, Keegan JM, Mathews DH: Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics 2006, 7: 173. 10.1186/1471-2105-7-173

    Article  PubMed Central  PubMed  Google Scholar 

  10. Torarinsson E, Sawera M, Havgaard JH, Fredholm M, Gorodkin J: Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res 2006, 16(7):885–9. 10.1101/gr.5226606

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  11. Washietl S, Hofacker IL, Lukasser M, Hüttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol 2005, 23(11):1383–1390. 10.1038/nbt1144

    Article  CAS  PubMed  Google Scholar 

  12. Missal K, Rose D, Stadler PF: Non-coding RNAs in Ciona intestinalis. Bioinformatics 2005, 21(Suppl 2):ii77–78. 10.1093/bioinformatics/bti1113

    Article  CAS  PubMed  Google Scholar 

  13. Missal K, Zhu X, Rose D, Deng W, Skogerbo G, Chen R, Stadler PF: Prediction of structured non-coding RNAs in the genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. J Exp Zoolog B Mol Dev Evol 2006, 306(4):379–392. 10.1002/jez.b.21086

    Article  Google Scholar 

  14. Rose D, Hackermueller J, Washietl S, Reiche K, Hertel J, Findeiss S, Stadler PF, Prohaska SJ: Computational RNomics of Drosophilids. BMC Genomics 2007, 8: 406. 10.1186/1471-2164-8-406

    Article  PubMed Central  PubMed  Google Scholar 

  15. Axmann IM, Kensche P, Vogel J, Kohl S, Herzel H, Hess WR: Identification of cyanobacterial non-coding RNAs by comparative genome analysis. Genome Biol 2005., 6(9):

  16. Weile C, Gardner PP, Hedegaard MM, Vinther J: Use of tiling array data and RNA secondary structure predictions to identify noncoding RNA genes. BMC Genomics 2007, 8: 244–244. 10.1186/1471-2164-8-244

    Article  PubMed Central  PubMed  Google Scholar 

  17. del Val C, Rivas E, Torres-Quesada O, Toro N, Jiménez-Zurdo JI: Identification of differentially expressed small non-coding RNAs in the legume endosymbiont Sinorhizobium meliloti by comparative genomics. Mol Microbiol 2007, 66(5):1080–1091. 10.1111/j.1365-2958.2007.05978.x

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  18. Mourier T, Carret C, Kyes S, Christodoulou Z, Gardner PP, Jeffares DC, Pinches R, Barrell B, Berriman M, Griffiths-Jones S, Ivens A, Newbold C, Pain A: Genome-wide discovery and verification of novel structured RNAs in Plasmodium falciparum. Genome Res 2007.

    Google Scholar 

  19. Sandmann T, Cohen SM: Identification of Novel Drosophila melanogaster MicroRNAs. PLoS ONE 2007., 2(11):

    Google Scholar 

  20. Washietl S, Pedersen JS, Korbel JO, Stocsits C, Gruber AR, Hackermüller J, Hertel J, Lindemeyer M, Reiche K, Tanzer A, Ucla C, Wyss C, Antonarakis SE, Denoeud F, Lagarde J, Drenkow J, Kapranov P, Gingeras TR, Guigó R, Snyder M, Gerstein MB, Reymond A, Hofacker IL, Stadler PF: Structured RNAs in the ENCODE selected regions of the human genome. Genome Res 2007, 17(6):852–864. 10.1101/gr.5650707

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Babak T, Blencowe BJ, Hughes TR: Considerations in the identification of functional RNA structural elements in genomic alignments. BMC Bioinformatics 2007, 8: 33. 10.1186/1471-2105-8-33

    Article  PubMed Central  PubMed  Google Scholar 

  22. Altschul SF, Erickson BW: Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 1985, 2(6):526–538.

    CAS  PubMed  Google Scholar 

  23. Clote P, Ferré F, Kranakis E, Krizanc D: Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 2005, 11(5):578–591. 10.1261/rna.7220505

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  24. Workman C, Krogh A: No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res 1999, 27(24):4816–4822. 10.1093/nar/27.24.4816

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  25. Hasegawa M, Kishino H, Yano T: Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA. J Mol Evol 1985, 22: 160–174. 10.1007/BF02101694

    Article  CAS  PubMed  Google Scholar 

  26. Hofacker IL, Fekete M, Stadler PF: Secondary structure prediction for aligned RNA sequences. J Mol Biol 2002, 319(5):1059–1066. 10.1016/S0022-2836(02)00308-X

    Article  CAS  PubMed  Google Scholar 

  27. Felsenstein J: Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates; 2004.

    Google Scholar 

  28. Tavaré S: Some probabilistic and statistical problems on the analysis of DNA sequences. Lec Math Life Sci 1986, 17: 57–86.

    Google Scholar 

  29. Jensen J, Pedersen AM: Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv Appl Prob 2000, 32: 499–517. 10.1239/aap/1013540176

    Article  Google Scholar 

  30. Duret L, Galtier N: The covariation between TpA deficiency, CpG deficiency, and G+C content of human isochores is due to a mathematical artifact. Mol Biol Evol 2000, 17(11):1620–1625.

    Article  CAS  PubMed  Google Scholar 

  31. Pedersen AM, Jensen J: A dependent rates model and MCMC based methodology for the maximum likelihood analysis of sequences with overlapping reading frames. Mol Biol Evol 2001, 18: 763–776.

    Article  CAS  PubMed  Google Scholar 

  32. Arndt PF, Burge CB, Hwa T: DNA sequence evolution with neighbor-dependent mutation. J Comput Biol 2003, 10: 313–322. 10.1089/10665270360688039

    Article  CAS  PubMed  Google Scholar 

  33. Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL: Protein Evolution with Dependence Among Codons Due to Tertiary Structure. Mol Biol Evol 2003, 20: 1692–1704. 10.1093/molbev/msg184

    Article  CAS  PubMed  Google Scholar 

  34. Siepel A, Haussler D: Phylogenetic Estimation of Context-Dependent Substitution Rates by Maximum Likelihood. Mol Biol Evol 2004, 21: 468–488. 10.1093/molbev/msh039

    Article  CAS  PubMed  Google Scholar 

  35. Lunter G, Hein J: A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics 2004, 20: i216-i223. 10.1093/bioinformatics/bth901

    Article  CAS  PubMed  Google Scholar 

  36. Christensen OF: Pseudo-likelihood for non-reversible nucleotide substitution models with neighbor dependent rates. Stat Appl Genet Mol Biol 2006, 5: 1–29.

    Google Scholar 

  37. Gesell T, von Haeseler A: In silico sequence evolution with site-specific interactions along phylogenetic trees. Bioinformatics 2006, 22: 716–722. 10.1093/bioinformatics/bti812

    Article  CAS  PubMed  Google Scholar 

  38. Schöniger M, von Haeseler A: A Stochastic Model for the Evolution of Autocorrelated DNA sequences. Mol Phylogenet Evol 1994, 3: 240–247. 10.1006/mpev.1994.1026

    Article  PubMed  Google Scholar 

  39. Schöniger M, von Haeseler A: Simulating efficiently the evolution of DNA sequences. Comput Appl Biosci 1995, 11: 111–115.

    PubMed  Google Scholar 

  40. Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evo 1984, 20: 86–93. 10.1007/BF02101990

    Article  CAS  Google Scholar 

  41. Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 1997, 14(7):685–695.

    Article  CAS  PubMed  Google Scholar 

  42. Peer Y, Baldauf SL, Doolittle WF, Meyer A: An updated and comprehensive rRNA phylogeny of (crown) eukaryotes based on rate-calibrated evolutionary distances. J Mol Evol 2000, 51(6):565–576.

    Article  PubMed  Google Scholar 

  43. Thorne J, Kishino H, Felsenstein J: An Evolutionary Model for Maximum Likelihood Alignment of DNA Sequences. J Mol Evol 1991, 33: 114–124. 10.1007/BF02193625

    Article  CAS  PubMed  Google Scholar 

  44. Thorne J, Kishino H, Felsenstein J: Inching toward reality: An improved likelihood model of sequence evolution. J Mol Evol 1992, 34: 3–16. 10.1007/BF00163848

    Article  CAS  PubMed  Google Scholar 

  45. Metzler D: Statistical alignment based on fragment insertion and deletion models. Bioinformatics 2003, 19: 490–499. 10.1093/bioinformatics/btg026

    Article  CAS  PubMed  Google Scholar 

  46. Miklós I, Lunter G, Holmes I: A "Long Indel" Model For Evolutionary Sequence Alignment. Mol Biol Evol 2004, 21: 529–540. 10.1093/molbev/msh043

    Article  PubMed  Google Scholar 

  47. Fleißner R, Metzler D, von Haeseler A: Simultaneous Statistical Alignment and Phylogeny Reconstruction. Syst Biol 2005, 54: 548–561. 10.1080/10635150590950371

    Article  PubMed  Google Scholar 

  48. SISSIz[http://sourceforge.net/projects/sissiz]

  49. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14(4):708–715. 10.1101/gr.1933104

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  50. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 2005, (33 Database):D121–4.

    Google Scholar 

  51. Forsdyke DR: Calculation of folding energies of single-stranded nucleic acid sequences: conceptual issues. J Theor Biol 2007, 248(4):745–753. 10.1016/j.jtbi.2007.07.008

    Article  CAS  PubMed  Google Scholar 

  52. Seq-Gen[http://tree.bio.ed.ac.uk/software/seqgen]

  53. Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 1997, 13(3):235–238.

    CAS  PubMed  Google Scholar 

  54. AlifoldZ/shuffle-aln[http://www.tbi.univie.ac.at/papers/SUPPLEMENTS/Alifoldz]

  55. The Vienna RNA package[http://www.tbi.univie.ac.at/~ivo/RNA]

  56. RNAz – predicting structural noncoding RNAs[http://www.tbi.univie.ac.at/~wash/RNAz]

  57. levmar: Levenberg-Marquardt nonlinear least squares algorithms in C/C++[http://www.ics.forth.gr/~lourakis/levmar]

  58. BIONJ[http://www.lirmm.fr/~w3ifa/MAAS/BIONJ/]

  59. PhyML[http://atgc.lirmm.fr/phyml]

  60. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52(5):696–704. 10.1080/10635150390235520

    Article  PubMed  Google Scholar 

  61. UCSC genome browser[http://genome.ucsc.edu]

  62. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 2007.

    Google Scholar 

  63. Washietl S: Prediction of Structural Noncoding RNAs With RNAz. Methods Mol Biol 2007, 395: 503–526.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank Arndt von Haeseler, Ivo Hofacker, and Nick Goldman for discussions and Roland Fleißner for comments on the manuscript. We acknowledge funding from the Austrian GEN-AU projects "noncoding RNA" and "Bioinformatics Integration Network" and Financial support to the CIBIV institute from the Wiener Wissenschafts-, Forschungs- and Technologiefonds (WWTF).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Stefan Washietl.

Additional information

Authors' contributions

Both authors contributed equally.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Gesell, T., Washietl, S. Dinucleotide controlled null models for comparative RNA gene prediction. BMC Bioinformatics 9, 248 (2008). https://doi.org/10.1186/1471-2105-9-248

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-9-248

Keywords