Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Research article

Effects of using coding potential, sequence conservation and mRNA structure conservation for predicting pyrrolysine containing genes

Christian Theil Have*, Sine Zambach* and Henning Christiansen

Author Affiliations

Research group PLIS: Programming, Logic and Intelligent Systems, Department of Communication, Business and Information Technologies, Roskilde University, P.O. Box 260, Roskilde, DK-4000, Denmark

For all author emails, please log on.

BMC Bioinformatics 2013, 14:118  doi:10.1186/1471-2105-14-118

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/14/118


Received:11 July 2012
Accepted:19 March 2013
Published:4 April 2013

© 2013 Theil Have et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Pyrrolysine (the 22nd amino acid) is in certain organisms and under certain circumstances encoded by the amber stop codon, UAG. The circumstances driving pyrrolysine translation are not well understood. The involvement of a predicted mRNA structure in the region downstream UAG has been suggested, but the structure does not seem to be present in all pyrrolysine incorporating genes.

Results

We propose a strategy to predict pyrrolysine encoding genes in genomes of archaea and bacteria. We cluster open reading frames interrupted by the amber codon based on sequence similarity. We rank these clusters according to several features that may influence pyrrolysine translation. The ranking effects of different features are assessed and we propose a weighted combination of these features which best explains the currently known pyrrolysine incorporating genes. We devote special attention to the effect of structural conservation and provide further substantiation to support that structural conservation may be influential – but is not a necessary factor. Finally, from the weighted ranking, we identify a number of potentially pyrrolysine incorporating genes.

Conclusions

We propose a method for prediction of pyrrolysine incorporating genes in genomes of bacteria and archaea leading to insights about the factors driving pyrrolysine translation and identification of new gene candidates. The method predicts known conserved genes with high recall and predicts several other promising candidates for experimental verification. The method is implemented as a computational pipeline which is available on request.

Background

Over the past two decades, the standard genetic code has been revised to include the two new amino acids, selenocysteine and pyrrolysine. These amino acids are, under certain circumstances, encoded by codons that are normally stop codons. Translation of these codons can be influenced by the mRNA structure. This is the case for selenocysteine where a cis-acting mRNA structure (SECIS) drives translation of the opal stop codon (UGA) as selenocysteine. Similarly, a structure (PYLIS) has been identified in some genes where pyrrolysine is encoded by the (usual) stop codon UAG [1]. The structure lies in the region between the UAG codon and approximately 100 bp downstream. The role of the structure in translation is unclear and it is only conserved among some pyrrolysine incorporating genes [2]. Zhang et al. [2] suggest that either a complete recoding of the UAG codon as pyrrolysine occurs or alternatively that UAG serves a dual function in pyrrolysine incorporating organisms; termination and translation competes leading to "statistical proteins" where both terminated and elongated products occur, but where the amounts of protein products may depend on circumstances.

The latter possibility is substantiated by an in vitro study [3] where the components necessary for pyrrolysine synthesis are inserted into E. coli. The study shows that the PYLIS structure is not essential for translation of pyrrolysine incorporating genes, but also concludes that the presence of the structure results in a higher amount of pyrrolysine incorporating protein products and that synonymous codon mutations in the PYLIS sequence result in lesser amounts.

The translation of pyrrolysine is associated with methane metabolism. All known organisms with methane metabolism have pyrrolysine incorporating methyltransferases, that initiate the transfer of methyl groups from methyl amines and into a process of which methane is the result [4,5]. Three distinct methyl trans-ferases have been identified — monomethylamine methyltransferase (mtmB), dimethylamine methyltrans-ferase (mtbB), and trimethylamine methyltransferase (mttB) — each of which allows metabolism of different kinds of methyl amines [6]. Each transferase catalyzes the transfer of a methyl group from mono-, di- or trimethylamine to each of their respective corrinoid cofactors, and the three methyltransferases are not all present in all methane producing organisms. It has been hypothesized that the availability of methyl amines regulates translation of UAG as pyrrolysine [2].

While selenocysteine is translated in a broad variety of organisms including archaea, bacteria and eukaryotes, pyrrolysine translation was up until recently believed to occur only in a few microbes, but it has recently been observed in a somewhat larger number of genomes within archea and bacteria [7]. So far, approximately 16 species are known to have pyrrolysine-containing genes.

Methods for identification of selenocysteine encoding genes based on detection of the SECIS structural motif are quite successful [8]. Such methods may successfully identify genes with a PYLIS structure, but have difficulties predicting pyrrolysine incorporating genes without the consensus structure. Previous com-putational methods for detection of pyrrolysine genes, e.g., [9,10], are based on homology search. These methods do not consider structural conservation and codon sequence composition of downstream regions for predicting pyrrolysine incorporating genes. In this paper we introduce an approach which takes all these factors in account. Our approach does not assume a particular consensus structure to be present, but is capable of taking conserved structure in the region downstream UAG into account.

Methods

An flowchart illustrating our method is shown in Figure 1. Details of the steps involved are described in the following subsections.

thumbnailFigure 1. Illustration of the pipeline for identifying pyrrolysine containing genes. The process extracts iORFs which are then clustered using blast. Finally, the clusters are ranked according to several features.

Identification of relevant organisms

We identify organisms of interest by searching for a tRNApyl synthetase using blast [11]. Additionally, we search for the tRNApyl by creating a structure profile of the tRNApyl using ClustalW [12] and RNAalifold [13]. This profile is used to screen the genomes with Infernal [14]. The genomes verified to have the tRNApyl and for which complete assembled genomes are available are used for further investigations. These are, in alphabetical order:

• Acetohalobium arabaticum [RefSeq:NC_014378.1]

• Desulfitobacterium hafniense [RefSeq:NC_011830.1]

• Desulfobacterium autotrophicum [RefSeq:NC_012108.1]

• Desulfosporosinus orientis [RefSeq:NC_016584.1]

• Desulfotomaculum acetoxidans [RefSeq:NC_013216.1]

• Methanococcoides burtonii [RefSeq:NC_007955.1]

• Methanohalobium evestigatum [RefSeq:NC_014253.1]

• Methanohalophilus mahii [RefSeq:NC_014002.1]

• Methanosalsum zhilinae [RefSeq:NC_015676.1]

• Methanosarcina acetivorans [RefSeq:NC_003552.1]

• Methanosarcina barkeri [RefSeq:NC_007355.1]

• Methanosarcina mazei [RefSeq:NC_003901.1]

• Thermincola potens [RefSeq:NC_014152.1]

A summary of the genome screening can be found in Table 1.

Table 1. Table of potential Pyl-coding organisms and their pyrrolysine connection

Extraction of interrupted ORFs

We adopt the terminology Interupted ORFs (iORFs) from Chaudhuri and Yeates [9] and in a similar vein, we extract iORFs from the genomes of interest. Interrupted ORFs are like traditional ORFs except that they contain an UAG codon between the first potential start codon and the following stop codon. Such iORFs are described by the following grammar in Extended Backus-Naur Form [16],

<iORF> ::= < start > <not-stop > * < amber > <not-stop > * < stop>

<start> ::= TTG | CTG | ATT | ATC | ATA | ATG | GTG

<stop> ::= TAA | TAG | TGA

<amber> ::= TAG

<regular> ::= AAA | … | TTT //all codons except those in < start > and < stop>

<not-stop> ::= < start > | < regular>

An iORF is any subsequence of nucleotides specified by the grammar in either the sense strand or the reverse complemented anti-sense strand. The star notation in the first rule indicates a repetition (zero or more times), and the vertical bar is used for alternatives.

We extract only iORFs that have at least 100 bases downstream the UAG. This ensures that the iORF can accommodate a PYLIS structure. An obvious consequence of this restriction is that we do not consider iORFs where the PYLIS structure could possibly extend beyond the stop codon or where a hypothetical PYLIS structure occurs upstream the amber codon.

Reciprocal blast

Presumably, the PYLIS structure region is subject to purifying selective pressure due to both its protein function and the possible importance of a putative structure.

We identify homologous putative PYLIS sequences (100 bp downstream the UAG) conserved at amino acid level using a reciprocal blast search. For each iORF, we translate the region 100 bp downstream the UAG to its amino acid sequence. Then, using tblastn with an e-value threshold of 10-6 we search for this amino acid sequence in all the candidate genomes. We disregard hits to the query iORF itself and also hits to non-iORF regions.

The result of this search is a set of pairwise matches between some of the iORFs. Transitively, matched iORFs form connected clusters of similar hits. If there is a blast match between two iORFs, then they are members of the same cluster. An iORF can only be member of one cluster. In graph theoretical terms, an undirected graph is formed such that iORFs constitute nodes and matches between iORFs constitute edges. The clusters are then those connected components of the graph that include more than one node.

We disregard clusters that include an iORF where a) the region 100 bp downstream is completely overlapped by a known genea in a different reading frame or b) the region partially overlaps a known functional RNA gene.

In particular, requirement a) excludes a lot of shadow ORFs that may arise when a true protein sequence in a different reading frame is conserved. Since we only consider overlaps with genes in different reading frames, iORFs for known pyrrolysine incorporating genes are not eliminated since those will be in the same reading frame as the iORF. It happens that such genes are erroneously annotated as two genes in RefSeqb, where the first one uses the UAG as stop codon, but these are still in the same reading frame as the iORF.

After these pruning steps, 1789 clusters remain.

Feature extraction

Coding potential is a measure of the likelihood that a stretch of DNA may encode a protein. Protein coding genes exhibit a non-random sequence of codons that turn out to be a strong indicator of coding potential. Many contemporary gene finders use variants of Hidden Markov Models (HMM) to statistically model the codon sequences of genes. We apply an HMM in which the hidden states correspond to amino acids. These hidden states can emit the codons that encode the amino acid with distinct probabilities [17]. Additionally the HMM incorporates length modeling of genes. As an adaptation to be able to model iORFs as well as usual ORFs, the states may emit the UAG codon (with no effect on probability) in addition to the usual probabilistic codon emission. This adaptation means that we are able to train the model on non-pyrrolysine incorporating genes, but are able to decode also on iORFsc.

We train the model on the RefSeq annotated genes of each genome resulting in maximum-likelihood parameters Θ g for each genome g.

The trained models are used assign a probability to each iORF i from a genome g. In effect, the probability reflects how much the iORF i resembles the known genes of the genome g in sequence composition and length. A log-odds ratio is calculated using the probability score of a nucleotide sequence model modelnull as null model, which assumes that all nucleotides occur with the same frequency:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M1">View MathML</a>

We define the coding potential of a cluster ω of size n to be the average of iORF coding potential scores within a cluster,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M2">View MathML</a>

Only 958 of the initial 1789 have <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M3">View MathML</a>. We only consider these 958 clusters for further investigation. The number of homologues may be indicative of functional importance. We define a feature <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M4">View MathML</a> that measures the number of hits in a cluster, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M5">View MathML</a>. The <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M6">View MathML</a> feature does not distinguish between homologues within or across species. Since conservation among species is a stronger indicator of important function, we also define a feature <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M7">View MathML</a> which is the number of unique organisms present in a cluster ω.

We expect clusters that contain real PYLIS regions to be relatively more diverse in their nucleic sequence than in their amino acid sequence, whereas this may not be the case for spurious hits. On the other hand, primary sequence variation can have a degrading effect on protein function and for homologue genes within the same organism, the variation may be minimal. We model diversity using the fdiversity feature, which is calculated as the average distance between the PYLIS regions of all n iORFs in a cluster,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M8">View MathML</a>

where spyl and tpyl are the regions 100 bp downstream of the in-frame UAG of s and t, respectively. distm(spyl, tpyl) is the edit distance — the smallest number of insertions, deletions or mutations needed to transform spyl into tpyl — disallowing gaps that are not in multiples of m. Note that DISTm is symmetric, i.e., DISTm(a, b) = DISTm(b, a), but for convenience of notation, the feature includes all pairwise distances.

Since primary sequence variation may have a degrading effect on a protein, we also consider the average number of synonymous codons, fsyn_codons, defined as,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M9">View MathML</a>

where, s'pyl and t'pyl are the amino acid sequences translated from spyl and tpyl.

Note that fdiversity and fsyn_codons are inversely correlated except in cases where diversity is preferen-tially in third codon position such that it leads to synonymous codons.

The iORF extraction step ensures that iORFs have at least 100 bases downstream an in-frame UAG. In many clusters, however, iORFs have only a few bases upstream the UAG and a start codon just upstream the UAG. In such short upstream regions it becomes more likely that the UAG and upstream start codon occur due to chance. To address this we define the features fupstream and fdownstream,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M10">View MathML</a>

where ‖istart … iuag‖ is the distance in nucleotides from the start codon to the in-frame UAG codon and

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M11">View MathML</a>

where ‖iuag … istop‖ is the distance in nucleotides from the UAG codon to the stop codon.

Assuming that the structure of the PYLIS region may be important, we model structural similarity within a cluster ω with the feature <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M12">View MathML</a> defined as follows. We measure similarity based on alignment of base-pairing probabilities of the sequences, which is independent of any predicted structure. We compute the base-pairing probabilities using RNAfold [18] and align these using pmcomp [19] with default settings. The fstructure score is the average pmcomp score for each pair of pylis regions in a cluster,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M13">View MathML</a>

Normalization

We normalize features to the interval [0, 1]; <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M14">View MathML</a> is the normalized value for the j’th feature in the i’th cluster, defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M15">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M16">View MathML</a> is the value for the j'th feature for the i'th cluster, min(fj) is the minimum value for feature in any cluster and vice versa for max(fj).

Complex features

In addition to the basic features, we derive two combined features based on our intuition and on observed correlations (see Figure 2). fcoding and fupstream are inversely correlated in general, but positively correlated for the known clusters. The negative correlation occurs, e.g., in the case of an iORF with a long gene just downstream the UAG (high fcoding) but a only a few bases upstream the UAG. The correlation effect observed between fcoding and fdownstream is less pronounced. This motivates the addition of a combined feature, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M17">View MathML</a>, which is the geometric mean of <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M18">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M19">View MathML</a>.

thumbnailFigure 2. Correlations between cluster features. Each cluster is represented as a point in each of the panels. Each panel shows the correlation between a pair of feature values. Known pyrrolysine gene clusters are marked with colors: mttB (red), mtbB (orange), mtmB (yellow), transposase1 (green), transposase2 (blue) and TetR (purple). Unknown clusters are white. Feature combinations with discriminatory potential display significant separation between the bulk of white clusters (the majority of which are false positives) and the colored clusters (true positives). A separation is apparent in panel 4.3 which shows the combination of the fcoding and fupstream features.

Structural similarity may arise by chance. In the absence of sequence diversity, it is diffcult to judge the significance of a structurally similar cluster. To penalize diversity due to overhangs in blast hits and diversity that has a degrading effect on the amino acid sequence, we also take into account the number of synonymous codons. This leads to a combined feature <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M20">View MathML</a>.

Feature significance

For each of the features we calculate a p-value to assess its statistical significance. We obtain p-values in the following way: We sample without replacement 105 means of n random ranks out of the 958 possible, where n is the number of clusters containing known pyrrolysine-incorporating genesd. Building on the central limit theorem which guarantees that the distribution of means is normal, we fit the sampled means to a normal distribution (μ ≈ 479, σ ≈ 112). p-values – probabilities of getting a mean rank at least as low as a given mean rank by chance – can be calculated from the estimated normal distribution by integrating the area below the given mean rank (cumulative density function)e.

We calculate the p-value for the mean rank — when ranked according to each feature — of the clusters known to include pyrrolysine incorporating genes.

Features for which the null hypothesis cannot be rejected (p > 0:05) are not used for the final ranking, i.e., we use a subset of features indexed by h: p(rank(fh)) > 0:05.

The p-values of each feature are reported in Table 2.

Table 2. Cluster ranking

Comparison of prediction results

Based on significant normalized feature values <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M23">View MathML</a> we calculate a combined cluster score which is used for ranking,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M24">View MathML</a>

where wh is a unique weight associated with feature <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M25">View MathML</a>. We estimate weights for each feature using gradient descent to minimize the sum of ranks of positive examples,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M26">View MathML</a>

The set of positive examples, denoted E+, are clusters that include known pyrrolysine incorporating genes. There are six of these: mtbB, mttB, mtmB, two clusters with transposases only annotated in M. acetivorans, and a transcriptional regulator of the TetR family also only annotated in M. acetivorans. In the genomes we consider there are five other (RefSeq) annotated genes with in-frame UAGs, but these are not conserved and as result, they are not present in reciprocal blast clusters.

The ranking scheme is based on a simple a linear combination of features, where the weights are estimated by regression over the rankings of the known positive examples. It is possible to devise a more precise but complex ranking function, but we have opted for this simple scheme. We have done so because we only have a few positive examples, and there is a large potential for overfitting a more complex function. With the few positive examples available, we have no real means of doing cross-validation and even this simple function may slightly overfit. With the discovery of additional pyrrolysine incorporating genes, the generality of the approach can improve.

The individual and combined rankings are shown in Table 2.

Hierarchical clustering of PYLIS structures

To assess evolutionary relationships between known pyrrolysine incorporating genes we group the PYLIS regions of these genes into clusters that are similar in structure as follows. We consider all genes that are members of the reciprocal blast clusters which include RefSeq annotated genes. We perform a hierarchical clustering of the genes using a form of neighbor joining (rapidNJ [20]) resulting in a phylogenetic tree that depicts structural conservation relationships. The clustering method relies on a distance measure between a pair of sequences, e.g., derived from a structural alignment. Our clustering does not rely on alignment of predicted structures. Instead, our distance measure is calculated using PMCOMP [19] which is based on alignment of base-pairing probabilities of the sequences. We compute the base-pairing probabilities using the RNAFOLD tool. The distance between two sequences is the inverse of the alignment score, scoreA-B, from PMCOMP:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M27">View MathML</a>

The resulting dendrogram of PYLIS regions is shown in Figure 3.

thumbnailFigure 3. Structural clustering of PYLIS regions from iORFs in the clusters for known pyrrolysine incorporating genes. The figure shows that there are certain structural groupings which roughly correspond to gene fami-lies. Almost all transposase genes are in the green cluster (26 sequences), but the cluster also contains sequences from the mtmB (5 sequences), mttB (2 sequences) and mtbB (2 sequences) families. This is also a quite diverse cluster in terms of maximal distance between elements. The orange cluster (12 sequences) predominantly contains mttB genes (8 sequences), but also includes the TetR genes (2 sequences), a single transposase1 gene and a single mtmB gene. The purple cluster (9 sequences) is a mix of mttB genes (4 sequences) and mtmB genes (4 sequences), but includes a single spurious transposase gene. The red cluster (12 sequences) is predominantly mtmB (8 sequences), but includes also two mttB genes and two mtbB genes. The blue cluster (15 sequences) mostly contains mtbB genes (13 sequences) and includes a distant sub-cluster having a mttB gene and a transposase1 gene.

Results and discussion

To ensure generalization capability and to minimize model complexity, we systematically assess the ranking features using a criterion of statistical significance. In effect, this assessment leads to a deeper understanding of the factors influencing pyrrolysine translation. We inspect and discuss the list of candidate genes ranked by our method and we discuss structural conservation for known pyrrolysine incorporating genes.

Ranking of gene clusters

Our method to automatically identify pyrrolysine coding candidate genes is unique in that it utilizes both coding potential, structural conservation, and amino acid conservation. Additionally, we take into account the number of organisms with homologous PYLIS regions as well as the length of up and downstream the potential in-frame UAG.

The ranking is systematically modeled from the known genes taking several factors into account and weighs the different features with respect to the distance of known genes using pyrrolysine (Table 2).

The method has some limitations due to our assumptions. We assume that the PYLIS region is comprised of 100 bases downstream the UAG and that it is well-conserved due to the presumptive presence of a PYLIS structure. Our method cannot detect pyrrolysine containing genes that have divergent PYLIS regions with no significant conservation in homologues and it cannot predict genes with only non-pyrrolysine incorporating homologues.

Our approach is similar to an earlier approach called read-through Similarity Analysis [9]. As in our approach, the authors extract iORFs from candidate genomes and perform a reciprocal blast analysis. The query is a 100 bp window pivoting around the read-through codon. In the case of pyrrolysine incorporating genes this means that the downstream region is shorter than in our case and may not hold the entire PYLIS structure. They calculate an alignment score for the region downstream the read-through codon and a measure of statistical significance by aligning shuffled sequences. Hits with sufficiently high significance are examined further. These hits are expanded using psi-blast and manually checked for non read-through codons lining up with read-though codons. Their driving assumption is that read-through genes will have non read-through homologues.

They identify 34 methyl transferases some of which do not contain UAG in four archaea species (M. acetivorans, M. burtonii, M. barkeri and M. mazei). We predict only 26 methyl transferases in these genomes. This may be because we enforce more conservative requirements for iORFs and clusters. On the other hand, [9] found only 29 pyrrolysine containing genes among the M. acetivorans and M. mazei while we found 34 among the same two speciesf. A detailed gene by gene comparison is difficult, since their annotations were made on draft genomes and they identify predictions only in terms of their homologues.

Another approach sets out to discover unknown amino acids by conservation of iORFs [10]. The approach is also capable of detecting pyrrolysine incorporating genes. It also begins with iORF extraction and uses blast to detect homologous sequences. The authors use a query (80 bp) centered around the stop codon. Similar to our approach, the blast search results in a number of clusters which is then reduced by pruning rules. Unlike our approach, they only examine clusters with interspecies matches. To distinguish adjacent genes, they exploit the synteny, by looking for blast hits to the N-terminal and the C-terminal of candidates in other genomes. If there are distinct but closely arranged hits in other genomes, this is taken to indicate evidence of adjacent genes rather than a single pyrrolysine incorporating gene. Furthermore, they filter iORFs clusters based on purifying selection, i.e. they prune hits with significantly high incidence of non-synonymous codon usage in the sequences flanking the read-through codon.

As in our approach, they identify clusters for all three methyl transferase families and also a cluster for the TetR family. They do not detect the transposase genes. The methyl transferase clusters they identify have considerably fewer members than what we find. They detect 7 mtmB genes (we detect 19), 7 mtbB genes (we detect 18) and 6 mttB genes (we detect 18). This difference is in part due to the fact that we include more recently published genomes in our search.

However, unlike [9] we do not require non-pyrrolysine homologues in the methyltransferase prediction, and unlike [10] our method can detect genes with only paralogoue conservation within a species. This is reasonable, since some annotated pyrrolysine containing genes are only found in several copies in the same genome.

Candidate genes

As a result of our method, we obtain a ranking of the clusters in which the known pyrrolysine incorporating genes occur within the top 25 clusters. The ranked list is supplied as Additional file 1. Several other high ranking clusters seem to be false positives, but there are also some interesting candidates.

Additional file 1. Annotated list of ranked clusters. The file, supplied as an Excel sheet, includes 25 top-ranked clusters with our manual annotations.

Format: XLSX Size: 23KB Download fileOpen Data

• An example, which is probably a false positive, is the cluster with rank 11. Although it is conserved in three organisms and is on average 711 bases long, it has the problem that in T. potens it overlaps with a much longer gene in another reading frame. Either the long gene (CRISPR-assosiated protein Cas2) is wrongly annotated, or the short one (CRISPR-associated protein Cas1) “shadows” by incidence the long one. However, in the two other organisms, D. hafniense and D. orientis, the possibility of a direct translation of UAG is more probable since the two annotated genes are in same reading frame and within the predicted iORF.

• An example of a probable gene participating in neutral evolution (drift) is the cluster with rank 7. It consists of only two T. potens genes that are almost identical. However, they are both long and the genes flank another gene (putative anaerobic sulfite) which is in the same reading frame. What has probably happened is that a random mutation has introduced a stop-codon. However, since it is an UAG-codon and the introduction of pyrrolysine does not obstruct the protein, the mutation is conserved in the organism, although it might not introduce an extra advantage. (A similar neutral evolution is believed to occur in known genes of M. acetivorans as in clusters 19 and 23 [21]). This type of pyrrolysine usage does usually not create completely new genes with new functions.

• In cluster 16, a less neutral selection is probable in the evolution towards using UAG in-frame, since it occurs in three different Archaea. The organisms are closely related and the iORF covers a gene (sensory transduction histidine kinase) and a hypothetical protein in the same reading frame in M. acetivorans, a full pseudogene in M. mahii and a shadow part of a sensory transduction histidine kinase in M. barkeri. Likewise, the three methyl transferases are products of selection towards a function of producing methane.

All candidates need to be verified experimentally before they can be determined as pyrrolysine encoding. In addition to this, the clusters containing the three known methyl transferases include several genes annotated as pseudogenes, as two genes, or not at all. Only 21 of the 46 methyl transferases in the 12 investigated organisms are annotated correctly, i.e., as one non-pseudo gene with an in-frame UAG stop codon. These are likely deficiencies in the existing annotations and should be corrected or included after further investigation.

Structural conservation and the evolution of genes

Our assessment of ranking features indicates that the majority of clusters have a degree of potential structural similarity that is comparable to the clusters for known pyrrolysine genes. Consequently, neither the fstructure feature nor the complex <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M28">View MathML</a> feature is adequate to recover all the known genes. The known genes seem to have either high structural similarity and low primary sequence diversity (as is the case for the two transposases) or they have low structural similarity but high primary sequence diversity. If we instead consider only the three methyl transferase clusters when calculating feature significance, then the p-value for <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M29">View MathML</a> is 0:012 which is significant within the 0:05 limitf. This may suggest that the methyl transferases have important structures although the statistical significance does not necessarily imply biological relevance. It is difficult to say if there is an important structure in the transposases because of the high sequence similarity.

Note that a difference in structures of different clusters does not affect ranking using <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/118/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/118/mathml/M30">View MathML</a> or fstructure. The fstructure feature does not imply a particular structure for a cluster – it only implies the presence of a high degree of structural similarity within the cluster. Neither of these two features were used to produce the final ranking (see Table 2), which are based only on features that are significant when considering all six known pyrrolysine genes.

The structural clustering of the PYLIS sequences of known pyrrolysine incorporating genes (see Figure 3) reveals relatively compact clusters that roughly correspond to the gene families. It is possible to observe general trends of the clustering, but chance similarities distort the clustering accuracy to an undesired degree and the presence of outliers should not be given too much attention. It is, however, clear that the UAG downstream regions of pyrrolysine incorporating genes do not all share similar structures. There are distinct sub-groupings that may correspond to distinct structures and hence it seems that there are several possible PYLIS structures.

It is possible to create a relatively canonical secondary structure for the PYLIS region of mtmB and mtbB. The mtmB structure has similarities to the previously predicted PYLIS structure [Rfam:RF01982]g, but the mtbB structure does not share these similarities. The same region in mttB does not show any sign of common structure. However, since mttB is the gene present in most different organisms and has a high degree of sequence diversity, an elevated variance in the structure is possible. See Figure 4 for predicted consensus structures. The set of predictions from all methods used are included as Additional file 2.

thumbnailFigure 4. Predicted consensus structures for each of the known pyrrolysine incorporating clusters. The consensus sequences are using IUPAC Ambiguity Codes. The consensus structures are predicted through the WAR web service [22] and are based on a variety of methods for structure predictions and alignments integrated through the WAR web service ([23-36]). For each structure, the average pairwise identity, free energy, covariance, base pair probability and percentage canonical base pairs are reported. The consensus structures suggested are not necessarily the correct structures, but reasonable (usually conservative) approximations. The different methods integrated in WAR predict slightly different structures, but the similarities between these are reflected in the consensus structure. It is clear that there are several different structures, but also that mttB does not seem to have any significant structure.

Additional file 2. Complete set of alignments and predicted structures. This zip file contains a subdirectory for each of the six clusters for known pyrrolysine incorporating genes, where alignments and predicted structures using all methods supported by the WAR web-service are available in fasta format.

Format: ZIP Size: 58KB Download fileOpen Data

A search using Infernal reveals that using the predicted structures has a positive impact on the recall and precision of detecting the pyrrolysine containing genes mtmB and mtbB, when compared to searching without the predicted structure (data not shown).

Our findings support that pyrrolysine has different ways to evolve in the genomes containing the tRNApyl as suggested in [21]. For the methyl transferases, a selection for producing methane may have conserved the structures as well as the amino acid residue. In other cases, a neutral evolution is believed to have occured, allowing for a single mutation leading to an in-frame amber stop codon [21]. In our list of candidate genes, a few high ranking candidates had multiple homologues of the UAG in-frame codon either within only one genome or among a few only. These genes are typically conserved within a given species and have even accomplished duplication events. However, as the mutation is not conserved among different species, these may be relatively recent adaptations.

These different gene evolution models constitute a challenge for the approach to select the clusters that represent true positives among clusters of iORFs.

Conclusions

In this work, we presented a method for predicting pyrrolysine coding genes. The method clusters genes with homologue sequences downstream the in-frame UAG. The clusters are ranked according to observed properties of existing homologous pyrrolysine incorporating genes so that top ranking candidates correspond to known pyrrolysine incorporating gene families or to promising new candidates.

Our method is successful in recovering conserved pyrrolysine containing genes and additionally detects several promising candidates that are not currently annotated. We provide a ranked list of potential pyrroly-sine coding gene candidates.

In addition, our method provides insights into the features that characterise pyrrolysine incorporating genes. We find evidence of conserved structures only within mtmB and mtbB and provide substantiation to suggest that pyrrolysine genes may also arise due to neutral evolution.

Endnotes

aRefSeq annotated genes, except genes where one of the words “pseudo, predicted, putative, unknown, possible, hypothetical or probable” occur in the gene product description.

bThis is prevalent for instance in M. mazei.

cThe adaptation corresponds to removing the in-frame UAG codon before decoding.

dThere are six of clusters that contain known pyrrolysine-incorporating genes.

eWe calculate cumulative density function using the pnorm command in R.

fThe species compared were chosen by [9]

gP-values are calculated using the same procedure as used in section on feature significance, but with n = 3 and considering only the main rank of the three methyl transferase clusters.

h The RF01982 structure was also only predicted on the basis on mtmB genes. Our consensus structure prediction does not include all the base pairings from RF01982, but mostly agree on the tip of the hairpin. Predictions with MXSCARNA [23] and PETFOLD [37] agree on most base-pairings and have lower free energy scores than the consensus structure.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CTH designed and wrote the pipeline, collected the data, performed the data analysis, and participated in the drafting of the manuscript. SZ participated in designing the pipeline, participated in the data analysis, performed the manual analysis of candidate genes clusters and participated in the drafting of the manuscript. HC supervised and financed the study, participated in designing the pipeline, and participated in the drafting of the manuscript. All authors approved the final manuscript.

Acknowledgements

We would like to thank Anders Krogh and Ole Skovgaard for their valuable feedback on an early draft of this paper. This work is part of the project “Logic-statistic modeling and analysis of biological sequence data” funded by the NABIIT program under the Danish Strategic Research Council.

References

  1. Xu Y, Gogarten JP (Eds): Computational Methods for Understanding Bacterial and Archaeal Genomes, Volume 7. : Imperial College Press; 2008. OpenURL

  2. Zhang Y, Baranov PV, Atkins JF, Gladyshev VN: Pyrrolysine and Selenocysteine use dissimilar De-coding strategies.

    J Biol Chem 2005, 280(21):20740-20751. PubMed Abstract | Publisher Full Text OpenURL

  3. Longstaff DG, Blight SK, Zhang L, Green-Church KB, Krzycki JA: In vivo contextual requirements for UAG translation as pyrrolysine.

    Mol Microbiol 2007, 63:229-241. PubMed Abstract | Publisher Full Text OpenURL

  4. Krzycki JA: Function of genetically encoded pyrrolysine in corrinoid-dependent methylamine methyltransferases.

    Curr Opin Chem Biol 2004, 8(5):484-491. PubMed Abstract | Publisher Full Text OpenURL

  5. Gaston MA, Zhang L, Green-Church KB, Krzycki JA: The complete biosynthesis of the genetically en-coded amino acid pyrrolysine from lysine.

    Nature 2011, 471(7340):647-650. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Atkins J, Gesteland R: The 22nd amino acid.

    Science (Washington) 2002, 296(5572):1409-1410. Publisher Full Text OpenURL

  7. Gaston MA, Jiang R, Krzycki JA: Functional context, biosynthesis, and genetic encoding of pyrroly-sine.

    Curr Opin Microbiol 2011, 14:342-349. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, Guigó R, Gladyshev VN: Characterization of mammalian selenoproteomes.

    Science (New York, N.Y.) 2003, 300(5624):1439-1443. Publisher Full Text OpenURL

  9. Chaudhuri B, Yeates T: A computational method to predict genetically encoded rare amino acids in proteins.

    Genome Biol 2005, 6(9):R79. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  10. Fujita M, Mihara H, Goto S, Esaki N, Kanehisa M: Mining prokaryotic genomes for unknown amino acids: a stop-codon-based approach.

    BMC Bioinformatics 2007, 8:225. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  11. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

    Nucleic Acids Res 1997, 25(17):3389-3402. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Thompson JD, Gibson TJ, Higgins DG: Multiple Sequence Alignment Using ClustalW and ClustalX.

    Curr Protoc Bioinformatics 2002., 2.3.1–2.3.22 OpenURL

  13. Bernhart S, Hofacker I, Will S, Gruber A, Stadler P: RNAalifold: improved consensus structure prediction for RNA alignments.

    BMC Bioinformatics 2008, 9:474. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  14. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments.

    Bioinformatics 2009, 25(10):1335-1337. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Abe T, Ikemura T, Sugahara J, Kanai A, Ohara Y, Uehara H, Kinouchi M, Kanaya S, Yamada Y, Muto A, Inokuchi H: tRNADB-CE 2011: tRNA gene database curated manually by experts.

    Nucleic Acids Res 2011, 39(suppl 1):D210-D213. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Wirth N: Extended Backus-Naur Form (EBNF).

    ISO/IEC 1996, 14977:2996. OpenURL

  17. Mørk S, Holmes I: Evaluating bacterial gene-finding HMM structures as probabilistic logic programs.

    Bioinformatics 2012, 28(5):636-642. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures.

    Monatshefte für Chemie/Chemical Monthly 1994, 125(2):167-188. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  19. Hofacker IL, Bernhart SHF, Stadler PF: Alignment of RNA base pairing probability matrices.

    Bioinformatics 2004, 20(14):2222-2227. PubMed Abstract | Publisher Full Text OpenURL

  20. Martin Simonsen TM, Pedersen CNS: Rapid Neighbour Joining. In Proceedings of the 8th Workshop in Algorithms in Bioinformatics (WABI), Volume LNBI 5251. : Springer Verlag; 2008:113-122. OpenURL

  21. Heinemann IU, O’Donoghue P, Madinger C, Benner J, Randau L, Noren CJ, Soll D: The appearance of pyrrolysine in tRNAHis guanylyltransferase by neutral evolution.

    Proc Natl Acad Sci USA 2009, 106(50):21103-21108. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Torarinsson E, Lindgreen S: WAR: Webserver for aligning structural RNAs.

    Nucleic Acids Res 2008, 36(suppl 2):W79-W84. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  23. Tabei Y, Kiryu H, Kin T, Asai K: A fast structural alignment method for long RNA sequences.

    BMC Bioinformatics 2008, 9(33): . PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  24. Bauer M, Klau GW, Reinert K: Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization.

    BMC Bioinformatics 2007, 8:271. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  25. Höchsmann M, Töller T, Giegerich R, Kurtz S: Local similarity of RNA secondary structures.

    Proc of the IEEE Bioinformatics Conference 2003, 159-168. OpenURL

  26. Hofacker IL, Fekete M, Stadler PF: Secondary Structure Prediction for Aligned RNA Sequences.

    J Mol Biol 2002, 319(5):1059-1066. PubMed Abstract | Publisher Full Text OpenURL

  27. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

    Nucleic Acids Res 2002, 30:3059-3066. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic context-free gram-mars.

    Nucleic Acids Res 2003, 31(13):3423-3428. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  29. Lindgreen S, Gardner PP, Krogh A: MASTR: Multiple alignment and structure prediction of non-coding RNAs using simulated annealing.

    Bioinformatics 2007, 23 (24):3304-3311. OpenURL

  30. Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for multiple sequence alignments.

    J Mol Biol 2000, 302:205-217. PubMed Abstract | Publisher Full Text OpenURL

  31. Reeder J, Giegerich R: Consensus shapes: An alternative to the Sankoff algorithm for RNA consensus structure prediction.

    Bioinformatics 2005, 21(17):3516-3523. PubMed Abstract | Publisher Full Text OpenURL

  32. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

    Nucleic Acids Res 1994, 22(22):4673-4680. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  33. Torarinsson E, Havgaard JH, Gorodkin J: Multiple structural alignment and clustering of RNA se-quences.

    Bioinformatics 2007, 23(8):926-932. PubMed Abstract | Publisher Full Text OpenURL

  34. Will S, Reiche K, Hofacker I, Stadler P, Backofen R: Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering.

    PLOS Comp Bio 2007, 3(4):e65. Publisher Full Text OpenURL

  35. Xu X, Ji Y, Stormo GD: RNA Sampler: A new sampling based algorithm for common RNA secondary structure prediction and structural alignment.

    Bioinformatics 2007, 23(15):1883-1891. PubMed Abstract | Publisher Full Text OpenURL

  36. Yao Z, Weinberg Z, Ruzzo WL: CMfinder - a covariance model based RNA motif finding algorithm.

    Bioinformatics 2006, 22(4):445-452. PubMed Abstract | Publisher Full Text OpenURL

  37. Seemann SE, Gorodkin J, Backofen R: Unifying evolutionary and thermodynamic informa-tion for RNA folding of multiple alignments.

    Nucleic Acids Res 2008, 36(20):6355-6362.

    [http://nar.oxfordjournals.org/content/36/20/6355.abstract webcite]

    PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL