McGill Centre for Bioinformatics, McGill University, 3775 University St., room 332, Montreal, QC, Canada H3A 2B4

Abstract

Background

Comparative genomics approaches, where orthologous DNA regions are compared and inter-species conserved regions are identified, have proven extremely powerful for identifying non-coding regulatory regions located in intergenic or intronic regions. However, non-coding functional elements can also be located within coding regions, as is common for exonic splicing enhancers, some transcription factor binding sites, and RNA secondary structure elements affecting mRNA stability, localization, or translation. Since these functional elements are located in regions that are themselves highly conserved because they code for a protein, they generally escape detection by comparative genomics approaches.

Results

We introduce a comparative genomics approach for detecting non-coding functional elements located within coding regions. Codon evolution is modeled as a mixture of codon substitution models, where each component of the mixture describes the evolution of codons under a specific type of coding selective pressure. We show how to compute the posterior distribution of the entropy and parsimony scores under this null model of codon evolution. The method is applied to a set of growth hormone 1 orthologous mRNA sequences and a known exonic splicing element is detected. The analysis of a set of

Conclusion

Non-coding functional elements, in particular those involved in post-transcriptional regulation, are likely to be much more prevalent than is currently known. With the numerous genome sequencing projects underway, comparative genomics approaches like that proposed here are likely to become increasingly powerful at detecting such elements.

Background

Vertebrate genomes are now recognized as containing a huge number of non-coding functional regions, a large fraction of which is likely to be involved in regulating the various steps of gene expression

The search for CRUNCS is more challenging. Although the same "conservation implies function" principle applies in this case, it needs to be used more cautiously. Indeed, CRUNCS are

The method suggested here takes a conservative approach to the problem. Given a set of aligned orthologous coding sequences, we first evaluate the degree of conservation of each column of the alignment, using either a parsimony score or an entropy score. We then put the burden of explaining the observed conservation, as much as possible, on the selective pressure acting on the protein product. Because most amino acids are encoded by several synonymous codons, amino acid selective pressure leaves room for some sequence variation. A region of the sequence will be predicted to be a CRUNCS only if the conservation observed cannot be explained solely by the need for conservation of the encoded amino acids. The method introduced here builds a mixture model of codon evolution, and then uses it as a null model to assess the significance of the observed degree of conservation. We illustrate our approach on two sets of orthologous vertebrate genes (growth hormone 1 and

Results and Discussion

Given a multiple alignment of orthologous mRNA sequences, our goal is to identify alignment columns that are conserved beyond what would be expected by chance if the corresponding sites were evolving only under the selective pressure on the amino acid they contribute to encode. Such sites are likely to be under non-coding selective pressure. This section, which constitutes the main contribution of this paper, is structured as follows. First, we define two commonly used sequence conservation scoring methods: the entropy and the parsimony score. We then describe a method for assigning a p-value to a given entropy or parsimony score, under null models of evolution of codons that are only under coding selective pressure. Under this method, we model codon evolution as a mixture of codon substitution models and use these models to assign a posterior p-value to a given conservation score.

Two measures of sequence conservation

A number of methods have been proposed to measure the degree of conservation of a set of orthologous sequences and to identify regions under selective pressure (see

Entropy

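In the area of transcription factor binding site detection, a popular method for evaluating sequence conservation is the entropy. Given the nucleotides $x_1, x_2, \ldots, x_n$ observed at one alignment column from $n$ species, the entropy is $H(x_1, x_2, \ldots, x_n) = -\sum_{\alpha} f_{\alpha} \log_2(f_{\alpha})$, where $f_{\alpha}$ is the relative frequency of nucleotide $\alpha$ among $x_1, x_2, \ldots, x_n$.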
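
As an illustration, the entropy of a single alignment column can be computed directly from its nucleotide counts. The following is a minimal sketch; the function name `column_entropy` and the string representation of a column are assumptions made for this example, not part of the original implementation.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy, in bits, of the nucleotides in one alignment column.

    `column` is a string with one nucleotide per sequence (hypothetical input format).
    """
    counts = Counter(column)
    n = len(column)
    # Sum of -f * log2(f) over the nucleotides that actually occur in the column.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    for col in ("AAAAAA", "AAACCC", "ACGTAC"):
        print(col, column_entropy(col))
```

Low values indicate a well-conserved column; the score ignores which sequences carry which nucleotide, which is the limitation discussed next.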

Parsimony score

A major drawback of the entropy score is that it does not take into consideration the phylogenetic relationships among the sequences being compared, and indeed the method is mostly used for motif discovery within a single species. An alternative to the entropy score is the parsimony score _{1}, _{2}, ..., _{n }and a phylogenetic tree
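
To make the contrast with the entropy concrete, the sketch below computes a parsimony score for one column on a small binary tree. It uses Fitch's algorithm, which for unit substitution costs yields the same minimum number of substitutions as Sankoff's algorithm; the tree shape, the species names, and the nested-tuple tree encoding are all hypothetical choices for this example.

```python
def fitch(tree, column):
    """Return (candidate state set, minimum substitutions) for a subtree.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees;
    `column` maps each leaf name to its nucleotide.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra substitution needed here.
        return common, left_cost + right_cost
    # Otherwise one substitution is required on one of the two child branches.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, column):
    return fitch(tree, column)[1]

if __name__ == "__main__":
    # Hypothetical five-species tree.
    tree = ((("human", "chimp"), "mouse"), ("chicken", "fugu"))
    clustered = dict(human="A", chimp="A", mouse="A", chicken="C", fugu="C")
    scattered = dict(human="A", chimp="C", mouse="A", chicken="C", fugu="A")
    print(parsimony_score(tree, clustered), parsimony_score(tree, scattered))
```

The two example columns have identical nucleotide counts, hence identical entropy, yet they require different numbers of substitutions on the tree.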

Example of alignment columns where parsimony score and relative entropy differ greatly in their assessment of sequence conservation.

Conservation p-values under a mixture of codon models

In this section, we introduce a null model of coding sequence evolution that consists of a mixture of codon substitution models representing the evolution of codons that are under different types of

Mixture models for codon evolution

Different positions in a protein sequence are usually subject to different types of coding selective pressures. Some are constrained to have a specific amino acid (e.g. the active site in a zinc finger has to be a cysteine), while others are free to have any residue with some particular chemical properties (say, a hydrophobic residue), and still others are under little or no selective pressure at all. Selective pressure on amino acids translates into selective pressure on codons, which explains part of the sequence conservation observed at the mRNA level in coding exons. We describe a mixture model of amino acid evolution, and the corresponding mixture model of codon evolution. We derive a set of 50 amino acid substitution rate matrices

We learn amino acid functional categories using the Pfam database of amino acid sequence alignments of protein domains

The stationary amino acid distributions _{1}, _{2}, ..., _{50 }and class prior probability distribution

Mixture of codon substitution models based on amino acid functionality classes. (Top) Stationary distributions estimated from the Pfam database for the 50 functional classes. Each row corresponds to one class. The numbers on the right are the prior probability of each class. (Bottom) Two examples of codon rates matrices, where dark cells correspond to high substitution rates and light cells to low rates. The left matrix corresponds to a functional class that favors hydrophobic residues (ILV), while the right matrix comes from a class that favors the glycine amino acid (G).

Amino acid and codon substitution rate matrices

For each of the 50 classes above, an amino acid substitution rate matrix and a codon substitution rate matrix are derived. We first compute the probability that each Pfam alignment column belongs to each of the classes, and use these probabilities to estimate the probability of amino acid and codon transitions between human and mouse sequences. Rate matrices are then derived from these empirically estimated transition probability matrices. The detailed procedure is described in Methods.

Distribution of conservation scores

We now return to the problem of identifying regions under non-coding selective pressure in a multiple alignment of orthologous coding mRNA sequences _{1 }... _{m}, where _{i }is the triplet of alignment columns corresponding to the _{i}(_{i,p}(_{i}, we want to assess whether the conservation observed at position

To describe more formally our null model of sequence evolution, we need to introduce some notation. Let **Q **be some codon substitution rate matrix. The codon transition probability matrix for a branch (**P**_{(u,v) }= ^{λ(u,v)}^{Q }**Q**. These three parameters (**Q**) describe a process that generates random but related codons at the leaves of the tree **Q **at the root of
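
The relation between a rate matrix and the transition probability matrix of a branch, P = e^{λQ}, can be illustrated numerically. The sketch below is not the authors' implementation: it uses a plain scaling-and-squaring Taylor expansion, and it substitutes a 4 x 4 Jukes-Cantor nucleotide rate matrix for the codon rate matrices of the paper so that the result can be checked against a closed form. The function names and default parameters are assumptions of this example.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, branch_length, squarings=10, terms=12):
    """Approximate P = exp(branch_length * Q) by scaling and squaring."""
    n = len(q)
    scale = branch_length / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes m^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        # Undo the scaling: exp(A) = (exp(A / 2^s))^(2^s).
        result = mat_mul(result, result)
    return result

if __name__ == "__main__":
    # Toy Jukes-Cantor rate matrix with one expected substitution per unit time.
    jc = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
    p = transition_matrix(jc, 0.5)
    print(p[0][0], 0.25 + 0.75 * math.exp(-4.0 * 0.5 / 3.0))
```

The same routine applies unchanged to a 61 x 61 codon rate matrix, although a production implementation would more likely rely on an eigendecomposition computed once per rate matrix.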

We are interested in computing the distribution of the conservation score of a given position _{p}(1),_{p}(2), ..., _{p}(_{u }= (_{u}(A), _{u}(C), _{u}(G), _{u}(T)) be a random multivariable where _{u}(_{u }is only a function of the codons at the leaves of subtree(_{r }that yield an entropy score

We will show how to compute Pr[_{u }= (_{a}, _{c}, _{g}, _{t})|_{a}, _{c}, _{g}, _{t }and

Define (_{a}, _{c}, _{g}, _{t}) ⊕ (_{a}, _{c}, _{g}, _{t}) : = (_{a}_{a}, _{c}_{c}, _{g}_{g}, _{t}_{t}). Now, let _{u }= _{v }⊕ _{w}. We compute the desired conditional probabilities at node

Implementation optimizations and computational complexity analysis are given in Methods.
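
The bottom-up computation described in this section can be sketched as follows. This is a simplified illustration under explicit assumptions, not the program used in the paper: states are single nucleotides rather than codons, branch transition probabilities come from the Jukes-Cantor model rather than from a learned codon rate matrix, and a tree is encoded as a list of (child, branch length) pairs with `None` standing for a leaf. Each node keeps, for every state, a table mapping nucleotide count vectors to probabilities; the tables of the children are merged by adding count vectors.

```python
import math
from collections import defaultdict

NUCS = "ACGT"

def jc_prob(t):
    """Jukes-Cantor transition probabilities for a branch of length t (toy model)."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = (1.0 - same) / 3.0
    return {a: {b: (same if a == b else diff) for b in NUCS} for a in NUCS}

def add_counts(x, y):
    """Combine the count vectors of two disjoint sets of leaves."""
    return tuple(a + b for a, b in zip(x, y))

def count_tables(node):
    """Return {state at node: {count vector over the leaves below: probability}}."""
    if node is None:  # leaf: its count vector is determined by its own state
        return {s: {tuple(int(s == n) for n in NUCS): 1.0} for s in NUCS}
    tables = {s: {(0, 0, 0, 0): 1.0} for s in NUCS}
    for child, t in node:
        child_tables = count_tables(child)
        p = jc_prob(t)
        for s in NUCS:
            # Distribution of the child's count vector given state s at this node.
            below = defaultdict(float)
            for s2 in NUCS:
                for vec, pr in child_tables[s2].items():
                    below[vec] += p[s][s2] * pr
            # Children are independent given the state at this node.
            merged = defaultdict(float)
            for v1, p1 in tables[s].items():
                for v2, p2 in below.items():
                    merged[add_counts(v1, v2)] += p1 * p2
            tables[s] = dict(merged)
    return tables

def entropy(vec):
    n = sum(vec)
    return sum(-(c / n) * math.log2(c / n) for c in vec if c)

def entropy_pvalue(tree, observed_entropy, stationary=None):
    """Probability, under the toy null model, of an entropy at most the observed one."""
    stationary = stationary or {s: 0.25 for s in NUCS}
    total = 0.0
    for s, table in count_tables(tree).items():
        for vec, pr in table.items():
            if entropy(vec) <= observed_entropy + 1e-12:
                total += stationary[s] * pr
    return total

if __name__ == "__main__":
    tree = [([(None, 0.1), (None, 0.1)], 0.2), (None, 0.3)]  # hypothetical 3-leaf tree
    print(entropy_pvalue(tree, 0.0))
```

Only count vectors with non-zero probability are ever stored, which is what keeps the dictionaries small in practice.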

Distribution of parsimony scores

The method described in the previous section can be surprisingly easily modified to compute the conditional p-value of parsimony scores instead of that of the entropy score. We need to redefine the random variable _{u }= (_{a}, _{c}_{g}_{t}) so that _{α }is now the parsimony score obtained for the nucleotides at position

(_{a}, _{c}, _{g}, _{t}) ⊕ (_{a}, _{c}, _{g}, _{t}) = (min(_{a }+ _{a}, _{a }+ _{a }+ 1,

min(_{c }+ _{c}, _{c }+ _{c }+ 1,

min(_{g }+ _{g}, _{g }+ _{g }+ 1,

min(_{t }+ _{t}, _{t }+ _{t }+ 1,

where _{j≠i}_{j}. Notice that this is again in direct analogy to Sankoff's algorithm. Again, _{u }= _{v }⊕ _{w }and we get _{p}(1), (_{p},(2),..., _{p}(_{r}). With these redefinitions, Equation 2 holds without any modifications needed. We get
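
For readers who want to experiment with the redefined ⊕ operator, here is a small sketch of a Sankoff-style combination of two cost vectors with unit substitution costs. It is one reading that is consistent with the fragments of the definition above, written in the factored form min(s_i, s_other + 1) + min(t_i, t_other + 1), which expands to a four-term minimum. The names `s`, `t`, `combine_parsimony`, and `leaf_vector`, as well as the use of infinite costs at the leaves, are assumptions of this example.

```python
import math

NUCS = "ACGT"

def leaf_vector(nucleotide):
    """Cost vector of a leaf: 0 for the observed nucleotide, infinite otherwise."""
    return tuple(0 if n == nucleotide else math.inf for n in NUCS)

def combine_parsimony(s, t):
    """Cost vector of a node from the cost vectors s and t of its two children.

    Entry i is the minimum number of substitutions in the subtree when the node
    is labeled with nucleotide i, assuming every substitution costs 1.
    """
    out = []
    for i in range(4):
        s_other = min(s[j] for j in range(4) if j != i)
        t_other = min(t[j] for j in range(4) if j != i)
        # Each child either keeps nucleotide i or switches at a cost of 1.
        out.append(min(s[i], s_other + 1) + min(t[i], t_other + 1))
    return tuple(out)

if __name__ == "__main__":
    inner = combine_parsimony(leaf_vector("A"), leaf_vector("C"))
    root = combine_parsimony(inner, leaf_vector("A"))
    print(inner, root, min(root))  # the parsimony score is the minimum at the root
```

Replacing vector addition by this function in the table-merging step is the only change needed to obtain the distribution of parsimony scores instead of nucleotide counts.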

Posterior distributions of conservation scores

Having shown how to compute the p-value of a given entropy or parsimony score under a fixed codon rate matrix, it is simple to compute posterior p-values for the case where the functional class is not known in advance. Consider a given set of aligned codons _{i }= (_{i}(1), ..., _{i}(_{i }= (_{i}(1), ..., _{i}(_{i,k }= 1 if the site _{i,k}, given the observed amino acids at site

where Pr[_{i}(1), ..., _{i}(_{i,k }= 1, _{post}(

and similarly for parsimony scores p-values.
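
The idea of a posterior p-value can be sketched as a weighted average of per-class p-values, the weights being the posterior probabilities of the functional classes given the observed amino acids. The sketch below assumes that the likelihood of the observed amino acids under each class has already been computed (for instance by a standard phylogenetic likelihood computation); the function name and argument names are hypothetical, and the weighting scheme is our reading of the partially preserved derivation above.

```python
def posterior_pvalue(priors, likelihoods, class_pvalues):
    """Average the class-specific p-values, weighted by class posteriors.

    priors[k]        : prior probability of functional class k
    likelihoods[k]   : probability of the observed amino acids under class k
    class_pvalues[k] : conservation p-value computed with the rate matrix of class k
    """
    joint = [w * l for w, l in zip(priors, likelihoods)]
    total = sum(joint)
    posteriors = [j / total for j in joint]  # Bayes' rule
    return sum(q * p for q, p in zip(posteriors, class_pvalues))

if __name__ == "__main__":
    # Two hypothetical classes: a strongly constrained one and a relaxed one.
    print(posterior_pvalue([0.5, 0.5], [0.2, 0.6], [0.5, 0.01]))
```

A column whose amino acids are best explained by a strongly constrained class therefore inherits mostly the (larger) p-value computed under that class.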

Conditional p-values

An alternative to trying to guess the type of selective pressure under which a given codon evolves is to use a single codon rate matrix but subject to the constraint that the random codon generated at each leaf has to encode the amino acid that was actually observed at that leaf. This approach was originally proposed by Blanchette _{j}(_{i }surprising? Notice how, compared to the mixture model approach, this model transfers the responsibility of sequence conservation even more onto the shoulders of coding selection. See Figure

Example of codons where the posterior p-values under the mixture of codon models differ significantly from the p-values obtained with the conditional p-value method. Amino acids V, A, G, D, and L do not have any common properties, so, under the codon mixture model, the column is assigned to a functional class with little selective pressure, explaining why the p-value produced for the first codon position is small. In contrast, under the conditional p-value approach, the amino acids almost completely determine the first codon position (the only exception being the Fugu codon, encoding a leucine, which could have used a T at the first position), so a poor p-value is reported.

A sliding window approach

Until now, we have shown two ways to compute p-values for individual alignment columns. Since most non-coding functional elements are expected to span several consecutive positions (5-15 nt for transcription factor binding sites and exonic splicing enhancers, and up to a few hundred nucleotides for RNA secondary structure elements), we can improve the sensitivity of the method by using a simple sliding window approach. For each position $i$, we compute the product of the p-values in a window of width $w$ centered at $i$, $\pi_i = \prod_{j = i - \lfloor w/2 \rfloor}^{i + \lfloor w/2 \rfloor} p_j$, and then the p-value of this product: $\Pr[\Pi \le \pi_i] = \pi_i \left(1 + \sum_{j=1}^{w-1} \frac{(-\ln \pi_i)^j}{j!}\right)$.
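
The sliding window computation can be sketched in a few lines. The sketch relies on the standard result that the product of w independent Uniform(0,1) variables is at most x with probability x Σ_{j=0}^{w-1} (-ln x)^j / j!; independence of neighboring columns is an assumption of that result. The function names are hypothetical, the window width is assumed to be odd, and all input p-values are assumed to be strictly positive.

```python
import math

def product_pvalue(pvalues):
    """P-value of the product of w independent uniform p-values (all > 0)."""
    w = len(pvalues)
    log_prod = sum(math.log(p) for p in pvalues)  # ln of the product
    x = -log_prod
    term, total = 1.0, 1.0
    for j in range(1, w):
        term *= x / j          # term is now x^j / j!
        total += term
    return math.exp(log_prod) * total

def windowed_pvalues(pvalues, w):
    """Compounded p-value for every position that has a full window of odd width w."""
    half = w // 2
    return [product_pvalue(pvalues[i - half:i + half + 1])
            for i in range(half, len(pvalues) - half)]

if __name__ == "__main__":
    column_p = [0.8, 0.05, 0.02, 0.04, 0.6, 0.9]  # made-up per-column p-values
    print(windowed_pvalues(column_p, 3))
```

Working with the sum of logarithms rather than the raw product avoids underflow for wide windows.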

Implementation

The algorithms were implemented in C++ and the program is available upon request. A number of optimizations described in

Analysis of simulated data

We first verify that the p-values computed by our approach have the basic properties we would expect of them. To start, we confirm that sequences evolving under the null model obtain p-values that are approximately uniformly distributed. This would be a trivial statement if the functional category of each site were known, but in the absence of such prior knowledge, the uniformity of p-values under the null model is less obvious. To this end, we simulated the evolution of a 50 kb region of DNA, with each codon belonging to one of the 50 rate categories described above. Sequences were evolved along the branches of the 69-leaf phylogenetic tree derived from the GH1 data set described below. Figure

Distribution of the p-values obtained on a set of 50-kb sequences evolving according to the null model. The tree used for the simulation is the same as that for the GH1 dataset (see below).

Analysis of biological data

We illustrate our approach on two sets of orthologous mRNA sequences: a set of 69 vertebrate growth hormone 1 (

^{-3}. Although this region could have been identified on the basis of parsimony scores alone (gray curve on Figure

Dark curve: compounded parsimony p-values for the

Comparison of the posterior p-values obtained from the entropy and parsimony scores, under the mixture of codon models, on the

Comparison of the p-values obtained from the parsimony scores, under the mixture of codon models and under the conditional probability computation.

Finally, the analysis of the

Dark curve: compounded parsimony p-values for the

Conclusion

With the many genome sequencing projects rapidly producing vertebrate genomic data, comparative genomic approaches are becoming increasingly powerful. In the case of CRUNCS, additional data is often available in the form of ESTs and cDNAs. We believe that within a year or two, there will be sufficient data for accurate detection of CRUNCS in vertebrate genes, using methods like those described here. Once many CRUNCS have been detected, the next step will of course be to assign functions to these elements. Although the last word will remain with experimentalists, we are hopeful that more advanced bioinformatics approaches will yield insights into these questions.

Finally, we expect that organisms that are under severe genome size constraints, in particular bacteria and viruses, will use CRUNCS more often. We believe that our approach will prove particularly fruitful for analyzing these genomes.

Methods

Estimating amino acid stationary distributions for each class

The Pfam database consists of a set of multiple alignments of homologous protein domain sequences. For some domains, the database contains several sequences that are very closely related. To reduce biases due to this over-representation, one member of any pair of domain sequences that share more than 60% identity is discarded. Let _{1}, ..., _{m }be the set of alignment columns in this reduced Pfam database, let _{i }be the number of species in alignment column _{i}(_{i}(

For simplicity, we assume that the amino acids in a given column are drawn independently from the same unknown amino acid distribution. Let _{1}, _{2}, ..., _{50 }be a set of amino acid distributions, where _{k}(_{i}(_{i}|_{i}(

We search for the distributions _{1}, ..., _{50 }and prior probabilities _{1 }... _{m}] = ∏_{i = 1...m }Pr[_{i}] = ∏_{i = 1...m }∑_{k = 1..50 }Pr[_{i}|(

and

where _{i}(_{i }and where

Initialized with noisy uniform distributions, the algorithm converges quickly (fewer than 50 iterations) to the local optimum depicted in Figure
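
The estimation procedure can be illustrated with a generic expectation-maximization loop for a mixture of categorical distributions, in which the residues of a column are treated as independent draws from the distribution of the column's class. This is a sketch under stated assumptions rather than the authors' code: the update rules are the textbook EM updates, the small pseudocount and the 10% initialization noise are choices made for this example, and the toy data use a four-letter alphabet and two classes instead of 20 amino acids and 50 classes.

```python
import math
import random

def em_mixture(columns, alphabet, k, iterations=50, seed=0, pseudo=1e-3):
    """Fit k categorical distributions and class priors to alignment columns."""
    rng = random.Random(seed)
    dists = []
    for _ in range(k):
        # Noisy uniform initialization breaks the symmetry between classes.
        raw = [1.0 + 0.1 * rng.random() for _ in alphabet]
        z = sum(raw)
        dists.append({a: r / z for a, r in zip(alphabet, raw)})
    priors = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: posterior probability of each class for each column.
        resp = []
        for col in columns:
            logs = [math.log(priors[c]) + sum(math.log(dists[c][a]) for a in col)
                    for c in range(k)]
            top = max(logs)
            weights = [math.exp(l - top) for l in logs]
            z = sum(weights)
            resp.append([x / z for x in weights])
        # M-step: re-estimate priors and distributions from expected counts.
        n = len(columns)
        priors = [(sum(r[c] for r in resp) + pseudo) / (n + k * pseudo)
                  for c in range(k)]
        for c in range(k):
            counts = {a: pseudo for a in alphabet}
            for col, r in zip(columns, resp):
                for a in col:
                    counts[a] += r[c]
            z = sum(counts.values())
            dists[c] = {a: v / z for a, v in counts.items()}
    return priors, dists

if __name__ == "__main__":
    toy = ["IILLVV"] * 3 + ["GGGGGG"] * 3  # made-up columns
    priors, dists = em_mixture(toy, "ILVG", 2)
    print(priors)
    print(dists)
```

On the toy data one class ends up concentrated on glycine and the other on the hydrophobic residues, mirroring the kind of functional classes described in the Results.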

Estimating amino acid and codon rate matrices

Once the amino acid distributions _{1}, ..., _{50 }and prior distribution _{i}:

where _{i}(_{i}(_{i }(_{i}(

Finally, we estimate the instantaneous rate matrix _{(h,m)}, where _{(h,m) }is the expected number of substitutions per site between human and mouse, in neutrally evolving DNA. Codon rate matrices are obtained in a similar manner.

Note that the codon substitution rates we obtain slightly underestimate the true rates for sites evolving only under coding selective pressure, because some sequences in Pfam are likely to be evolving more slowly due to non-coding selective pressure. However, we expect this underestimation to be negligible, as the fraction of such sites is likely to be small. In any case, underestimating the rates can only make the conservation p-values more conservative.

Computing entropy and parsimony score p-values

The probabilities Pr[_{u }= (_{a}, _{c}, _{g}, _{t})|_{u}] are stored in a hash table associated to each node, indexed by the quintuplet (_{a}, _{c}, _{g}, _{t}, _{u}). Only non-zero probabilities are stored. To compute these probabilities for a node _{a}, _{c}, _{g}, _{t}, _{v}), (_{a}, _{c}, _{g}, _{t}, _{w}) of quintuplets from the hash tables of the two children, and add the proper quantity to the entry ((_{a}, _{c}, _{g}, _{t}) ⊕ (_{a}, _{c}, _{g}, _{t}), _{u}) of the hash table at _{u}. Please see

To study the complexity of the resulting algorithm for the entropy p-value computation, observe that the hash table associated with node ^{3}) non-zero entries and thus computing all entries of the hash table for node ^{3}^{3}) time. Depending on the topology of ^{4}) time for a completely skewed tree and ^{6}) time for a balanced tree. The time complexity analysis applies to both the posterior p-value and the conditional p-value computations. However, for posterior p-values, this computation needs to be repeated 50 times, once for each rate matrix.

The hash table implementation works well in the case of parsimony score p-value computation too, especially with the following optimizations. First, for binary trees, the only choices of _{a}, _{c}, _{g}, _{t }that may have non-zero probability are those where |_{i }– _{j}| ≤ 2 for all _{a}, _{c}, _{g}, _{t }such that _{a}, _{c}, _{g}, _{t}) > ^{2}), irrespective of the tree topology. More details can be found in

Conditional p-values

The method of conditional p-values, introduced by Blanchette

_{cond}(_{p}(1), ...,_{p}(_{i},_{p}(1), ..., _{i},_{p}(_{i}(l)), ..., _{i}(

This p-value can be computed in a manner similar to Equation 2. Let us denote by _{j∈l(u)}(_{i}(

and

where Pr[

Authors' contributions

Chen implemented the program for the posterior p-value computation, for learning amino acid functional classes, and for learning the rate matrices associated to each class, and did some of the data analysis. Blanchette was responsible for the original ideas and mathematical derivations, for some of the data analysis, and for writing the paper.

Supplementary data

Multiple alignments, phylogenetic trees and detailed results are available at

Acknowledgements

We would like to thank Martin Tompa, Saurabh Sinha, Adam Siepel, and David Haussler for useful discussions early in this project. We also thank the editor Hervé Philippe and one anonymous referee for their hard work on this manuscript and for their useful suggestions.

This article has been published as part of