Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA

Department of Mathematics and Computer Science, Denison University, Granville, OH 43023, USA

Department of Biology, Boston College, Chestnut Hill, MA 02467, USA

Abstract

Background

It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding proteins. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to this problem, which rely on codon shuffling, have been limited by their ability to explore only an infinitesimal fraction of the sequence space and by their use of parametric approximations.

Results

We present a novel O(N(log N)^{2})-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric approaches in coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP.

Conclusions

CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at bioinformatics.bc.edu/chuanglab/codingmotif.tar.

Background

Coding sequences have been shown to harbor numerous regulatory sites in their nucleotide sequences for functions such as RNA localization

High-throughput studies of both RNA and DNA have also shown evidence of functional sites in coding regions, indicating the need for computational methods to identify such sites. Some of the RNA studies include those showing binding of proteins or microRNAs to mRNA coding regions

Identifying functional motifs in coding sequences computationally has been challenging due to the lack of appropriate algorithms to separate nucleotide-level signals from those caused by the amino acid sequences. Here we use the term motif to refer to a short, possibly degenerate sequence element that may or may not be functional. Sequence conservation approaches that calibrate for the amino acids are one promising technique for identifying functional motifs, as it has been shown that conservation can detect exonic splicing, microRNA binding, and DNA replication-associated motifs

A few groups

Methods based on comparisons to empirical motif counts in control exonic (and sometimes intronic) sequences have also been developed

Algorithms that simply ignore the amino-acid sequence have been applied to coding sequences as well. For example Jambhekar et al

In this work we present a novel enumerative method, CodingMotif, to detect functional noncoding motifs in coding sequences, solving the problems associated with sampling approaches. The algorithm exactly calculates the distribution of a motif's occurrence frequency over all coding sequences that code for the amino acid sequence, given a null model of codon usage. This approach allows for exact evaluation of the overrepresentation or underrepresentation p-value for a motif in a sequence of any length N in O(N(log N)^{2}) time through a novel dynamic programming algorithm. We also describe how to speed up the calculation by taking advantage of motif sparseness. Importantly, the program also takes into account dinucleotide biases, which are built into the model through a codon-to-codon Markov process. We show that CodingMotif assesses motifs more accurately than sampling approaches in both eukaryotic and prokaryotic datasets.

Results and discussion

Independent codon model

As a first approach to the problem, we developed a motif overrepresentation algorithm based on an Independent Codon Model (ICM), in which our null assumption was that codons do not influence the codons at adjacent positions (see Methods). To determine the effectiveness of this assumption, we first analyzed k-mer strings ((A,G,C,T)^{k }) for overrepresentation in the coding sequences of mouse chromosome 19 (623,203 codons; 1331 coding sequences). Prior studies have focused on analyzing overrepresentation for k-mers as well

k-mer scores exhibited a strong bimodal behavior under the ICM null, with the vast majority of p-values close to either 0 or 1. The distribution of p-values for these k-mers is shown in Figure

**Distribution of motif overrepresentation p-values for mouse chr19 coding sequence with the Independent Codon Model null**. Three ICM p-value distributions are shown: the p-values for the original coding sequences; the p-values after shuffling synonymous codons across coding sequences; and the p-values after shuffling dicodons across coding sequences.

We hypothesized that the ICM null model may be inadequate for detecting motifs under selection because it ignores neutral dinucleotide mutation biases. To clarify the effect of dinucleotide biases, we shuffled the original coding sequences while maintaining dicodon frequencies (and consequently dinucleotide frequencies; see Methods) using the method of

Dinucleotide-corrected codon model

To handle this problem, we developed a method to calculate the motif frequency distribution that would be generated by a null model that includes dinucleotide biases. The algorithm uses as its null a Markov model that closely preserves the expected codon usage and dinucleotide frequencies in the reference sequence. We refer to this as the dinucleotide-corrected codon model (DCM). Full details of the DCM are given in the Methods.

If each amino acid had only one possible first nucleotide for the underlying codon, then the expected dinucleotide and codon usage in the DCM null model would be exactly equal to those of the reference sequence (see Methods for proof). However, the true genetic code deviates slightly from this behavior (Arginine, Leucine, and Serine can have two possible first nucleotides). To determine how well the DCM preserves dinucleotide and codon usage, we generated a sequence using the DCM Markov model and compared its properties to those of the reference sequence. Figure

**Comparison of dinucleotide usage under different null models**. The dinucleotide usage of sequences generated by the DCM Markov model (black) and the dinucleotide usage of the original data (white) exhibit Pearson correlation r = 0.9999, in comparison to correlation r = 0.9580 between ICM-generated dinucleotide usage and that of the original sequences. The largest discrepancy is for CpG dinucleotides, for which the ICM-generated frequency is 1.60 times that in the original data. For the DCM-generated sequences, the CpG frequency is 1.0008 times that in the original data.

Preservation of dinucleotide usage inherently implies preservation of codon usage, as shown by the following argument. Define

where

Would it be better to use a higher order Markov model for the null? 5th order cyclic Markov models are used commonly in gene-finding algorithms, which would suggest they might be appropriate for a null model in motif finding. However, these models were chosen to be 5th order because hexamers were shown to be good for discriminating protein-coding and non-coding regions

The AA/dinucleotide null has the advantages of being straightforwardly interpretable and of being the lowest order model that accounts for both AA effects and dinucleotide mutation biases. Our emphasis on dinucleotide effects is reasonable because, in many genomes, by far the strongest neutral cause of base-base correlations is the CpG effect, which is known to act on 2 bases at a time

Time scaling

The CodingMotif algorithm takes as input a motif and a set of L sequences, corresponding to the coding regions to be analyzed with total sequence length N. The calculation has two parts: first the motif count distribution for each sequence is determined, and second these distributions are combined into a single distribution. Each of these parts is analyzed in turn.

Determination of the distribution for each sequence is governed by the induction relation (equation 4). Equation 4 calculates a new distribution, conditioned on the trailing codons α_{k-Δ+3} ... α_{k+1}, by adding contributions from at most 6 previously calculated distributions (as there are at most 6 codons compatible with a given amino acid). This calculation is performed for all possible values of α_{k-Δ+3} ... α_{k+1}, yielding at most 6^{Δ} calculations during each stage of the induction. The number of basic operations each induction step requires depends directly on the size of the distribution, which is stored as an array. The size of the distribution is determined by the maximum number of motif occurrences, which is very conservatively bounded by the length of the sequence. Since the number of induction steps equals the length of the sequence, an upper bound for the steps required to calculate the distribution of one sequence is 6^{Δ} times the square of its length. We need to do this calculation for all

In practice, even within a single coding sequence we frequently observe sections where no copies of a given motif can possibly occur, due to the structure of the genetic code. These sections break each sequence into much smaller subsequences for which we can calculate the distribution independently, while we can ignore the sections where a motif is forbidden. To see why these subsequences are short, consider a 6-mer motif and its potential occurrence within a stretch of 3 codons. Each of these codons has at most 6-fold degeneracy, so there can be at most 6^{3} = 216 possible DNA sequences consistent with the given amino acids. If the 6-mer occurs within the three codons, it may occupy positions 1-6, 2-7, 3-8, or 4-9. At most 216 · 4 = 864 distinct 6-mers may therefore occur within this three codon stretch, while there are 4^{6} = 4096 possible 6-mer motifs. So at least 1 - 864/4096 ≈ 79% of 6-mers are forbidden within any three codon stretch. Consequently, regions where a motif is not forbidden will have an approximately geometrically decreasing length distribution. This leads to a much larger number of effective independent regions, each with a short length. We use these effective independent regions for the distribution function calculations, and this significantly improves the runtime of the algorithm (see Methods: Optimization for sparse motifs). The actual independent regions are a function of the motif, genetic code, and amino acid sequences, and in general there will be
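The counting argument for 6-mers in a three codon stretch can be checked numerically. The sketch below is illustrative only and is not part of the CodingMotif software: it recomputes the 864 bound and then enumerates the 6-mers that are actually possible for one example peptide. The helper names (`synonymous_codons`, `possible_kmers`) and the example peptides are assumptions made for this illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_codons(aa):
    """All codons that encode the amino acid `aa` (hypothetical helper)."""
    return [codon for codon, a in CODON_TO_AA.items() if a == aa]


def possible_kmers(peptide, k):
    """Distinct k-mers that occur in at least one coding sequence for `peptide`."""
    kmers = set()
    for codons in product(*(synonymous_codons(aa) for aa in peptide)):
        seq = "".join(codons)
        for start in range(len(seq) - k + 1):
            kmers.add(seq[start:start + k])
    return kmers


# Upper bound from the text: 6^3 sequences times 4 offsets within 9 bases.
upper_bound = 6 ** 3 * 4
total_6mers = 4 ** 6
min_forbidden_fraction = 1 - upper_bound / total_6mers

# Leu, Ser and Arg are the three 6-fold degenerate amino acids, so this
# peptide is a worst case for the number of synonymous sequences.
allowed = possible_kmers("LSR", 6)

print(upper_bound, total_6mers, min_forbidden_fraction)
print(len(allowed), "distinct 6-mers are possible for the peptide LSR")
```

Because several synonymous sequences share 6-mers, the enumerated count for a real peptide can only be at or below the 864 bound, so the forbidden fraction in practice is at least as large as the bound suggests.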

The step of combining the distributions for all independent regions into the overall distribution is rate-limiting. Denote the maximum possible number of motif occurrences in the complete sequence as ^{k }for some

The distributions will be combined from smallest to largest size. Consider the worst case scenario in which there are 2^{k} distributions of size 1. In the first stage we combine these into distributions of size 2. This involves 2^{k-1} pairs of distributions. In the next stage we combine 2^{k-2} pairs of distributions of size 2 into distributions of size 4. Continuing hierarchically, at each stage we combine 2^{k-l} pairs of distributions of size 2^{l-1} for l = 1, ..., k. Each pair can be combined in time O(2^{l-1} log(2^{l-1})) using the FFT procedure. The total calculation time is then given by

\sum_{l=1}^{k} 2^{k-l} \, O(2^{l-1} \log 2^{l-1}) = O(2^{k} k^{2})

So the time requirement for the program is O(2^{k} k^{2}) = O(N(log N)^{2}), which is far smaller than the exponentially large number of possible coding sequences.
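The smallest-to-largest combination order can be sketched in a few lines. This is an illustration under stated assumptions, not the CodingMotif implementation: each region's distribution is a plain list indexed by motif count, the function names are hypothetical, and a direct (term-by-term) convolution stands in for the FFT procedure to keep the example dependency-free.

```python
import heapq


def convolve(p, q):
    """Distribution of X + Y for independent counts X ~ p and Y ~ q.

    Direct O(len(p) * len(q)) convolution; an FFT-based routine could be
    substituted here without changing the result.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        if pi == 0.0:
            continue
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def combine_smallest_first(distributions):
    """Convolve all region distributions, always merging the two smallest."""
    if not distributions:
        return [1.0]  # no regions: zero motif copies with probability 1
    # Heap entries are (size, tie-breaker, distribution); the integer
    # tie-breaker keeps Python from comparing the lists themselves.
    heap = [(len(d), idx, d) for idx, d in enumerate(distributions)]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        merged = convolve(a, b)
        heapq.heappush(heap, (len(merged), next_id, merged))
        next_id += 1
    return heap[0][2]


# Four independent regions, each holding 0 or 1 motif copies with equal
# probability; the combined distribution is binomial(4, 0.5).
regions = [[0.5, 0.5] for _ in range(4)]
combined = combine_smallest_first(regions)
print(combined)
```

Merging the two smallest distributions first keeps the intermediate arrays as short as possible, which mirrors the hierarchical pairing used in the worst-case analysis.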

Tests of CodingMotif

Bacterial motifs

There are two relevant tests for CodingMotif, the first being its ability to more accurately detect over- and under- represented motifs relative to prior methods, and the second being its ability to identify biologically meaningful motifs. For the first type of test, we analyzed the coding sequences of the bacterium

Because we used a dataset identical to that of Robins et al, we were able to directly compare whether our exact approach gives better results than a finite sampling/z-score approach. Robins et al reported a set of 100 over- or under-represented motifs. Among their underrepresented motifs, we found 2 with very weak underrepresentation according to our exact method (underrepresentation p-values CCC: 0.54, CAGAT: 0.31). Moreover, 2 other motifs they call as underrepresented are in fact overrepresented in the data (overrepresentation p-value CTCC: 6e-4, CTGCTGG: 0.075). Among the 31 motifs they report to have unusually high occurrence frequencies, all 31 exhibited very low p-values according to CodingMotif as well (p < 10^{-8}). However, our exact method detected a total of 251 motifs of lengths between 3 and 7 that have overrepresentation p-values < 10^{-8}. These findings indicate that, even with a dataset as large as the coding regions in a bacterial genome, a sampling/z-score approach can have significant error rates, which in this dataset are mostly false negatives. The differences between our exact method and that of Robins et al are somewhat influenced by the lack of dinucleotide effects in the Robins et al null model. When we used an ICM null, which is more similar to the Robins et al null, we found that CodingMotif classifies the motifs CCC, CAGAT, and CTCC similarly to Robins et al. However, under an ICM null, CodingMotif still finds the motif CTGCTGG to be overrepresented (p-value 2e-5), indicating that the misclassification by the Robins et al method is caused by weakness in the sampling/parameterization approach. Moreover, under the ICM null we find a total of 421 motifs of lengths 3-7 with overrepresentation p-values < 10^{-8}, demonstrating that the high false negative rate of Robins et al is due to the sampling/parameterization approach rather than the lack of dinucleotide effects in the null.

Mammalian splicing motifs

As a test of the ability of CodingMotif to identify biologically relevant motifs, we analyzed the behavior of splicing motifs on the coding sequences in human chromosome 1. Our expectation was that motifs with known activity in coding regions, such as exonic splicing enhancers, would show overrepresentation. Figure ^{2 }= 0.26 (t-test p-value = 0.02) between - log(

**DCM p-values for motifs with known splicing activity**. We observe a correlation between -log(^{2 }= 0.26 (t-test p-value 0.02).

Human transcription factor motifs

This issue of dataset size is important for the applicability of the method, as a common application for motif detection algorithms is to search for functional motifs in targeted experimental datasets such as those determined by chromatin or RNA immunoprecipitation. Because this type of dataset is typically smaller than the genome-scale sets described in the above examples, it can provide a more stringent and practical test of the effectiveness of a motif evaluation program. Neither Itzkovitz et al

Results for GABP are shown in Figure

**Comparison of CodingMotif and parametric methods for known binding motifs**. A) All 4 of the top 4 motifs predicted by CodingMotif p-value are exact matches to the canonical motif for the human transcription factor GABP. For comparison, 3 of the top 4 motifs ranked by z-score, and 1 of the top 4 motifs ranked by the ratio of counts in the real sequence to the average in the null distribution, match the GABP canonical motif. B) All 4 of the top 4 motifs predicted by CodingMotif p-value match the canonical motif for NRSF. For motifs ranked by z-score 0/4 of the top motifs match the canonically known motif. 0/4 of the top motifs ranked by count-ratio match the canonically known motif.

We performed a similar test for the transcription factor NRSF also using data from

We have reported results for 6-mers rather than longer k-mers because we observed that for

We also analyzed whether sampling without resorting to parametric approximations could yield accurate motif predictions. For the GABP dataset, we obtained 100 randomized dicodon shuffles of the data using the method of

Evaluation on synthetic data

Finally, we tested CodingMotif on synthetic data to estimate what types of counts may be necessary for it to successfully identify motifs. We generated 20 random sequences each 350 codons long (comparable to real protein lengths) according to the DCM Markov model using human coding sequences to train the null and assuming that the 3'-most codon was a stop codon. We then picked from 1-15 of these sequences and inserted a copy of the motif into each by replacing randomly chosen positions. If replacement would create a premature stop codon, another location was chosen. We performed this test for each of the 4096 6-mers 10 times for each number of inserted motifs ranging from 1 to 15. The average and standard deviation of log

For 6-mers, a p-value better than 4^{-6 }= 0.0002 is an appropriate significance threshold taking into account multiple testing. As can be seen in Figure

**Motif p-values on synthetic data**. For a randomly generated set of 20 sequences of 350 codons, copies of a motif were overwritten onto random positions within the sequences. CodingMotif and ideal z-score based p-values as a function of the number of inserted copies were calculated. This procedure was performed 10 times for each of the 4096 possible 6-mers. CodingMotif plotted values indicate average and standard deviation of log p-values. Z-score plotted values indicate the value of the erfc function when applied to the average z-score. Standard deviations of z-score based p-values were similar to those of CodingMotif (data not shown).

For comparison, we also calculated the average z-score for each motif across these runs, where the z-score was calculated from the exactly enumerated distribution returned by CodingMotif. We observed that the z-score based p-values were systematically too weak (by about one order of magnitude) at 9 or fewer inserted motif copies, though as for CodingMotif there was strong variation from motif-to-motif (data not shown). While CodingMotif tends to exhibit greater sensitivity at these lower copy numbers, this systematic effect is probably less important than the fact that CodingMotif p-values are more accurate for individual motifs. For greater than 9 inserted copies, z-score based p-values are systematically lower than those of CodingMotif. However, both CodingMotif and the z-score method have very significant p-values (much less than 4^{-6}) at this range of copy numbers, so this systematic difference is again probably less important than the differences for individual motifs.
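The difference between an exact tail p-value and a z-score based p-value taken from the same count distribution can be illustrated with a small sketch. The assumptions here are ours: the null distribution is a list indexed by motif count, the z-score is converted to a p-value with the standard normal upper tail 0.5 · erfc(z/√2), and the toy binomial null is chosen only for illustration; none of the function names come from CodingMotif.

```python
import math


def exact_over_pvalue(dist, observed):
    """Exact overrepresentation p-value: P(count >= observed) under the null."""
    return sum(dist[observed:])


def zscore_pvalue(dist, observed):
    """z-score of `observed` and its normal-approximation upper-tail p-value."""
    mean = sum(n * p for n, p in enumerate(dist))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(dist))
    z = (observed - mean) / math.sqrt(var)
    return z, 0.5 * math.erfc(z / math.sqrt(2))


# Toy null (an assumption for illustration): 20 sites, each carrying the
# motif independently with probability 0.05, i.e. a binomial distribution.
sites, prob = 20, 0.05
null = [math.comb(sites, n) * prob ** n * (1 - prob) ** (sites - n)
        for n in range(sites + 1)]

for observed in (1, 3, 5):
    z, p_normal = zscore_pvalue(null, observed)
    p_exact = exact_over_pvalue(null, observed)
    print(observed, p_exact, p_normal)
```

For a skewed, low-count null such as this one, the two p-values can differ substantially in the tail, which is the regime in which an exactly enumerated distribution is most useful.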

Human synonymous constraint elements

Recently, Lin et al developed a method to detect elements in coding regions likely to be under constraint based on their synonymous conservation across 29 mammalian genomes (SCEs)

Software usage and caveats

A software implementation of CodingMotif is available at bioinformatics.bc.edu/chuanglab/codingmotif.tar. We have extended the algorithms described above to allow CodingMotif to calculate p-values for degenerate motifs (e.g. AGACT[A/G]) defined by a set of k-mers. These can be evaluated together, such that an occurrence of any of the k-mers constitutes a match to the degenerate motif. This requires only a minor modification to the counting procedure in the calculation of the distribution function. Note that this k-mer set approach is more general than using IUPAC symbols to handle degeneracy, since IUPAC symbols cannot handle base correlations within a motif. The k-mer set functionality can also be used to handle motifs that could appear on either the forward or reverse strand, e.g. by placing reverse complements such as [AACCTG/CAGGTT] together in a set. In addition, we have written a wrapper allowing CodingMotif to evaluate multiple motifs, each of which may be defined by a set of k-mers, in succession. CodingMotif handles motifs of arbitrary length. For a given run, CodingMotif can return the motif count in the input, its p-value, the count distribution in the null, the mean number of counts in the null, and the z-score for the motif. Underrepresentation p-values can be straightforwardly calculated as 1 minus the overrepresentation p-value. We have demonstrated that p-values for all 4096 6-mers can be calculated for dataset sizes on the scale of several hundred kb in a few hours on a single workstation. Calculations for larger datasets can be trivially parallelized by distributing motif runs across CPUs. The code is open source and written in C++.
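As an illustration of the k-mer set idea, the sketch below builds a strand-symmetric set from a motif and counts matches to any member of a set. It is written in Python for brevity and does not reflect the interface of the C++ program; the function names are hypothetical, and counting each start position at most once is an assumption of this sketch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA k-mer written with the letters A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def both_strands(kmers):
    """k-mer set extended with the reverse complement of every member."""
    return set(kmers) | {reverse_complement(k) for k in kmers}


def count_matches(sequence, kmer_set):
    """Number of start positions at which any k-mer in the set occurs.

    Overlapping occurrences are counted; a position matched by several
    members of the set is counted once.
    """
    return sum(
        1
        for start in range(len(sequence))
        if any(sequence.startswith(kmer, start) for kmer in kmer_set)
    )


# Degenerate motif AGACT[A/G] expressed as an explicit k-mer set.
degenerate = {"AGACTA", "AGACTG"}
# Either-strand motif built from a single forward-strand k-mer.
either_strand = both_strands({"AACCTG"})

print(sorted(either_strand))
print(count_matches("AGACTAAGACTG", degenerate))
```

An explicit set can also encode correlated positions, for example {"AGACTA", "TGACTG"}, which a single IUPAC string cannot express without also admitting AGACTG and TGACTA.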

CodingMotif takes FASTA files as input. Note that input sequences which are not made up of full codons are conceptually inconsistent with the amino acid-conditioned null model, as hanging bases can match with many possible amino acids. The ends of sequences that begin or end out of the canonical codon frame should be repaired to full codons before input to CodingMotif, e.g. by truncation of hanging ends. Full documentation for CodingMotif can be found in the downloadable tar file.
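Truncation of hanging ends can be done with a short preprocessing step before the sequences are written to the input file. The following is a minimal sketch, assuming the reading frame of each sequence is known; the function name and the `frame_offset` parameter are hypothetical and are not part of CodingMotif.

```python
def trim_to_full_codons(sequence, frame_offset=0):
    """Truncate hanging bases so that the result consists of whole codons.

    `frame_offset` is the number of bases (0, 1 or 2) that precede the
    first complete codon at the 5' end of `sequence`.
    """
    start = frame_offset % 3
    usable = max(0, len(sequence) - start)
    end = start + usable - usable % 3
    return sequence[start:end]


# One hanging base at the 3' end is removed.
print(trim_to_full_codons("ATGGCCA"))
# One hanging base at the 5' end and two at the 3' end are removed.
print(trim_to_full_codons("GATGGCCTA", frame_offset=1))
```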

It is worth discussing what types of motifs CodingMotif will work best for. The results on NRSF and GABP are based on overrepresentation of exact 6-mers, which are appropriate because binding sites for these two transcription factors both have a relatively strong signal for exact 6-mer sequences as evidenced in their sequence logos (Figure

Similar issues also affect the power of CodingMotif for building a target classifier. For example, a simple type of classification would be whether a sequence does or does not have a copy of a motif determined to be overrepresented by CodingMotif. For the GABP data, we observe that 65% of the sequences have a copy of at least one of the top 4 CodingMotif hexamers from Figure

Conclusions

CodingMotif provides an exact non-parametric method for calculating overrepresentation p-values of motifs in coding regions, a previously unsolved problem. We have shown that CodingMotif is able to accurately detect functional motifs in a variety of prokaryotic and eukaryotic datasets, with run times that are practical on a single workstation. Prior works have been based on sampling, an approach limited by the infeasibility of sampling more than a tiny fraction of the sequence space and by the use of parametric approximations for the motif count distribution. We have demonstrated that CodingMotif performs better than such methods using representative experimental data, including human transcription factor ChIP-seq data overlapping coding regions.

CodingMotif provides a theoretically and empirically improved approach over prior methods to identify unusually overrepresented motifs in coding regions. We expect it to be useful for the study of a wide variety of functional genomic problems, notably DNA-protein binding, RNA-protein binding, and microRNA-RNA binding.

Methods

Independent codon model

We first consider a method to identify overrepresented motifs in a coding sequence conditional on the amino acid sequence, under the assumption that each codon in the sequence is independent. Specifically, we calculate the overrepresentation or underrepresentation of a motif in a set of protein-coding sequences of total length

We are given an amino acid sequence _{1}, _{2},..., _{n }and a motif ^{N}) different coding nucleotide sequences that translate into the same amino acid sequence _{a }as the set of all such nucleotide sequences. Let _{1}, _{2}, ..., α_{N }∈ _{A}.

The synonymous coding sequences will not contribute equally to the expectation. This is because, even in the absence of selection on motifs, amino acids have preferences for codon usage. The null model for codon usage can be set as the codon usage in a reference set, which we typically choose to be the set of all coding sequences genome-wide. This provides a background probabilistic model to weight the synonymous coding sequences.

A direct enumeration of all such sequences is prohibitive, since their number grows exponentially with N. Therefore we have devised a dynamic programming approach to exactly calculate the distribution of

Here

where the individual codon probabilities are determined from the reference codon usage table for the corresponding amino acid. Since the weightings are conditional on the amino acid sequence, the probabilities for the codons in a synonymous group sum to one.

The distribution can be calculated by an inductive approach. One calculates the distribution of motif occurrences in the subsequence defined by the first k + 1 codons from the distribution of motif occurrences in the first k codons. Iterating until all N codons have been included yields the distribution for the full sequence, which is the desired distribution.

To perform the dynamic programming calculation, at a given iteration _{k}) conditioned on the possible codon strings in the last Δ - 1 codons {_{k-Δ+2 }... _{k}}. Δ is the maximum number of codons that a given instance of the motif can overlap, i.e. for motif length _{k}_{k-Δ+2 }..._{k}).

We will need these distributions for all possible values of the codons {_{k-Δ+2 }... _{k}}. Note that since the maximum number of copies of a motif scales with ^{Δ-1}^{st }codon to the first

The induction step requires a convolution calculation using all of the _{k}, {_{k-Δ+2 }... _{k}}) functions. In this step, one counts the number of copies of the motif in each possible set of Δ codons consistent with the amino acids in positions

where _{k-Δ+2 }_{k+1}) ≡ the number of copies of the motif that end in the last codon of {_{k-Δ+2 }... _{k+1}}. The sum is over all values of _{k-Δ+2 }consistent with the amino acid _{k-Δ+2}. When the end of the sequence is reached, the final value of _{N}) is calculated from the weighted sum of the _{N}

The probabilities in equation 5 can be calculated directly as

Note that all of these calculations can be done in either the 5' to 3' or 3' to 5' direction. In practice, we use the 3' to 5' direction, as this is necessitated by the way in which the Dinucleotide-corrected Codon Model (described below) is implemented.
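The induction under the Independent Codon Model can be made concrete with a compact sketch. It is a simplified illustration rather than the CodingMotif implementation: it runs 5' to 3', conditions on the trailing nucleotides (at most one fewer than the motif length) instead of on the last Δ - 1 codons, and uses a uniform codon usage table as a stand-in for a reference-derived one. All function names are hypothetical. A brute-force enumeration is included so that the dynamic programming result can be checked on short peptides.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def uniform_usage():
    """Toy null (an assumption): synonymous codons are equally likely."""
    groups = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        groups[aa].append(codon)
    return {aa: {c: 1.0 / len(cs) for c in cs} for aa, cs in groups.items()}


def count_new(tail, codon, motif):
    """Motif copies whose last base lies inside `codon` appended to `tail`."""
    text, m = tail + codon, len(motif)
    first_end = max(m, len(tail) + 1)  # exclusive end index of the earliest hit
    return sum(1 for end in range(first_end, len(text) + 1)
               if text[end - m:end] == motif)


def motif_count_distribution(peptide, motif, usage):
    """Exact P(number of motif copies) over all coding sequences of `peptide`."""
    m = len(motif)
    states = {"": {0: 1.0}}  # trailing bases -> {motif count: probability}
    for aa in peptide:
        new_states = defaultdict(lambda: defaultdict(float))
        for tail, dist in states.items():
            for codon, weight in usage[aa].items():
                added = count_new(tail, codon, motif)
                new_tail = (tail + codon)[-(m - 1):] if m > 1 else ""
                for n, p in dist.items():
                    new_states[new_tail][n + added] += p * weight
        states = new_states
    total = defaultdict(float)
    for dist in states.values():  # marginalize over the trailing bases
        for n, p in dist.items():
            total[n] += p
    return [total.get(n, 0.0) for n in range(max(total) + 1)]


def brute_force(peptide, motif, usage):
    """Same distribution by explicit enumeration (short peptides only)."""
    total = defaultdict(float)
    for choice in product(*(usage[aa].items() for aa in peptide)):
        seq = "".join(codon for codon, _ in choice)
        weight = 1.0
        for _, p in choice:
            weight *= p
        hits = sum(1 for i in range(len(seq) - len(motif) + 1)
                   if seq.startswith(motif, i))
        total[hits] += weight
    return [total.get(n, 0.0) for n in range(max(total) + 1)]


usage = uniform_usage()
dist = motif_count_distribution("MLSRL", "CTG", usage)
observed = 1
print(dist, sum(dist[observed:]))  # distribution and P(count >= observed)
```

The number of states is bounded by the number of distinct trailing strings, so the work per codon does not depend on the sequence length, whereas the brute-force enumeration grows exponentially with the number of codons.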

Optimization for sparse motifs

For most amino acid sequences, the possible locations of the motif consistent with the genetic code are sparsely distributed. That is, depending on the motif, there can be large portions of the amino acid sequence where no motif is possible for any consistent choice of codons. Inductively calculating the motif occurrence distribution _{k}), the motif occurrence distribution for the subsequence _{1}_{2}, ..., _{k}, where _{1}_{2 }... _{k }and end in _{k+1}_{k+2 }... α_{N}. Such a motif must occur in the subsequence _{k-Δ}_{k-Δ+1 }... α_{k }_{k + Δ-1}_{k+Δ}. The calculation of whether such a motif exists is then constant time for a fixed motif length.

If a motif instance is possible, we continue the induction. However, if not, then the distribution on the traversed sequence is independent of the distribution for the rest of the sequence. We therefore store the current motif occurrence distribution, denoted as _{c }_{k}) and scan forward in the sequence until we find the next codon _{k'}). This process is repeated until the end of the amino acid sequence. We can calculate the complete distribution by convolving all such distributions. The advantage of this approach is that any regions for which a motif instance is impossible (positions

Convolution calculation

A convolution of two distributions can be calculated by considering the values in each distribution as coefficients of two generating functions and then multiplying the two generating functions. Term-by-term multiplication of the two generating functions will take time

We tested both the direct and FFT approaches. For the motif lengths we investigated (4 - 7 bp), the FFT approach is not noticeably faster than the direct polynomial multiplication. This is because the convolution calculations involve a large number of multiplications in which

Dinucleotide-corrected codon model

Because we were concerned that the Independent Codon Model (ICM) did not sufficiently account for neutral dinucleotide biases, we implemented a dinucleotide-corrected codon model (DCM) which includes dinucleotide biases in the null model. The DCM uses a Markov model to generate the sequence, starting from the 3'-most codon and working backward to the 5' end. This choice of direction simplifies the calculation, since for most amino acids specification of the amino acid fixes the 5'-most nucleotide of a codon. Note that although the program is run in this direction, the results sections describe motifs in the standard 5' to 3' direction.
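The conditioning used by the DCM can be sketched as follows: codon usage is tabulated separately for each combination of amino acid and first base of the adjacent 3' codon, and a sequence is then weighted codon by codon from the 3' end. This is an illustrative sketch under our own assumptions, not the CodingMotif code: the reference is treated as a single in-frame string of unambiguous bases, the 3'-most reference codon (which has no downstream neighbor) is skipped when counting, and the function names and the `downstream_base` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def conditional_codon_usage(reference):
    """P(codon | amino acid, first base of the adjacent 3' codon) from counts."""
    whole = len(reference) - len(reference) % 3
    codons = [reference[i:i + 3] for i in range(0, whole, 3)]
    counts = defaultdict(Counter)
    for codon, next_codon in zip(codons, codons[1:]):
        counts[(CODON_TO_AA[codon], next_codon[0])][codon] += 1
    return {
        context: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for context, ctr in counts.items()
    }


def sequence_weight(codons, usage, downstream_base):
    """Null probability of a codon sequence given its amino acids (3' to 5').

    `downstream_base` is the first nucleotide 3' of the sequence, used as the
    starting point of the chain. Contexts never seen in the reference get
    weight 0 in this toy version.
    """
    weight = 1.0
    next_base = downstream_base
    for codon in reversed(codons):
        context = (CODON_TO_AA[codon], next_base)
        weight *= usage.get(context, {}).get(codon, 0.0)
        next_base = codon[0]
    return weight


reference = "GCTAAAGCCAAAGCTGGG"  # toy reference: Ala Lys Ala Lys Ala Gly
usage = conditional_codon_usage(reference)
print(usage[("A", "A")])
print(sequence_weight(["GCC", "AAA", "GCT"], usage, downstream_base="G"))
```

Because the last base of each codon and the first base of the next codon are tied together through the conditioning, dinucleotides that straddle codon boundaries are represented in the null, which the Independent Codon Model cannot do.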

To specify the Markov model, we use the conditional codon usage table as observed in the reference sequence. The probability of choosing a codon is conditioned upon the amino acid of the current codon as well as the first nucleotide of the adjacent 3' codon. Formally, let

In the DCM, sequences are then generated from the 3' to the 5' end with probabilistic weighting

Here we have written _{i}) to refer to the 1st base of codon _{i}, treating

By iterating through equation 8 we can calculate the probability of the complete sequence given the amino acids. One source of ambiguity is how to treat the 3'-most codon. Our rule is to use the first nucleotide 3' to the sequence as the starting point of the probability assignment. This is a minor assumption since for most amino acids the first base of the codon is forced. If the sequence is a whole gene then we require

With this approach in mind, calculating the motif occurrence distribution is analogous to the ICM case. One can apply equation 4 but with the substitution of the conditional probability

for the probability factor. When the 5' end of the sequence is reached, the final value of _{μ}(_{N-Δ+2 }... _{N}}) for equation 6 should again use conditional probabilities. For the DCM, this means replacing equation 6 with

Preservation of codon and dinucleotide usage by DCM

The purpose of using the DCM null instead of the ICM was to preserve the dinucleotide frequency and codon usage found in the reference sequence. Here we provide an argument for why the DCM model can closely preserve these quantities. Due to the structure of the genetic code, specification of an amino acid usually fixes the first nucleotide of the underlying codon. Only for the amino acids Ser, Arg and Leu is there degeneracy in the first nucleotide. Here we show that, under the simplifying assumption that specifying the amino acid fixes the first nucleotide of the codon for every amino acid, the codon and dinucleotide usage generated by the Markov process equal the codon and dinucleotide usage in the reference sequence. Because this assumption is approximately true for the real genetic code, codon and dinucleotide usage will be well-preserved by the DCM model.

Suppose in our amino acid sequence that amino acids

where the counts

Denote the number of occurrences of _{b }

In the second step we have made use of the fact that each amino acid

To see that the Markov process preserves dinucleotide counts, we again assume the idealized case in which specification of an amino acid also specifies the first base of the underlying codon. Denote _{xy}(

The expected number of copies of

Plugging in

Coding and UTR lengths

For the initial coding and UTR length analysis, all gene transcripts from the human genome were downloaded from Ensembl v63. Lengths were calculated using all transcripts having simultaneous 5' UTR, 3' UTR, and coding region annotations. The observed lengths were: 5' UTR 180 bp (σ = 340 bp), 3' UTR 820 bp (

GABP and NRSF analysis

We downloaded ChIP-seq peaks for the NRSF monoclonal antibody and GABP datasets of

Human synonymous constraint elements

The SCE9 dataset was obtained from

Authors' contributions

YD contributed to the design of the algorithms, wrote software, and contributed to the writing of the manuscript. WL contributed to the design of the algorithms, wrote software, and contributed to the writing of the manuscript. JHC contributed to the design of the algorithms, contributed to the writing of the manuscript, and oversaw the project. All authors read and approved the final manuscript.

Acknowledgements

We thank Kourosh Zarringhalam and Peter Clote for discussions. JHC was supported by National Science Foundation Award 0850155 as part of the American Recovery and Reinvestment Act.