Email updates

Keep up to date with the latest news and content from BMC Genomics and BioMed Central.

Open Access Research article

A codon substitution model that incorporates the effect of the GC contents, the gene density and the density of CpG islands of human chromosomes

Kazuharu Misawa

Author Affiliations

Research Program for Computational Science, Research and Development Group for Next-Generation Integrated Living Matter Simulation, Fusion of Data and Analysis Research and Development Team, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan

BMC Genomics 2011, 12:397  doi:10.1186/1471-2164-12-397

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/12/397


Received:31 January 2011
Accepted:6 August 2011
Published:6 August 2011

© 2011 Misawa; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Developing a model for codon substitutions is essential for the analyses of protein sequences. Recent studies on the mutation rates in the non-coding regions have shown that CpG mutation rates in the human genome are negatively correlated to the local GC content and to the densities of functional elements. This study aimed at understanding the effect of genomic features, namely, GC content, gene density, and frequency of CpG islands, on the rates of codon substitution in human chromosomes.

Results

Codon substitution rates of CpG to TpG mutations, TpG to CpG mutations, and non-CpG transitions and transversions in humans were estimated by comparing the coding regions of thousands of human and chimpanzee genes and inferring their ancestral sequences by using macaque genes as the outgroup. Since the genomic features are depending on each other, partial regression coefficients of these features were obtained.

Conclusion

The substitution rates of codons depend on gene densities of the chromosomes. Transcription-associated mutation is one such pressure. On the basis of these results, a model of codon substitutions that incorporates the effect of genomic features on codon substitution in human chromosomes was developed.

Keywords:
Rate of molecular evolution; CpG hypermutability; codon substitution; gene density; chromosomal GC content

Background

Accurate models of DNA and protein evolution are essential for studying molecular evolution. Models of DNA and protein evolution are used in homology searches [1], sequence alignments [2,3], gene finding [4], detection of natural selection [5-8], and reconstruction of phylogenetic trees [9,10].

Recent studies on the mutation rates in non-coding regions have shown that the mutation rates of CpGs in the human genome are negatively correlated to the local GC content [11-14] and to the densities of functional elements [11].

CpG hypermutability [15] is known to be one of the major causes of codon substitution in mammalian genes [16-19]. CpG dinucleotides are often methylated at C, and methylated C spontaneously deaminates to thymine (T) approximately 10 times [20] or more [21] rapidly than other types of point mutations. Approximately 14% of codon substitutions are caused by CpG hypermutations [22]. Statistical indications of positive selection in some sequences or individual codons may be caused by a difference in mutation rates in the synonymous and nonsynonymous sites because of CpG hypermutations [23]. Previous studies that focused on the effect of CpG hypermutation on the rate of amino acid change in the human lineage [22,24] did not incorporate the regional variation in the rates of codon substitution. Thus, the regional variation in codon substitutions, especially those related to CpG hypermutation, in the human genome is of interest.

To understand the effect of these genomic features such as GC content, gene density, and frequency of CpG islands on the rates of codon substitution, the rates of codon substitution were estimated by using the coding regions of thousands of human and chimpanzee genes and inferring their ancestral sequences by assuming macaque genes as the outgroup. Since the genomic features are interdependent, partial regression coefficients of these features were obtained. Multiple regression analyses to evaluate the effect of GC content, gene density, and CpG island density on the rates of CpG to TpG substitutions, TpG to CpG substitutions, and non-CpG transitions and transversions were conducted.

Results

The number of genes on each chromosome is listed in Table 1. Table 1 also shows the GC content and the gene and CpG island density on human chromosomes.

Table 1. Summary of data

Figure 1 is a scatter plot of the synonymous rates of CpG to TpG and TpG to CpG substitutions, non-CpG transitions and transversions, and GC content of the chromosomes. The open squares (□) show the synonymous substitution rate on autosomes. The closed squares show the synonymous substitution rate on the X chromosome. In all the cases, substitution rates on the X chromosome are within the range of those on the autosomes.

thumbnailFigure 1. Scatter plot of CpG and non-CpG substitution rates that are positively correlated to the number of CpG islands per site on the chromosome. Open squares: Synonymous C to T substitution rate at CpG sites on the autosomes. Closed squares: Synonymous C to T substitution rate at CpG Sites on the X chromosome.

Figure 1 also shows that the synonymous CpG to TpG substitution rates were negatively correlated to the GC content of the chromosomes. In contrast, TpG to CpG synonymous substitution rates were positively correlated to the GC content of the chromosomes. As seen in Table 2, these correlations were statistically significant (P < 0.001), and all types of substitutions have a significantly negative correlation to the gene density (P < 0.001).

Table 2. Correlation coefficient analysis between ln (substitution rate ×108) and chromosomal features

Since the genomic features are interdependent, partial regression coefficients of these features were obtained. Table 3 shows the rates of CpG to TpG and TpG to CpG substitutions and non-CpG transitions and transversions. Except TpG to CpG substitutions, all types of substitutions show a significant correlation with Grantham's distance, GC content, gene and CpG island density, and chromosome size (P < 0.001).

Table 3. Partial coefficient of Grantham distance, GC content, gene density, CpG island density, and chromosome size obtained by multiple regression analysis

Table 4 shows the difference between correlation coefficient squared (r2) obtained by single regression to Grantham distance and adjusted r2 obtained by multiple regression to Grantham distance, GC content, gene density, CpG island density. This table shows that including GC content, gene density, CpG island density to the model increases r2 and improves the ability of prediction of the substitution rates.

Table 4. Correlation coefficients squared (r2) between ln (Rate ×108) and chromosomal features

On the basis of these results, a model that incorporates these genetic features was developed. Parameter x is calculated as the sum of the intercept and the partial correlation coefficient multiplied by the genetic features, as shown in Table 1. For example, x for a synonymous CpG to TpG substitution on chromosome 1 is calculated as follows: 5.13 + (5.13 × 0) - (7.13 × 10-2 × 41.74) + (7.21 × 10-3 × 8.30 - 4.23 × 10-2) × (33.69 - 3.83 × 10-9) × 249250621. The codon substitution rate can be further calculated as exp(x).

Discussion

In this study, codon substitution rates of CpG to TpG mutations, TpG to CpG mutations, and non-CpG transitions and transversions were estimated by comparing the coding regions of thousands of human and chimpanzee genes from entire genome and inferring their ancestral sequences by assuming macaque genes as the outgroup.

As seen in Figure 1, a marked difference was not observed between the synonymous substitution rates of autosomes and X chromosome. This result indicates that the effect of "male-driven evolution" [25] is not strong. The male-driven evolution is a phenomenon in which the number of mutations is proportional to the number of germ-line cell divisions, so that the sperms have a higher rate of mutation than the eggs. There are large discrepancies among the results of previous studies with regard to the effect of male-driven evolution on mutation rates [26] In this study, a large variation in the rates of the substitution caused by CpG hypermutation among chromosomes and genes was observed. The mutation rate of the X chromosome was within the range of the mutation rate among autosomes, which is expected considering the variation in substitution rates among chromosomes. Thus, this finding suggests that the effect of male-driven evolution on the mutation rate was less than that of regional variation.

Vogel and Motulsky [27] pointed out that since the deamination of methyl-C occurs spontaneously and is independent of DNA replication, the rate of CpG mutations should be scaled with time and not with the number of cell divisions. Recently, Taylor et al. [14] investigated male mutation bias separately at non-CpG and CpG sites by using human-chimpanzee whole-genome alignments. They concluded that CpG hypermutation is weakly affected by the number of cell divisions. In this study, I demonstrated that the effect of male-driven evolution on CpG hypermutation is less strong than that of other chromosomal properties. Thus, although our result is similar to that of Vogel and Motulsky [27], I cannot exclude the findings of Taylor et al. [14]. Variation in the mutation rates among chromosomes must be one of the causes of the discrepancies among the results of previous studies with regard to the effect of male-driven evolution on mutation rates. Therefore, further study is warranted in this area.

Table 2 indicated that the CpG to TpG substitution rates were negatively correlated to the GC content of the chromosomes. This is consistent with the recent finding that the CpG mutation rates in the non-coding regions of human genes are positively correlated to the local GC content [11-14]. On one hand, table 2 showed that other types of substitution rates were positively correlated to the GC content of the chromosomes. On the other hand, multiple regression analyses shown in table 3 indicated that other types of substitution rates were negatively correlated to the GC content but the correlation was not significant. It should be noted that GC content, gene density, and frequency of CpG islands were correlated to each other. Discrepancy of these coefficients indicates that the effects of genetic features other than GC content on TpG to CpG substitution rates might have overcome that of GC content.

The results presented in this study also suggest that codon substitution rates are positively correlated to gene density. The reason of this correlation is unclear, but it might be due to fact that DNA melting is the rate-limiting step in cytosine deamination [28]. The difference between CpG hypermutability of the coding and non-coding regions may be due to the difference in the reason of DNA melting in the 2 regions. In the case of the coding regions, the cause of DNA melting is transcription; the DNA molecule separates into 2 single-strands before being transcribed. The frequency of transcription was positively correlated to the expression level of the gene, which in turn was positively correlated to the number of CpG islands [29]. In contrast, the reason for DNA melting in the non-coding region is not transcription but denaturation. The base pairs open one at a time, far below the melting temperature, and the half-life of individual GC base pairs are approximately 3 times longer than those of individual AT base pairs [30], so that the rate of CpG hypermutability in the non-coding region is positively correlated to the local GC content. If this is the case, CpG hypermutability will depend on the expression level of the genes. Recent study on mouse genes showed that the effect of gene expression level on codon bias is weaker than both the effect of gene expression level on amino acid composition and the effect of CpG hypermutability on codon bias [31], so that effect of gene expression level on mutation rate would be relatively small. Further study is needed in this area.

In this study, the codon substitution rates were estimated by using the coding regions of thousands of human and chimpanzee genes and inferring their ancestral sequences by assuming macaque genes as the outgroup. All types of substitutions show a significant negative correlation to Grantham's distance. This result indicates that the properties of amino acids are one of the major factors influencing the variation among non-synonymous substitutions, as previously reported [22].

The reason for the specific correlation found between codon substitution rates and chromosome sizes might be the physical distance between the genes and telomeres. Tyekucheva et al. [11] suggested that distances from telomeres play important roles on mutation rates. However, the numbers of substitutions are not very large; thus, dividing too many bins by the distance from the telomeres may yield weak results. As more data become available, incorporating these additional predictors in the regression analyses may be beneficial. Of special interest would be data on species other than humans.

On the basis of these results, a model that incorporates these genetic features was developed. The result shown above indicates that including GC content, gene density, CpG island density to the model improves the ability of prediction of the substitution rates

Conclusion

The substitution rates of codons depend on gene densities of the chromosomes. Transcription-associated mutation is one such pressure. On the basis of these results, a model of codon substitutions that incorporates the effect of genomic features on codon substitution in human chromosomes was developed.

Methods

Data Set

To estimate the rates of codon substitutions, we used 10,372 orthologous gene trios obtained from human, chimpanzee, and macaque genomes [32]. The alignment of these genes comprised 4,717,227 codon triplets. Macaque genes were used as the outgroup. It was assumed that the human-chimpanzee divergence occurred 5 million years (MY) ago [33].

Estimation of codon substitution rates by using the maximum parsimony (MP) method

I determined the codon sequences of the common ancestors of humans and chimpanzees by using the maximum parsimony (MP) method. Next, I counted the number of codon substitutions that had occurred along the human lineage. For some codon trios, the ancestral state between the human and chimpanzee codons appeared ambiguous when estimated by the MP method. In such cases, I treated all possible ancestral states equally. I also calculated the substitution rates by dividing the number of codon substitutions occurring annually by the number of ancestral codons. The awk program used in the analysis is available from the author upon request.

Classification of codon substitutions

Because codon substitutions occur at different rates [22,24], they were classified into 4 categories: CpG to TpG substitutions, TpG to CpG substitutions, and non-CpG transitions and transversions. To distinguish CpG to TpG substitutions and TpG to CpG substitutions from non-CpG transitions, the adjacent nucleotides were also considered. For example, a codon substitution from CGC (Arg) to TGC (Ser) corresponds to a CpG to TpG substitution. A codon substitution from GGT, which codes Gly, to AGT, which codes Ser, corresponds to one CpG to TpG substitution if the third nucleotide of the 5'-adjacent codon is C. In contrast, a codon substitution from GGT to AGT is a non-CpG transition mutation if the third nucleotide of the 5'-adjacent codon is not C. If the observed nucleotide was T, its ancestral nucleotide was C, and the downstream nucleotide was G, the mutation was classified as a CpG type. If the observed nucleotide was A, its ancestral nucleotide was G, and the upstream nucleotide was C, the mutation was again classified as a CpG type. If the observed nucleotide was C, its ancestral nucleotide was T, and the downstream nucleotide was G, the mutation was classified as a TpG type. If the observed nucleotide was G, its ancestral nucleotide was A, and the upstream nucleotide was C, the mutation was again classified as a TpG type. Other types of substitutions are classified into two types: transitions and transversions [22,24,31].

Evaluation of the chemical properties of amino acids

The frequency of amino acid substitutions relative to the frequency expected by chance decreases linearly with the increase in physicochemical differences between the amino acid pairs involved in a substitution. Grantham [34] proposed a formula for estimating the difference between amino acids; this formula combines the properties of amino acids such as composition, polarity, and molecular volume, which correlate best with the substitution frequencies of protein residues. In this study, I compared codon substitution rates with Grantham's [34] distance.

Comparison between codon substitution rates and GC content, gene density and the density of CpG islands by using multiple regression analysis

I compared the codon substitution rates with the GC content and the gene and CpG island density, which are defined as the number of genes and CpG islands per MB of the chromosome, respectively. The GC content of each chromosome was calculated using the human reference sequence build 37. The gene density of each chromosome was calculated from the data in Ensembl release 57, and the density values of the CpG islands were taken from the study by Varriale and Bernardi [35]. Because of the lack of sufficient data, the relationship GC content, gene density and the density of CpG islands could not be analyzed. The substitution rates were then multiplied by 108 to simplify the calculation and transformed to their natural logarithms, as in a previous study [22]. These values were used in multiple regression analyses, which was performed using the statistical software R [36].

Abbreviations

MP: maximum parsimony; BY: billion years; MY: million years; CpG: a cytosine followed by a guanine; MB: mega bases.

Competing interests

The author declares that they have no competing interests.

Authors' contributions

KM compared DNA sequences, conducted statistical analyses, and wrote the manuscript. KM read and approved the final manuscript.

Acknowledgements and Funding

This study was funded through the National Project titled "Next-generation Integrated Living Matter Simulation" of the Ministry of Education, Culture, Sports, Science and Technology (MEXT). Computation time was provided by the Super Computer System, Human Genome Center, Institute of Medical Science, University of Tokyo. Data matrices used in this study are available from the author upon request.

References

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.

    J Mol Biol 1990, 215(3):403-410. PubMed Abstract OpenURL

  2. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

    Nucleic Acids Res 2002, 30(14):3059-3066. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0.

    Bioinformatics 2007, 23(21):2947-2948. PubMed Abstract | Publisher Full Text OpenURL

  4. Misawa K, Kikuno RF: GeneWaltz--A new method for reducing the false positives of gene finding.

    BioData Min 2010, 3(1):6. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  5. Yang Z, Nielsen R: Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage.

    Mol Biol Evol 2008, 25(3):568-579. PubMed Abstract | Publisher Full Text OpenURL

  6. Hughes AL, Nei M: Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection.

    Nature 1988, 335(6186):167-170. PubMed Abstract | Publisher Full Text OpenURL

  7. Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions.

    Mol Biol Evol 1986, 3(5):418-426. PubMed Abstract | Publisher Full Text OpenURL

  8. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood.

    Mol Biol Evol 2007, 24(8):1586-1591. PubMed Abstract | Publisher Full Text OpenURL

  9. Felsenstein J: Inferring phylogenies. Sunderland, Mass, USA: Sinauer Associates; 2004. OpenURL

  10. Sullivan J, Joyce P: Model selection in phylogenetics.

    Annual Review of Ecology, Evolution, and Systematics 2005, 36:445-466. Publisher Full Text OpenURL

  11. Tyekucheva S, Makova KD, Karro JE, Hardison RC, Miller W, Chiaromonte F: Human-macaque comparisons illuminate variation in neutral substitution rates.

    Genome Biol 2008, 9(4):R76. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  12. Fryxell KJ, Moon WJ: CpG mutation rates in the human genome are highly dependent on local GC content.

    Mol Biol Evol 2005, 22(3):650-658. PubMed Abstract | Publisher Full Text OpenURL

  13. Walser JC, Ponger L, Furano AV: CpG dinucleotides and the mutation rate of non-CpG DNA.

    Genome Res 2008, 18(9):1403-1414. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Taylor J, Tyekucheva S, Zody M, Chiaromonte F, Makova KD: Strong and weak male mutation bias at different sites in the primate genomes: insights from the human-chimpanzee comparison.

    Mol Biol Evol 2006, 23(3):565-573. PubMed Abstract | Publisher Full Text OpenURL

  15. Bird AP: DNA methylation and the frequency of CpG in animal DNA.

    Nucleic Acids Res 1980, 8(7):1499-1504. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Krawczak M, Ball EV, Cooper DN: Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes.

    Am J Hum Genet 1998, 63(2):474-488. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Jukes TH: Codons and nearest-neighbor nucleotide pairs in mammalian messenger RNA.

    J Mol Evol 1978, 11(2):121-127. PubMed Abstract | Publisher Full Text OpenURL

  18. Huttley GA: Modeling the impact of DNA methylation on the evolution of BRCA1 in mammals.

    Mol Biol Evol 2004, 21(9):1760-1768. PubMed Abstract | Publisher Full Text OpenURL

  19. Lunter G, Hein J: A nucleotide substitution model with nearest-neighbour interactions.

    Bioinformatics 2004, 20(Suppl 1):I216-I223. PubMed Abstract | Publisher Full Text OpenURL

  20. Scarano E, Iaccarino M, Grippo P, Parisi E: The heterogeneity of thymine methyl group origin in DNA pyrimidine isostichs of developing sea urchin embryos.

    Proc Natl Acad Sci USA 1967, 57(5):1394-1400. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Kaneko S, Aki I, Tsuda K, Mekada K, Moriwaki K, Takahata N, Satta Y: Origin and evolution of processed pseudogenes that stabilize functional Makorin1 mRNAs in mice, primates and other mammals.

    Genetics 2006, 172(4):2421-2429. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Misawa K, Kikuno RF: Evaluation of the effect of CpG hypermutability on human codon substitution.

    Gene 2009, 431(1-2):18-22. PubMed Abstract | Publisher Full Text OpenURL

  23. Suzuki Y, Gojobori T, Kumar S: Methods for incorporating the hypermutability of CpG dinucleotides in detecting natural selection operating at the amino acid sequence level.

    Mol Biol Evol 2009, 26(10):2275-2284. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Misawa K, Kamatani N, Kikuno RF: The universal trend of amino acid gain-loss is caused by CpG hypermutability.

    J Mol Evol 2008, 67(4):334-342. PubMed Abstract | Publisher Full Text OpenURL

  25. Miyata T, Hayashida H, Kuma K, Mitsuyasu K, Yasunaga T: Male-driven molecular evolution: a model and nucleotide sequence analysis.

    Cold Spring Harb Symp Quant Biol 1987, 52:863-867. PubMed Abstract OpenURL

  26. Li WH, Yi S, Makova K: Male-driven evolution.

    Curr Opin Genet Dev 2002, 12(6):650-656. PubMed Abstract | Publisher Full Text OpenURL

  27. Vogel F, Motulsky AG: Human genetics: problems and approaches. Berlin: Springer-Verlag; 1997.

  28. Fryxell KJ, Zuckerkandl E: Cytosine deamination plays a primary role in the evolution of mammalian isochores.

    Mol Biol Evol 2000, 17(9):1371-1383. PubMed Abstract | Publisher Full Text OpenURL

  29. Antequera F, Bird A: CpG islands as genomic footprints of promoters that are associated with replication origins.

    Curr Biol 1999, 9(17):R661-667. PubMed Abstract | Publisher Full Text OpenURL

  30. Leroy JL, Kochoyan M, Huynh-Dinh T, Gueron M: Characterization of base-pair opening in deoxynucleotide duplexes using catalyzed exchange of the imino proton.

    J Mol Biol 1988, 200(2):223-238. PubMed Abstract | Publisher Full Text OpenURL

  31. Misawa K, Kikuno RF: Relationship between amino acid composition and gene expression in the mouse genome.

    BMC Res Notes 2011, 4(1):20. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  32. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter JC, Wilson RK, Batzer MA, Bustamante CD, Eichler EE, Hahn MW, Hardison RC, Makova KD, Miller W, Milosavljevic A, Palermo RE, Siepel A, Sikela JM, Attaway T, Bell S, Bernard KE, Buhay CJ, Chandrabose MN, Dao M, Davis C, Delehaunty KD, Ding Y, et al.: Evolutionary and biomedical insights from the rhesus macaque genome.

    Science 2007, 316(5822):222-234. PubMed Abstract | Publisher Full Text OpenURL

  33. Horai S, Hayasaka K, Kondo R, Tsugane K, Takahata N: Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs.

    Proc Natl Acad Sci USA 1995, 92(2):532-536. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  34. Grantham R: Amino acid difference formula to help explain protein evolution.

    Science 1974, 185(4154):862-864. PubMed Abstract | Publisher Full Text OpenURL

  35. Varriale A, Bernardi G: Distribution of DNA methylation, CpGs, and CpG islands in human isochores.

    Genomics 2010, 95(1):25-28. PubMed Abstract | Publisher Full Text OpenURL

  36. R Development Core Team: R: A Language and Environment for Statistical Computing.

    Vienna, Austria 2007. OpenURL