Open Access Research article

Context dependent substitution biases vary within the human genome

P Andrew Nevarez1,2, Christopher M DeBoever1,3, Benjamin J Freeland1, Marissa A Quitt1,4 and Eliot C Bush1*

Author Affiliations

1 Department of Biology, Harvey Mudd College, Claremont, CA, USA

2 Department of Biology, Duke University, Durham, NC, USA

3 Division of Biological Sciences, University of California San Diego, La Jolla, CA, USA

4 Division of Biology, California Institute of Technology, Pasadena, CA, USA


BMC Bioinformatics 2010, 11:462  doi:10.1186/1471-2105-11-462

Published: 15 September 2010

Additional files

Additional file 1:

Proof of relative abundance algorithm by mathematical induction. PDF file presenting the proof of the relative abundance algorithm.

Format: PDF. Size: 143KB

This file can be viewed with: Adobe Acrobat Reader

Open Data

Additional file 2:

Effect of sample size on total context bias calculation. To determine the effect of stochastic variation in pattern frequencies on our context bias estimates, we calculated total context bias at a variety of sample sizes. We repeatedly sampled with replacement from our full transposon data set, taking a total of 5380 samples at 120 sample sizes. The plot shows the median total context bias at each sample size against sample size. For comparison, we have also included the no-bias controls. At low sample sizes, stochastic effects elevate context bias. This effect diminishes rapidly with increasing amounts of data.
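The resampling scheme described above can be sketched as follows. This is a minimal illustration and not the code used for the analysis: `toy_bias` is a hypothetical stand-in statistic (total deviation of observed category frequencies from uniform), not the total context bias measure of the paper, and the function names, sample sizes and repetition count are all assumptions chosen for the example.

```python
import random
from collections import Counter
from statistics import median


def toy_bias(sample, categories):
    """Hypothetical stand-in statistic: total absolute deviation of the
    observed category frequencies from a uniform expectation."""
    counts = Counter(sample)
    n = len(sample)
    expected = 1.0 / len(categories)
    return sum(abs(counts[c] / n - expected) for c in categories)


def median_stat_by_size(data, sizes, reps, statistic, seed=0):
    """For each sample size, draw `reps` samples with replacement from
    `data` and return the median of `statistic` over those samples."""
    rng = random.Random(seed)
    result = {}
    for size in sizes:
        values = [statistic(rng.choices(data, k=size)) for _ in range(reps)]
        result[size] = median(values)
    return result


# Toy data set that is perfectly uniform, so its true bias is zero.
categories = ["A", "C", "G", "T"]
data = categories * 250

curve = median_stat_by_size(
    data,
    sizes=[10, 100, 1000],
    reps=50,
    statistic=lambda s: toy_bias(s, categories),
)
for size, value in curve.items():
    print(size, round(value, 4))
```

Even though the toy data contain no bias at all, the stand-in statistic is inflated at the smallest sample size and shrinks as the sample grows, which is the qualitative behaviour the plot is meant to show.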

Format: PDF. Size: 1.2MB

Additional file 3:

Table of top context bias values for 2-5 bp single substitution patterns. We calculated context bias values for all single-substitution 2-5 bp patterns in our transposon data set and in 10 corresponding no-bias control data sets. We used the no-bias controls to determine a p-value for each pattern in the real data. (The no-bias controls tell us how likely we are to get a score this high or higher if there were in reality no bias.) We then used the method of Benjamini and Hochberg (1995) to identify the set of patterns with a false discovery rate of 0.001. Those patterns are given in this table.

Format: CSV. Size: 32KB

Additional file 4:

Comparison of context bias after removing CpG-containing patterns. One possible explanation for the observed differences in context bias is that the methylation process that produces biases at 2 bp is also influenced by context at larger scales. To address this, we calculated context bias for each data set in Figure 2 while excluding substitution patterns that include an ancestral CpG. We find that the effects at 3-5 bp remain, which suggests that bias at these scales does not act through the rate of cytosine deamination at CpG sites.

Format: PDF. Size: 14KB

Additional file 5:

Unweighted total context bias in transposons and non-repetitive sequence. Differences in total context bias between transposons and non-transposons might be due to variation in pattern frequencies rather than differences in the substitution process. To address this, we calculated an unweighted version of eq. 5 across all single-substitution patterns at each pattern size. To do this we simply replaced the term f(P) in eq. 5 with the term 1/N, where N is the total number of patterns. With this new measure, as with total context bias, we find that transposons have more bias than non-transposon sequence at all sizes.
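The substitution of 1/N for f(P) can be illustrated with a short sketch. Eq. 5 is not reproduced in this listing, so the code assumes it has the form of a sum over patterns of f(P) times a per-pattern bias term. That form, the function names and the toy values are all assumptions made for illustration.

```python
def weighted_total(bias, freq):
    """Assumed form of eq. 5: each pattern's bias term is weighted by
    its frequency f(P)."""
    return sum(freq[p] * bias[p] for p in bias)


def unweighted_total(bias):
    """Same sum with f(P) replaced by 1/N, where N is the number of
    patterns, so every pattern contributes equally."""
    n = len(bias)
    return sum(bias[p] / n for p in bias)


# Toy per-pattern bias terms and frequencies, not values from the paper.
bias = {"pattern_a": 1.0, "pattern_b": 3.0}
freq = {"pattern_a": 0.75, "pattern_b": 0.25}

print(weighted_total(bias, freq))
print(unweighted_total(bias))
```

In this toy case the rare pattern carries the larger bias term, so the unweighted total exceeds the weighted one. Comparing the two measures separates the effect of pattern frequencies from the effect of the per-pattern terms.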

Format: PDF. Size: 46KB

Additional file 6:

Distribution of context bias differences between human lineage data sets. We found that total context bias differs between types of sequence, for example between transposons and non-repetitive sequences. One question is where this difference originates. It is not due to patterns that are unique to one data set or the other. Another question is whether the difference is due to a few shared patterns or to many. Here we compare context bias values for shared patterns. For example, in (A) we consider 2 bp patterns. For each pattern we calculate the transposon value minus the non-repetitive value. We then sort these differences from large to small and plot them by rank. The y value of the plot is the cumulative context bias difference. The horizontal line represents the total context bias value for all patterns. As can be seen, most of the final total context bias value is due to a few patterns that differ greatly between transposons and non-repetitive sequence. A-D represent transposon vs. non-repetitive for 2-5 bp, E-H represent near-far for transposon sequences, and I-L represent far-near for non-repetitive sequences.
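The ranking and accumulation behind each panel can be sketched as follows. This is a minimal illustration, assuming per-pattern context bias values are held in dictionaries keyed by pattern. The function name, the pattern labels and the toy values are hypothetical.

```python
from itertools import accumulate


def cumulative_difference(bias_a, bias_b):
    """Differences (a minus b) over patterns present in both data sets,
    sorted from large to small, then accumulated by rank."""
    shared = bias_a.keys() & bias_b.keys()
    diffs = sorted((bias_a[p] - bias_b[p] for p in shared), reverse=True)
    return list(accumulate(diffs))


# Toy per-pattern values, not values from the paper.
transposon = {"x": 0.5, "y": 0.25, "z": 1.0}
non_repetitive = {"x": 0.25, "y": 0.25, "w": 2.0}

curve = cumulative_difference(transposon, non_repetitive)
print(curve)
```

Patterns found in only one data set ("z" and "w" here) are dropped before ranking. A curve that reaches its final height within the first few ranks indicates that a small number of shared patterns account for most of the difference.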

Format: PDF. Size: 624KB