# Context dependent substitution biases vary within the human genome

^{1} Department of Biology, Harvey Mudd College, Claremont, CA, USA

^{2} Department of Biology, Duke University, Durham, NC, USA

^{3} Division of Biological Sciences, University of California San Diego, La Jolla, CA, USA

^{4} Division of Biology, California Institute of Technology, Pasadena, CA, USA

*BMC Bioinformatics* 2010, **11**:462
doi:10.1186/1471-2105-11-462

### Additional files

**Additional file 1:**

**Proof of relative abundance algorithm by mathematical induction**. PDF file containing a proof, by mathematical induction, of the relative abundance algorithm.

Format: PDF Size: 143KB

**Additional file 2:**

**Effect of sample size on total context bias calculation**. To determine the effect of stochastic variation in pattern frequencies on our context
bias estimates, we calculated total context bias at a variety of sample sizes. We
repeatedly sampled with replacement from our full transposon data set, taking
a total of 5380 samples at 120 sample sizes. Here we plot the median total
context bias at each sample size. For comparison we also include
the no-bias controls. At low sample sizes stochastic effects elevate context bias.
This effect diminishes rapidly with increasing amounts of data.
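
The resampling procedure described for this file can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and `toy_bias` is a stand-in statistic (deviation of base frequencies from uniform), since the total context bias formula is not reproduced in this listing.

```python
import random
import statistics


def toy_bias(sample, categories="ACGT"):
    """Stand-in statistic (an assumption, NOT the paper's total context bias):
    summed absolute deviation of observed frequencies from uniform."""
    n = len(sample)
    expected = 1 / len(categories)
    return sum(abs(sample.count(c) / n - expected) for c in categories)


def median_statistic_by_sample_size(data, statistic, sample_sizes, replicates, seed=0):
    """For each sample size, draw `replicates` samples with replacement
    from `data` and return the median of `statistic` over those samples."""
    rng = random.Random(seed)
    medians = {}
    for n in sample_sizes:
        values = [statistic(rng.choices(data, k=n)) for _ in range(replicates)]
        medians[n] = statistics.median(values)
    return medians


# Data with no real bias: all four bases equally frequent.
data = list("ACGT") * 250
medians = median_statistic_by_sample_size(data, toy_bias, [20, 2000], replicates=50)
for size, value in medians.items():
    print(size, round(value, 3))
```

Even with unbiased data, the stand-in statistic is larger at the small sample size, which mirrors the stochastic elevation described above.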

Format: PDF Size: 1.2MB

**Additional file 3:**

**Table of top context bias values for 2-5 bp single substitution patterns**. We calculated context bias values for all 2-5 bp single-substitution patterns in
our transposon data set, and in 10 corresponding no-bias control data sets. We used
the no-bias controls to determine a p-value for each pattern in the real data. (The
no-bias controls tell us how likely we are to get a score this high or higher if there
were in reality no bias.) We then used the method of Benjamini and Hochberg (1995)
to identify the set of patterns with a false discovery rate of 0.001. Those patterns
are given in this table.
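
The two steps described for this table, an empirical p-value from the no-bias controls followed by the Benjamini-Hochberg procedure, can be sketched as below. The function names are hypothetical, and pooling all control scores into a single null distribution is an assumption; the listing does not say how the 10 control data sets were combined.

```python
from bisect import bisect_left


def empirical_p_values(real_scores, null_scores):
    """p-value for each real score: the fraction of null (no-bias control)
    scores that are this high or higher. Pooling the controls into one
    null list is an assumption of this sketch."""
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    # bisect_left gives the index of the first null score >= s,
    # so n - index counts the null scores that are >= s.
    return [(n - bisect_left(null_sorted, s)) / n for s in real_scores]


def benjamini_hochberg(p_values, fdr):
    """Return the indices of the hypotheses rejected at the given false
    discovery rate using the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank k with p_(k) <= k * fdr / m.
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Toy usage with made-up scores (not data from the paper).
p = empirical_p_values([3.0, 0.5], null_scores=[1.0, 2.0, 3.0, 4.0])
print(p)
print(benjamini_hochberg([0.02, 0.03, 0.035], fdr=0.05))
```

Because the procedure is step-up, a pattern whose p-value misses its own threshold can still be selected when a higher-ranked p-value passes, as in the second usage line.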

Format: CSV Size: 32KB

**Additional file 4:**

**Comparison of context bias after removing CpG-containing patterns**. One possible explanation for observed differences in context bias is that the methylation
process that produces biases at 2 bp is also influenced by context at larger scales.
To address this, we calculated context bias for each data set in Figure 2 while excluding
substitution patterns that include an ancestral CpG. We find that the
effects at 3-5 bp remain, which suggests that bias at these scales does not act
through the rate of cytosine deamination at CpG sites.

Format: PDF Size: 14KB

**Additional file 5:**

**Unweighted total context bias in transposons and non-repetitive sequence**. Differences in total context bias between transposons and non-transposons might
be due to variation in pattern frequencies rather than differences in the substitution
process. To address this, we calculated an unweighted version of eq. 5 across all
single-substitution patterns at each pattern size. To do this we simply replaced the
term f(P) in eq. 5 with the term 1/N, where N is the total number of patterns. With
this new measure, as with total context bias, we find that transposons have more bias
than non-transposon sequence at all sizes.
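
The substitution of 1/N for f(P) can be illustrated with a short sketch. Eq. 5 is not reproduced in this listing, so the code assumes, purely for illustration, that total context bias is a sum over patterns of f(P) times a per-pattern bias value; the function name and the toy numbers are hypothetical.

```python
def total_context_bias(bias, freq=None):
    """Sum per-pattern bias values over all patterns.

    ASSUMPTION: eq. 5 is treated here as sum_P f(P) * bias(P); the actual
    equation is not shown in this listing.
    With `freq` given, each pattern is weighted by its frequency f(P).
    With `freq` omitted, every pattern gets the weight 1/N instead.
    """
    if freq is None:
        n = len(bias)
        return sum(bias.values()) / n
    return sum(freq[p] * bias[p] for p in bias)


# Toy values, not data from the paper.
bias = {"pattern_a": 1.0, "pattern_b": 3.0}
freq = {"pattern_a": 0.75, "pattern_b": 0.25}
print(total_context_bias(bias, freq))  # frequency-weighted
print(total_context_bias(bias))        # unweighted, each pattern counts 1/N
```

In the toy case the frequent pattern has low bias, so the weighted total is smaller than the unweighted one; comparing the two separates the effect of pattern frequency from the effect of the per-pattern bias values.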

Format: PDF Size: 46KB

**Additional file 6:**

**Distribution of context bias differences between human lineage data sets**. We found that total context bias differs between types of sequence, for
example between transposons and non-repetitive sequences. One question is
the origin of this difference. It is not due to
patterns that are unique to one data set or the other. Another question is whether
the difference is due to a few shared patterns or to many. Here we compare
context bias values for shared patterns. For example, in (A) we consider
2 bp patterns. For each pattern we calculate the transposon value minus the non-repetitive
value. We then sort these differences from largest to smallest and plot them by rank. The
y value of this plot is the cumulative context bias difference. The horizontal
line represents the total context bias value for all patterns. As can be seen, most
of the final total context bias value is due to a few patterns that differ greatly
between transposons and non-repetitive sequence. A-D represent transposon vs. non-repetitive
for 2-5 bp, E-H represent near-far for transposon sequences, and I-L represent far-near
for non-repetitive sequences.
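
The y values of these rank plots can be computed as in the following sketch. The function name and the toy inputs are hypothetical, and the code uses the raw per-pattern differences; whether the plotted differences carry any additional weighting is not stated here.

```python
from itertools import accumulate


def cumulative_bias_difference(bias_a, bias_b):
    """For patterns present in both data sets, compute bias_a - bias_b,
    sort the differences from largest to smallest, and return the running
    total at each rank."""
    shared = bias_a.keys() & bias_b.keys()
    diffs = sorted((bias_a[p] - bias_b[p] for p in shared), reverse=True)
    return list(accumulate(diffs))


# Toy values, not data from the paper.
transposon = {"x": 5, "y": 1, "z": 2, "only_in_transposon": 9}
non_repetitive = {"x": 1, "y": 1, "z": 3, "only_in_non_repetitive": 4}
print(cumulative_bias_difference(transposon, non_repetitive))
```

Patterns found in only one data set are ignored, and the last value of the returned list is the total difference over shared patterns. A curve that reaches that total within the first few ranks indicates that a few patterns account for most of the difference.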

Format: PDF Size: 624KB