School of Crystallography, Birkbeck College, University of London, Malet Street, London, WC1E 7HX, UK

Abstract

Background

On a single strand of genomic DNA the number of As is usually about equal to the number of Ts (and similarly for Gs and Cs), but deviations have been noted for transcribed regions and origins of replication.

Results

The mouse genome is shown to have a segmented structure defined by strand bias. Transcription is known to cause a strand bias and numerous analyses are presented to show that the strand bias in question is not caused by transcription. However, these strand bias segments influence the position of genes and their unspliced length. The position of genes within the strand bias structure affects the probability that a gene is switched on and its expression level. Transcription has a highly directional flow within this structure and the peak volume of transcription is around 20 kb from the A-rich/T-rich segment boundary on the T-rich side, directed away from the boundary. The A-rich/T-rich boundaries are SATB1 binding regions, whereas the T-rich/A-rich boundary regions are not.

Conclusion

The direct cause of the strand bias structure may be DNA replication. The strand bias segments represent a further biological feature, the chromatin structure, which in turn influences the ease of transcription.

Background

Because of the Watson-Crick structure of DNA – A paired with T and C with G – the number of As must equal the number of Ts when the bases on both strands are counted. Although this equality does not have to be true for a single strand, Chargaff's second law refers to the equality of A/T and C/G bases on a single strand

Early work on strand bias analysed prokaryote and viral genomes where strand biases have been observed and associated with origins of replication: the leading strand is found to be G-rich and T-rich, with the G-C bias often being found to be more consistent than the A-T bias

Strand bias has been discovered at transcription start sites in plants and fungi

This paper has some similarities with a very recent paper by Huvet

The present work has its origins in a number of peculiarities in the data. Firstly, the strand bias around the transcription start site is highly variable; secondly, an average bias can be seen in the data for hundreds of thousands of bases upstream and downstream of the start site; thirdly, in a large random piece of DNA, say 500 kb (whether or not from a transcribed region), there is a large negative correlation between (A-T) and (G-C)

Although this paper emphasises that the strand bias discussed here is not caused by transcription, the main result is that there is a strand bias structure to the genome and this structure affects the placement of genes and the probability of their expression. This suggests that the strand bias structure also reflects some aspect of the chromatin structure (which in turn makes some positions advantageous for transcription): direct evidence for this is presented.

Results and discussion

Basic statistics about segments

The text gives results for mouse and A:T boundaries. Similar results have been obtained for human but these have not been shown. The C:G results mirror the A:T results in nearly all respects but this has not been fully explored.

The algorithm defined in the Methods section finds 23482 segments, with a median length of 67289 bases (Figure

Basic statistics

| | Actual genome | Hybrid genome | Shuffled genome |
|---|---|---|---|
| **A) Number of segments** | | | |
| segments defined by algorithm | 23482 | 8229 | 2013 |
| **B) Median absolute AT-bias** | | | |
| segments defined by algorithm | 3.68% | 1.96% | 0.54% |
| segments of random position | 1.86% | 0.82% | 0.43% |
| **C) Median absolute AT-skew** | | | |
| segments defined by algorithm | 6.26% | 3.46% | 0.93% |
| segments of random position | 3.21% | 1.41% | 0.73% |

In this table, AT-bias = median absolute (A-T)/(A+C+G+T). In this table, AT-skew = median absolute (A-T)/(A + T). Absolute values are used in this table to give meaning to the results for the random sample. The comparison between the actual and shuffled genomes confirms that the algorithm is finding real features in the real genome. The comparison between the actual and hybrid genome confirms that the average segment in the real genome is of stronger bias than a segment generated by transcription.

**Histogram of lengths between segment boundaries**. (a) All strand bias segments, n = 23482, median = 67289, mean = 109253. Each segment is A+ on one strand and T+ on the other. For comparison the median length of genes is 11622. (b) Distance between consecutive T+/A+ boundaries: n = 11732, median = 160700, mean = 218500, modal value around 100 k. (b) gives an estimate of the size of replicons.

The A+ strand is defined to be the strand with more As than Ts, and the T+ strand is defined similarly. A DNA segment may be called the A+ segment or T+ segment, if it is clear which strand is being referred to. The average AT-bias about all the A+/T+ segment boundaries is shown in Figure

**AT-bias with respect to the segment boundaries**. Both figures show the AT-bias for 50 k bases either side of the boundary, using a moving average over 100 bases. All boundaries in the genome were used in calculating the average. The thickness of the black line shows 95% confidence limits. Comparisons with other features are shown in the following figures. a) A+/T+ boundaries (n = 11750), b) T+/A+ boundaries (n = 11753). The orange line is the mirror image of (a) and is given as a reference line. (b) does not show the shoulder feature of (a).

Segment lengths in the autosomal chromosomes are similar to each other with median segment lengths ranging from 62137 (chromosome 7) to 79967 (chromosome 11): the segments on the sex chromosomes are comparatively short, with median lengths 56283 for the X and 65702 for the Y. There is a relationship between AT-percentage of a segment and its length: the correlation between AT-percentage and log of the length is -0.22. Dividing segments into two according to the median AT-percentage (60%) gives a median length of 56375 for the AT-rich half and 83216 for the AT-poor half. Each segment will be A+ on one strand and T+ on the other. Because of this symmetry our later results are not a consequence of the distribution of length of segments.

Statistical significance

To assess the statistical significance of the method of finding segment boundaries, a shuffled genome was constructed by dividing the mouse genome into 100 base pieces; the pieces for each chromosome were then shuffled separately. This method has the advantage of preserving many qualities of the raw sequence including the base frequency. The same algorithm applied to the shuffled sequence finds only 2013 segments and the average bias has been plotted in Figure
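The shuffling control described above can be sketched as follows. This is a minimal illustration, not the author's code: the function names, the `seed` parameter and the dictionary layout of the chromosomes are assumptions made for the example. Only the 100-base piece length and the per-chromosome shuffling come from the text.

```python
import random


def shuffle_in_pieces(sequence, piece_len=100, seed=0):
    """Cut `sequence` into consecutive pieces of `piece_len` bases and
    return the pieces joined in a random order (the last piece may be shorter)."""
    pieces = [sequence[i:i + piece_len] for i in range(0, len(sequence), piece_len)]
    rng = random.Random(seed)  # fixed seed so the control genome is reproducible
    rng.shuffle(pieces)
    return "".join(pieces)


def shuffle_genome(chromosomes, piece_len=100, seed=0):
    """Shuffle each chromosome separately.

    `chromosomes` is a hypothetical mapping of chromosome name -> sequence string.
    A different seed is used per chromosome so the permutations are independent.
    """
    return {
        name: shuffle_in_pieces(seq, piece_len, seed + k)
        for k, (name, seq) in enumerate(sorted(chromosomes.items()))
    }
```

Because whole pieces are moved rather than single bases, the base frequencies of each chromosome and the sequence content within each 100-base piece are unchanged, while any structure on a larger scale is destroyed.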

**AT-bias at the A+/T+ boundary – comparison with two control genomes**. The black line shows the actual mouse genome (11750 boundaries). a) The red line shows the shuffled genome (1017 boundaries). b) The blue line shows the hybrid genome (4124 boundaries), that is, the genome has been shuffled and then the sequence for the genes has been restored at their original positions. The thickness of all lines shows 95% confidence limits and all lines show moving averages over 100 bases. There is a statistically significant difference between the black and red lines and between the black and blue lines: one-tailed z-test at position 5000 bases downstream of the boundary: a) p < 10^{-50}, n1 = 11750, n2 = 1017, z = 21; b) p < 10^{-50}, n1 = 11750, n2 = 4124, z = 20.

Comparison with transcription associated bias

The segment bias is much larger than that caused by transcription; see Figure

**A comparison of segment bias and transcription bias**. The black line shows the AT-bias about the A+/T+ boundary. The red line shows the AT-bias for genes aligned by their TSS: for this line the x-axis gives the position relative to the TSS. Both lines show moving averages over 100 bases. The thickness of both lines shows 95% confidence limits. There is a statistically significant difference between the two lines: at position 5000 bases downstream of the boundary a two-tailed z-test gives p-value < 10^{-50}, n1 = 11750, n2 = 23941, z = -63.

Another direct test is to compare the average bias about A+/T+ boundaries after dividing the boundaries into two groups: those with no transcription from coding genes recorded in ENSEMBL within 50 kb either side of the boundary, and those with some. Results are shown in Figure

**Strand bias with respect to A+/T+ segment boundaries – comparison with no recorded transcription**. The orange line shows the AT-bias for the entire sample. The red line shows the AT-bias for the boundaries where there is no recorded transcription from coding genes for 50 kb either side of the boundary – this line is based on 4997 segment boundaries. The blue line shows the AT-bias for the boundaries where there is some recorded transcription from coding genes within the range plotted – this line is based on 6753 segment boundaries. The thickness of the red and blue lines shows 95% confidence limits. All lines show moving averages over 100 bases. There is a statistically significant difference between the red and blue lines: at position 5000 bases downstream of the boundary a two-tailed z-test gives p-value < 10^{-50}, n1 = 4997, n2 = 6753, z = -16.

Similar analyses can be made where genes are known to be on one side of the boundary and not the other and with a given direction. These analyses all give the results that the strand bias profile on both sides of the boundary is similar to the original average shown in Figure

Another way of analysing the bias coming from transcription is to remove the bias from everywhere except for the genes. A "hybrid genome" has been constructed by taking one of the shuffled genomes from the previous section and then copying back over this genome the actual sequences of the coding genes from TSS (Transcription Start Site) to TES (Transcription End Site), including both introns and exons, in their real positions. Because of the length of introns, about a third of the real genome is preserved in the hybrid genome. We then ask if the results are consistent with the hypothesis that all the strand bias in these regions comes from transcription and there is no strand bias outside these regions. The algorithm finds about a third as many segments for this genome as for the real one (8229 as against 23482): the algorithm is searching for strand bias on a much larger scale than transcription generates. There is a difference in the profile of strand bias between the real and hybrid genomes, which is confirmed by a statistical test – see Figure
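The construction of the hybrid genome can be illustrated with a short sketch. The function name and the representation of genes as half-open `(start, end)` coordinate pairs on one chromosome are assumptions made for this example; the source does not specify how the construction was implemented.

```python
def build_hybrid(real_seq, shuffled_seq, genes):
    """Return the shuffled sequence with the real sequence restored over each gene.

    `genes` is a hypothetical iterable of (start, end) pairs, 0-based and
    half-open, covering TSS to TES (introns and exons) on this chromosome.
    The pair may be given in either order, so genes on either strand are handled.
    """
    if len(real_seq) != len(shuffled_seq):
        raise ValueError("real and shuffled sequences must have the same length")
    hybrid = list(shuffled_seq)
    for start, end in genes:
        lo, hi = min(start, end), max(start, end)
        hybrid[lo:hi] = real_seq[lo:hi]  # same-length slice: positions are preserved
    return "".join(hybrid)
```

Outside the gene intervals the hybrid keeps only the local (sub-100-base) content of the shuffled control, so any long-range strand bias that survives must lie within the restored genes.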

The next analysis separates the size of the bias caused by transcription from that of the segment bias. Figure

**Bias with respect to A+/T+ segment boundary – estimates for combined effect of transcription**. The brown line shows the AT-bias where there is no transcription. The green line shows the AT-bias where there is transcription in the direction from the A+/T+ boundary to the T+/A+ boundary. The red line shows the bias where there is transcription in the opposite direction. All lines show moving averages over 100 bases.

If transcription were the cause of the segment bias then the amount of transcription would be highest where the bias was highest, that is at the segment boundary. The average amount of transcription relative to the A+/T+ boundary is shown in Figure

**Volume of expression by position with respect to segment boundaries – estimated from expression data**. With respect to (a) A+/T+ boundaries; (b) T+/A+ boundaries. For both graphs, the asymmetric black line shows the volume of transcription along one strand from left to right – transcription with the flow of the segment bias is on the right for (a), and on the left for (b). The peak is about 15 kb to 20 kb downstream/upstream of the segment boundary. Corresponding data for both sides of the boundary have been averaged. The upper red line plots the sum of the amounts on the two strands and the blue line plots their absolute difference. All three lines show moving averages over 100 bases and the thickness of the lines shows 95% confidence limits. For both (a) and (b), a two-tailed z-test shows that the black line at position +5000 bases is statistically different from that at position -5000: a) p < 10^{-50}, n1 ~ 2058, n2 ~ 856, z = 16.5; b) p < 10^{-50}, n1 ~ 2253, n2 ~ 1260, z = 16.6.

Strand bias switch has been found within long vertebrate genes

The lines of argument given above prove that the segment bias is not caused by transcription. It is therefore not a circular question to ask how transcription fits into the structure defined by the segment bias. The next series of analyses addresses this question.

Number of genes by position in strand bias structure

Transcription Start Sites cluster towards the A+/T+ boundary with a bias to the downstream side of the boundary and avoid the T+/A+ boundaries – see Figures

**Position of TSS with respect to the segment boundaries**. The bold black line shows results for real TSSs and the faint red line is a control plot for randomly chosen positions. (a) shows the A+/T+ boundary and (b) shows T+/A+ boundary. TSSs of genes cluster near the A+/T+ boundary and have a tendency to occur downstream of this boundary, but avoid the T+/A+ boundary. The converse applies to the TES – see Figure 9.

The opposite results apply to the Transcription End Site. The TESs are clustered towards the T+/A+ boundary with a bias to be upstream of this boundary and avoid the A+/T+ boundary – see Figures

**Position of TES with respect to the segment boundaries**. The bold black line shows results for real TESs and the faint red line is a control plot for randomly chosen positions. (a) shows the A+/T+ boundary and (b) shows the T+/A+ boundary.

These results show a very strong bias, but few genes run from one kind of boundary to the other. The pattern is more pronounced for genes with CpG islands and for long genes (details not shown). In this context, the length of the gene is the number of bases from the TSS to the TES, that is the length of the raw unspliced mRNA.

To discuss if Figure

Number of genes – comparisons with hybrid genome – upstream versus downstream

| | Genome | Number of TSSs upstream | Number of TSSs downstream | Total | Proportion |
|---|---|---|---|---|---|
| | i | ii | iii | iv = ii + iii | v = iii/iv |
| A | real | 5738 | 10033 | 15771 | 63.6% |
| B | hybrid | 3791 | 6072 | 9863 | 61.6% |
| C | A – B | 1947 | 3961 | 5908 | 67.0% (X) |

Columns ii and iii refer to the number of TSSs within 100 kb of the A+/T+ boundary, upstream or downstream respectively. The figure of 67.0% (X) is significantly different from 50.0% using a binomial distribution approximated by the normal distribution, p < 10^{-50}, n = 5908, z = 26, one tailed test.
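The significance figure quoted for row C can be checked with a few lines of arithmetic. The sketch below uses the normal approximation to the binomial under the null hypothesis of equal numbers upstream and downstream; the function names are illustrative and the counts are those of row C.

```python
import math


def one_sample_proportion_z(successes, n, p0=0.5):
    """z statistic for an observed proportion against a null proportion p0,
    using the normal approximation to the binomial distribution."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    return (p_hat - p0) / standard_error


def upper_tail_p(z):
    """One-tailed p-value P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Row C: 3961 of the 5908 extra TSSs lie downstream of the A+/T+ boundary.
z = one_sample_proportion_z(3961, 5908)
p = upper_tail_p(z)
```

With these counts the statistic comes out at a little over 26, in line with the rounded value z = 26 given above, and the one-tailed p-value is far below 10^{-50}.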

Number of genes – comparisons with hybrid genome – third quarter comparison

| | Genome | Number of TSSs Quarters 1, 2, 4 | Number of TSSs Quarter 3 | Total | Proportion |
|---|---|---|---|---|---|
| | i | ii | iii | iv = ii + iii | v = iii/iv |
| A | real | 8803 | 6968 | 15771 | 44.2% |
| B | hybrid | 5939 | 3924 | 9863 | 39.8% |
| C | A – B | 2864 | 3044 | 5908 | 51.5% (X) |
| D | real (random) | 9362 | 4386 | 13748 | 31.9% (Y) |

Quarter 3 is the region up to 50 kb downstream of the A+/T+ boundary. Quarters 1, 2 and 4 are the other parts of the region between 100 kb upstream and 100 kb downstream of this boundary. Row D refers to the control analysis in which TSSs are replaced by an equal number of random positions (the red line of Figure 9a). The figure of 51.5% (X) is significantly different from the figure 31.9% (Y) using a binomial distribution approximated by the normal distribution and a t-type test, p < 10^{-50}, n1 = 5908, n2 = 13748, z = 25, one tailed test.

**Position of TSS with respect to the A+/T+ boundary – comparison with hybrid genome**. The black line shows results for real TSSs in the real genome and the blue line shows the results for the real TSSs in the hybrid genome. Although the lines have a common feature of a central peak, the line for the real genome is higher. The distribution of the positions of the extra TSSs found by the extra segments in the real genome is statistically significant – see Tables 2 and 3.

Length of genes by position in strand bias structure

Genes starting near the A+/T+ boundary tend to be long and those starting on the T+ segment are much longer than on the A+ segment (Figure

**Median length of gene by position of TSS with respect to the segment boundaries**. The bins have been defined by the quantiles of the distribution within the range plotted. The error bars show 95% confidence ranges using Hettmansperger-Sheather's method. (a) shows the A+/T+ boundary and (b) shows the T+/A+ boundary. a) in each bin,

Gene expression by length of gene

The relationship between the probability that a gene is expressed and the length of the gene is shown in Figure

**Gene expression by length of gene**. The bins for the x-axis are the quantiles of the length distribution for those genes which have expression data. There are 50 bins each containing 2% of the distribution: in each bin

Gene expression by position in strand bias structure

Given these results it is to be expected that the probability that a gene is expressed (and its expression level if expressed) varies with the position of the TSS and TES within the strand bias structure. This is borne out by direct analysis. However, Figure

**Probability of a gene being expressed by position of TSS with respect to the A+/T+ boundary**. The bins have been defined by the quantiles of the distribution within the range plotted: in each bin, n ~ 1152. The error bars show 95% confidence limits. This shows a strong peak near the segment boundary and a long range asymmetry about the boundary.

**Probability of a gene being expressed by length of gene – split by position of TSS with respect to A+/T+ boundary**. The upper red line refers to genes whose TSS is within 5 k bases upstream of the A+/T+ boundary and 15 k bases downstream of this boundary and the lower blue line refers to genes whose TSS falls outside this range. The analysis is based on 2532 genes (red line) and 11291 genes (blue line). This figure explains why genes with TSS near this boundary are often expressed, despite the fact that these genes tend to be long genes (Figure 11a) and long genes tend to be less often expressed (Figure 12a). The plot shows plus and minus one standard error.

The expression level of a gene (if it is expressed) shows a much weaker relationship with position of the TSS (or TES) with respect to the segment boundary. Because of the larger statistical uncertainties, we have reported a comparison between a) the genes which are within a T+ segment (with the flow) and b) those genes within an A+ segment (against the flow). In both cases genes which cross either an A+/T+ boundary or a T+/A+ boundary have been omitted. Most of these excluded genes are extremely long. Results are given in Table

Expression levels on different segments

| segment | sample size | probability expressed | level of expressed gene | average expression | median length |
|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
| T+: | 7784 | 0.56 | 6.03 | 3.75 | 17606 |
| A+: | 4080 | 0.50 | 7.10 | 3.62 | 6145 |

Genes crossing either an A+/T+ or a T+/A+ boundary have been omitted from the sample. Genes on a T+ segment are with the flow of the bias and those on an A+ segment are against the flow. For an individual gene the variable in column 5 is the product of the variables in columns 3 and 4. Columns 4 and 5 are in arbitrary linear units. Genes with the flow of the bias have a higher probability of being expressed (column 3) than those against the flow, but their level of expression when expressed is smaller (column 4). Genes against the flow are much shorter than those with the flow (column 6). The differences between the segment types are statistically significant for columns 3 and 4 using a two-tailed t-test, p values ~ 2 × 10^{-16} and 9 × 10^{-7} respectively. The difference between the segment types in column 5 is not statistically significant.

Proportion of transcription with the flow of the strand bias

The proportion of DNA that is transcribed "with the flow" of the strand bias has been calculated as follows. As a gene may cross several segment boundaries, the number of bases on the T+ strand and the number on the A+ strand were counted for each gene. The number of bases was then totalled by strand. The result is that the number of transcribed bases on the T+ strand is 77% of all transcribed bases. If the number of bases is weighted by the average expression level of the gene then the proportion rises to 82%. If transcription were the cause of the bias one would expect a value close to 100%.
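The calculation just described amounts to a ratio of two totals, optionally weighted by expression. The following sketch shows the arithmetic on made-up numbers; the function name, the tuple layout and the toy values are assumptions for illustration only and are not data from this study.

```python
def proportion_with_flow(genes, weighted=False):
    """Fraction of transcribed bases lying on the T+ strand.

    `genes` is a hypothetical iterable of tuples
    (bases_on_t_plus, bases_on_a_plus, average_expression), one per gene.
    With weighted=True each gene's bases are multiplied by its average
    expression level before totalling.
    """
    with_flow = 0.0
    total = 0.0
    for t_bases, a_bases, expression in genes:
        weight = expression if weighted else 1.0
        with_flow += weight * t_bases
        total += weight * (t_bases + a_bases)
    if total == 0:
        raise ValueError("no transcribed bases to count")
    return with_flow / total


# Toy input: three genes, the first of which crosses a segment boundary.
toy_genes = [(8000, 2000, 5.0), (1000, 4000, 1.0), (6000, 0, 2.0)]
unweighted = proportion_with_flow(toy_genes)
expression_weighted = proportion_with_flow(toy_genes, weighted=True)
```

In this toy input the weighted proportion is higher than the unweighted one because the most highly expressed gene lies mostly on the T+ strand, which is the same direction of change as the 77% to 82% shift reported above.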

Discussion of three previous papers

Touchon

Green

**Histogram of ratio of length of containing segment to gene length**. The median ratio is 16 and there is an apparent boundary to the distribution at a ratio around 1. The interquartile range of the ratio is [4.4, 72]. The x-axis is plotted using logs to base 10. The number of genes in the plot is 23878. The segment bias operates on a larger scale than the transcription bias.

If transcription causes a strand bias, it would be expected that this effect would be roughly proportional to expression level and Majewski

**ACGT-skew for individual genes – long genes**. The y-axis is the ratio ((A+C)-(T+G))/(A+C+G+T) for introns (also excluding the 50 bases at each end of the intron), as used in [16]. The plot is restricted to genes of at least 10 k bases. a) ACGT-skew for individual genes by proportion of times gene is expressed: The plot shows a correlation of -0.39 (n = 8352). Even for seldom expressed genes there is an average negative bias. b) ACGT-skew for individual genes by segment bias predictor: The plot excludes genes which extend more than 100 k bases from both A+/T+ and T+/A+ boundaries, and this explains the gap in the plot. The correlation is 0.43 (n = 9040). The predictor does not use any information about transcription other than the position of the gene with respect to the segment boundaries. In particular, no information about transcription bias is used in calculating the predictor.

In many cases one can get a better predictor of the strand bias of individual genes, merely by using the knowledge of the position of the gene with respect to the segment boundaries defined here. For each base take the nearest A+/T+ or T+/A+ boundary and associate with this base the average AT-bias for that position using the line from Figure

Strand bias and DNA replication

The direct cause of the strand bias observed in this paper is not known, but an appealing theory is that the strand bias comes from the mechanism of DNA replication and the A+/T+ boundaries are origins of replication. There are several reasons to think this may be so: strand asymmetries of this type have been observed at origins of replication in bacterial and viral genomes, i.e. the leading strand is

The finding that 82% of transcription is with the flow of the strand bias adds weight to this suggestion. In almost all prokaryotes studied there is a bias in that the direction of transcription is the same as that of replication

Estimates for the size of replicons (the region of DNA controlled by one origin of replication) fall into two groups: those agreeing with the traditional view that replicons are comparatively small: around 50 kb to 300 kb

Another model for the relationship with DNA replication is that the direct cause of strand bias is transcription, but the placement of genes and direction of transcription is controlled by the need to keep transcription and replication in the same direction. This model has been proposed

Strand bias and chromatin organisation

An explanation in terms of DNA replication does not explain the various relationships that have been observed between the strand bias and the placement of genes, their length, the chance that a gene is switched on and the expression level of genes. All this calls for a unifying explanation, which we suggest is to be found in the physical structure of the chromatin. Similar results (although for much larger domains) led Huvet

**H-rule measure by position with respect to segment boundary**. The black line (line with peak) gives the value of the H-rule measure with respect to the A+/T+ boundary, and the red line with respect to the T+/A+ boundary. The data from both sides of the boundary have been averaged. The thickness of both lines shows 95% confidence limits. The black line has a peak at the boundary but the red line does not. This suggests that the A+/T+ boundary is a region which binds to the nuclear proteins of the matrix, in particular, SATB1. a) Mouse: b) Human (genome assembly NCBI35). For both plots the difference between the lines at the boundary is statistically significant, two-tailed z-test: a) p < 10^{-50}, n1 = 11750, n2 = 11753, z ~ 104: b) p < 10^{-50}, n1 = 12375, n2 = 12369, z ~ 96.

Conclusions

We have shown the mouse genome has a strand bias structure consisting of segments of alternating bias. These segments are much larger than coding genes. These segments influence the placement of genes, their length, the probability that a gene is expressed, and the size of the expression level. These effects are not caused by transcription even though transcription itself causes a strand bias effect. Although the direct cause of the bias may be DNA replication, the strand bias in question represents a further biological structure, such as the spatial organisation of the chromatin. The H-rule analysis gives direct evidence for this proposal.

Methods

Definition of strand bias segments

A region may be mostly T+ but contain an A+ sub-region. This region might be defined to be one T+ segment or one A+ segment and two T+ segments. In order to choose between these possibilities, we use a parameter,

The following equations give a precise description of the method. The exponential weighting factor _{L}[_{R}[

Let

Variables will be defined in pairs with suffix

and the weighted count of the number of A and T bases is defined to be:

The window score in each window is then defined as the average bias:

_{L}[_{L}[_{L}[_{R}[_{R}[_{R}[

A threshold for each window is defined by:

where

Candidate A+/T+ boundaries are then chosen as those positions

_{L}[_{L}[_{R}[i + 1] < -Z_{R}[i + 1]

and candidate T+/A+ boundaries as those positions

_{L}[_{L}[_{R}[i + 1] > Z_{R}[i + 1]

For these positions we define a measure:

_{L}[_{R}[

As a convenience in the computations, if any candidate positions of the same type are within 100 bases of each other we immediately choose the one with the more extreme value of D[i]. The A+/T+ and T+/A+ candidate positions are then ordered by position. For each group of consecutive A+/T+ boundaries the one with the greatest (most positive) value of D[i] is selected and for each group of consecutive T+/A+ boundaries the one with the least (most negative) value of D[i] is chosen. The resulting boundary positions define the strand bias segments.
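The selection step in the preceding paragraph can be sketched as below. The sketch takes the candidate positions and their D[i] values as given and does not attempt to reproduce the window scores or thresholds. The tuple layout, the labels `"AT"` (A+/T+) and `"TA"` (T+/A+), and the reading of "within 100 bases" as a comparison with the last retained candidate are all assumptions made for the example.

```python
from itertools import groupby


def merge_close(candidates, min_gap=100):
    """Among same-type candidates closer than `min_gap`, keep the more extreme D.

    Each candidate is a hypothetical tuple (position, kind, d). "More extreme"
    is taken as the larger absolute value of d.
    """
    kept = []
    for pos, kind, d in sorted(candidates):
        if kept and pos - kept[-1][0] <= min_gap:
            if abs(d) > abs(kept[-1][2]):
                kept[-1] = (pos, kind, d)
        else:
            kept.append((pos, kind, d))
    return kept


def select_boundaries(candidates, min_gap=100):
    """Reduce candidate boundaries to an alternating sequence of boundaries."""
    merged = []
    for kind in ("AT", "TA"):
        merged += merge_close([c for c in candidates if c[1] == kind], min_gap)
    merged.sort()  # order all surviving candidates by position

    boundaries = []
    # Each run of consecutive same-type candidates yields exactly one boundary:
    # greatest D for A+/T+ runs, least D for T+/A+ runs.
    for kind, run in groupby(merged, key=lambda c: c[1]):
        pick = max if kind == "AT" else min
        boundaries.append(pick(run, key=lambda c: c[2]))
    return boundaries
```

For example, `select_boundaries([(1000, "AT", 0.05), (1050, "AT", 0.08), (5000, "AT", 0.03), (9000, "TA", -0.04), (12000, "TA", -0.07), (20000, "AT", 0.06)])` keeps one boundary per run, so the output alternates between the two boundary types and consecutive boundaries delimit one strand bias segment.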

We are interested in large scale effects. The following values of the parameters have been used for the results presented in this paper:

Data sources

Although the expression level of a gene is affected by a large number of variables (age of the organism, the position within the organism, phase of the cell cycle, environmental stress, etc.) and is highly variable, it is useful to consider average expression levels. Three variables have been used: a) the probability of expression (number of experiments in which a gene is expressed divided by number of experiments), b) the average expression level if it is expressed (sum of the gene's expression levels over all experiments divided by number of experiments in which it is expressed), and c) its average expression level (sum of the gene's expression levels divided by total number of experiments): for an individual gene, c is the product of a and b.
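The three expression variables can be written out as a short calculation. In this sketch a gene's data are represented as one value per experiment, with `None` standing for an experiment in which the gene is not expressed; that representation, and the function name, are assumptions for illustration, since how an expressed call was made is not described in this section.

```python
def expression_summary(levels):
    """Return (probability expressed, level if expressed, average level).

    `levels` holds one entry per experiment: a number when the gene is
    expressed in that experiment, or None when it is not (assumed encoding).
    """
    n_experiments = len(levels)
    if n_experiments == 0:
        raise ValueError("at least one experiment is required")
    expressed = [x for x in levels if x is not None]
    total = sum(expressed)

    probability = len(expressed) / n_experiments                       # variable a
    level_if_expressed = total / len(expressed) if expressed else 0.0  # variable b
    average_level = total / n_experiments                              # variable c
    return probability, level_if_expressed, average_level


a, b, c = expression_summary([4.0, None, 8.0, None])
```

For the toy gene above, a = 0.5, b = 6.0 and c = 3.0, so c equals a times b for an individual gene. The same identity does not hold for group averages, which is why the tabulated group means of these variables need not multiply exactly.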

The data for the chromosomal sequence, the list of genes and their TSSs and TESs has been taken from ENSEMBL, which means that for each gene the transcribed unit has been taken to be the union of all alternative transcripts. The analysis includes all protein coding genes but excludes mitochondrial genes.

The mouse analysis is based on sequence assembly NCBIM36 and GEO platform GPL339, where 1744 GSM files had sufficient data to be used. This platform has 22690 probe-sets. Information on mouse genes was taken from ENSEMBL 45.

Abbreviations

AT-bias = (A-T)/(A+C+G+T); AT-skew = (A-T)/(A+T); ACGT-skew = ((A-T)+(C-G))/(A+C+G+T). TSS = Transcription Start Site; TES = Transcription End Site. The A+ strand is the strand with more As than Ts, and T+ strand is defined similarly. A DNA segment may be called the A+ segment or T+ segment, if it is clear which strand is being referred to.
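The three measures defined above translate directly into code. This is a minimal sketch: the function names are illustrative, and the handling of characters other than A, C, G and T (they are ignored) and of empty denominators (NaN is returned) are assumptions not stated in the text.

```python
from collections import Counter


def base_counts(seq):
    """Counts of A, C, G and T on the given strand; other characters are ignored."""
    counts = Counter(seq.upper())
    return counts["A"], counts["C"], counts["G"], counts["T"]


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a window has no countable bases.
    return numerator / denominator if denominator else float("nan")


def at_bias(seq):
    a, c, g, t = base_counts(seq)
    return _ratio(a - t, a + c + g + t)


def at_skew(seq):
    a, _, _, t = base_counts(seq)
    return _ratio(a - t, a + t)


def acgt_skew(seq):
    a, c, g, t = base_counts(seq)
    return _ratio((a - t) + (c - g), a + c + g + t)
```

For the sequence `"AAATCCGGNN"` (three As, one T, two Cs, two Gs), `at_bias` gives 0.25, `at_skew` gives 0.5 and `acgt_skew` gives 0.25. A positive AT-bias marks the strand as A+ over that stretch, and the complementary strand has the same magnitude with the opposite sign.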

Acknowledgements

I am grateful to Sascha Ott of Warwick University and Annika Hansen formerly of University College London for useful discussions, to Birkbeck College for the use of its facilities as an honorary Research Associate and to the Wellcome Trust for payment of the publication fee. I thank the referees for their comments and in particular for their advice on the kind of analysis that would be convincing.