Norwegian School of Veterinary Science, P.O. Box 8146 Dep., N-0033 Oslo, Norway

National Veterinary Institute, Pb. 750 Sentrum, N-0106 Oslo, Norway

Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, DK-2800 Lyngby, Denmark

Abstract

Background

The genomic fractions of purine (RR) and alternating pyrimidine/purine (YR) stretches of 10 base pairs or more have been linked to genomic AT content, the formation of different DNA helices, strand-biased gene distribution, DNA structure, and more. Although some of these factors are a consequence of the chemical properties of purines and pyrimidines, a thorough statistical examination of the distributions of YR/RR stretches in sequenced prokaryotic chromosomes has, to the best of our knowledge, not been undertaken. The aim of this study is to expand upon previous research by using regression analysis to investigate how AT content, habitat, growth temperature, pathogenicity, phyla, oxygen requirement and halotolerance correlate with the distribution of RR and YR stretches in prokaryotes.

Results

Our results indicate that RR and YR-stretches are differently distributed in prokaryotic phyla. RR stretches are overrepresented in all phyla except for the Actinobacteria and β-Proteobacteria. In contrast, YR tracts are underrepresented in all phyla except for the β-Proteobacterial group. YR-stretches are associated with phylum, pathogenicity and habitat, whilst RR-tracts are associated with phylum, AT content, oxygen requirement, growth temperature and halotolerance. All associations described were statistically significant with

Conclusion

Analysis of chromosomal distributions of RR/YR sequences in prokaryotes reveals a set of associations with environmental factors not observed with mono- and oligonucleotide frequencies. This implies that important information can be found in the distribution of RR/YR stretches that is more difficult to obtain from genomic mono- and oligonucleotide frequencies. The association between pathogenicity and fractions of YR stretches is assumed to be linked to recombination and horizontal transfer.

Background

Frequencies of RR and YR stretches of 10 bp or more have been associated with several genomic and DNA structural features

While RR and YR stretches are short-range correlated in archaea and bacteria, their distribution in eukaryotes is more complex

Results

To measure possible factors influencing the distribution of RR and YR stretches in prokaryotes, two regression models were fitted. For both models, AT content, phylum, oxygen requirement, habitat, temperature, pathogenicity and halotolerance were tested as predictors. Predictors that were not found significant were removed from the model. For the YR model, phyla, habitat and pathogenicity were found significant (R^{2} =


**The graph depicts the genomic distribution of alternating pyrimidine/purine (YR) stretches of 10 bp or more in prokaryotic phyla**. The expected fraction is 0.001 (0.1%). It can be seen that the β-Proteobacterial group has more than the expected fractions of YR-stretches, while all other groups have, on average, less than expected.

YR-stretches regression model

| AIC | Factor | R^{2} | AIC difference |
|------|---------------|------|-----|
| 1134 | Constant | | |
| 1010 | AT content | 0.2 | 124 |
| 782 | Phyla | 0.5 | 228 |
| 755 | Habitat | 0.52 | 27 |
| 743 | Pathogenicity | 0.53 | 12 |

Results from forward fitting a regression model with genomic fractions of YR stretches from 546 chromosomes as response. Factors were added successively and those not found significant (

In Table ^{2}). Phyla was found to be the most important factor followed respectively by AT content, temperature, oxygen requirement, halotolerance and habitat. Our findings indicated that RR stretches were in general overrepresented (See Figure


**The box-plot shows the distribution of genomic purine stretches consisting of 10 bp or more in prokaryotic phyla**. The expected genomic fraction of RR-stretches is 0.001 (0.1%). The Actinobacterial and β-Proteobacterial groups were the only ones found to be underrepresented in terms of genomic RR-stretches.

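RR-stretches regression model

| AIC | Factor | R^{2} | AIC difference |
|------|--------------------|------|-----|
| 1414 | Constant | | |
| 1043 | AT content | 0.49 | 371 |
| 725 | Phyla | 0.73 | 318 |
| 694 | Oxygen requirement | 0.74 | 31 |
| 690 | Habitat | 0.75 | 4 |
| 630 | Growth temperature | 0.77 | 60 |
| 608 | Halotolerance | 0.78 | 22 |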

Results from forward fitting a regression model with genomic fractions of RR stretches from 546 chromosomes as response. Factors were added successively and those not found significant (

It should be noted that models based on the reverse complements of the RR and YR-models,

In Figure ^{2 }= ^{2 }=


**The graph on the left shows genomic AT content (horizontal axis) versus the genomic fractions of alternating pyrimidine/purine stretches of 10 bp (vertical axis)**. The graph on the right shows a similar plot, but for the fraction of purine stretches (10 bp). With all outliers removed it can be seen from the left graph that there is low linear correlation between genomic fractions of YR stretches and AT content (^{2 }= ^{2 }=

To examine possible relations between overrepresentation of YR stretches and pathogenicity, we analyzed the difference between the frequencies of YR stretches in a sliding window and the genome of


**The graph shows a genomic profile of the plant-pathogen Xanthomonas oryzae MAFF 311018 based on the computed differences between the genomic fraction of YR-stretches and the fractions in non-overlapping sliding windows of 5 kbp**. The peaks having a difference above 0.002 (0.02%) were marked and the corresponding genetic regions were BLASTed against GenBank. The hits retrieved from BLAST indicated that all regions were linked to mobile genetic elements associated with recombination and horizontal transfer.
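The window-versus-genome comparison behind this profile can be sketched in Python. This is an illustrative sketch, not the authors' program: it assumes that a YR stretch is a 10-bp window alternating pyrimidine/purine and starting with a pyrimidine, and that a "fraction" is the share of 10-bp windows matching that pattern. The function names are hypothetical; the 5 kbp window size and 0.002 threshold are the values given in the caption.

```python
def is_yr_window(window):
    """True if the window alternates pyrimidine/purine, starting with a pyrimidine."""
    for i, base in enumerate(window):
        allowed = "CT" if i % 2 == 0 else "AG"
        if base not in allowed:
            return False
    return True


def yr_fraction(seq, k=10):
    """Share of k-bp windows in seq that are YR stretches (assumed definition)."""
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return 0.0
    hits = sum(is_yr_window(seq[i:i + k]) for i in range(n_windows))
    return hits / n_windows


def window_profile(genome, window_size=5000, k=10, threshold=0.002):
    """Return (start, difference, flagged) for each non-overlapping window.

    difference = YR fraction in the window minus YR fraction in the whole genome.
    """
    genome = genome.upper()
    genome_fraction = yr_fraction(genome, k)
    profile = []
    for start in range(0, len(genome) - window_size + 1, window_size):
        window = genome[start:start + window_size]
        diff = yr_fraction(window, k) - genome_fraction
        profile.append((start, diff, diff > threshold))
    return profile


# Toy sequence: a 20-bp alternating CA block embedded in poly-A.
toy = "A" * 40 + "CA" * 10 + "A" * 40
toy_profile = window_profile(toy, window_size=20)
flagged_starts = [start for start, _, flagged in toy_profile if flagged]
```

On the toy sequence only the window covering the CA block is flagged. On a real chromosome, the flagged windows would be the candidate regions to inspect further, for example by a BLAST search as described above.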

Discussion

The results above represent a continuation of earlier work

Analyses of the distribution of RR and YR stretches in prokaryotic chromosomes (figures

The

The finding that alternating pyrimidine/purine stretches of 10 bp or more are significantly associated with pathogenicity may indicate that YR tracts are positively correlated with genomic regions in bacteria that are susceptible to recombination or horizontal gene transfers resulting in the acquisition of pathogenicity islands. The fact that YR-stretches are underrepresented in prokaryotic genomes may suggest a counter selection of unstable regions. This is in stark contrast to what is observed in many eukaryotic organisms

Purine stretches are overrepresented in all phyla except for the γ-Proteobacteria, Bacteroidetes/Chlorobi and α-Proteobacteria groups. Actinobacteria and β-Proteobacteria are the only groups found to have a lower than expected fraction of purine stretches. From figures

Both models revealed several important factors associated with the respective distribution of RR and YR stretches. The best model, in terms of R^{2}, was obtained for the distribution of RR stretches. This implies that there may be different factors shaping the distributions of RR and YR stretches in bacterial genomes. This is supported by the regression models, which found different factors significant. While AT content, extreme halotolerance, oxygen requirement, and growth temperature were significant factors in the RR-based regression model, habitat and pathogenicity were found to be significant in the YR-model. The phyla factor was significantly associated with both the RR- and YR-based regression models.

The model explaining RR stretches found oxygen requirement and growth temperature as important and significant factors (

That AT content is an important factor for oligonucleotide frequencies has been noted previously

All regression models suffer from the effect of collinearity. That is, several predictor variables overlap to some extent in terms of explaining the variance in the model. For instance, AT content has been found to correlate with genome size

Overrepresentation of YR stretches in

Conclusion

The regression models varied in terms of goodness of fit/coefficient of determination (R^{2}). The genomic distributions of YR stretches were not as adequately described by the regression model as those of the RR-stretches. This indicates that there are additional factors that remain to be identified for the YR-based regression model. The relatively high coefficient of determination obtained for both RR and YR-based regression models was surprising. It was of great interest to note that temperature was such an important factor in the RR-model, and that pathogenicity was significant in the YR-model.

We assume that the correlation between pathogenicity and YR-stretches is due to an increased tendency of Z-DNA formation in areas overrepresented with YR-stretches. Z-DNA formation has been associated with recombination and genetic rearrangements

Methods

The genomic DNA sequences and information used in the models as factors were downloaded from the NCBI database ^{10 }entry hash table containing the maximal number of occurring stretches. A variant of this program was made to find the difference between non-overlapping sliding windows and genome based frequencies of YR stretches. The program was used to examine overrepresented YR-stretches in

An Excel file containing the dataset used to generate the results in the article.


Chargaff's parity rule ^{10 }or about ~0.001. In other words, it is expected that all possible combinations of 10 bp purine and pyrimidine stretches occur with 0.1% probability. This simple background model assumes that each nucleotide is independent of its nearest neighbor.
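The background model and the counting step can be illustrated with a short Python sketch. It rests on assumptions that are not stated in this form in the text: each base is reduced to one bit (purine = 1, pyrimidine = 0), every 10-bp window is then one of 2^10 = 1024 patterns, the RR pattern is ten purines, the YR pattern alternates starting with a pyrimidine, and a genomic fraction is the share of windows showing that pattern. Under independence and equal purine/pyrimidine frequencies each pattern is expected with probability (1/2)^10, which is about 0.001. The function names are hypothetical.

```python
from collections import Counter

PURINES = set("AG")
PYRIMIDINES = set("CT")
K = 10  # stretch length used throughout the paper


def ry_pattern_counts(seq, k=K):
    """Count every k-bp purine/pyrimidine pattern using a rolling k-bit code.

    A purine contributes a 1 bit and a pyrimidine a 0 bit, so each window maps
    to an integer in [0, 2**k). Windows containing other symbols (e.g. N) are
    skipped.
    """
    counts = Counter()
    mask = (1 << k) - 1
    code = 0
    valid = 0  # consecutive unambiguous bases ending at the current position
    for base in seq.upper():
        if base in PURINES:
            code = ((code << 1) | 1) & mask
        elif base in PYRIMIDINES:
            code = (code << 1) & mask
        else:
            code, valid = 0, 0
            continue
        valid += 1
        if valid >= k:
            counts[code] += 1
    return counts


def stretch_fractions(seq, k=K):
    """Observed RR and YR window fractions plus the expected fraction (1/2)**k."""
    counts = ry_pattern_counts(seq, k)
    total = sum(counts.values())
    rr_code = (1 << k) - 1            # RRRRRRRRRR
    yr_code = int(("01" * k)[:k], 2)  # YRYRYRYRYR

    def fraction(code):
        return counts[code] / total if total else 0.0

    return {"RR": fraction(rr_code), "YR": fraction(yr_code), "expected": 0.5 ** k}
```

A fraction above `expected` would then be read as overrepresentation and a fraction below it as underrepresentation. Under this window-based definition a purine run of 12 bp contributes three RR windows.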

The models were created using regression analysis with RR and YR frequencies as response variables and genome size, AT content, phyla, growth temperature, oxygen requirement, habitat, pathogenicity and halotolerance as predictors. Each response variable was log-transformed to optimize the fitting of residuals to the normal distribution. The following equation was obtained for the modeling of RR-stretches:

while the following equation was used to examine YR-stretches:
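As a sketch only, log-linear models of the kind described can be written with one term per factor listed in the two regression tables. The choice of terms is an assumption taken from those tables, not the authors' stated equations. Here $i$ indexes chromosomes, $\gamma$ denotes the effect of the category that chromosome $i$ belongs to, and the coefficients are estimated separately for each model.

```latex
\begin{align}
\log(\mathrm{RR}_i) &= \beta_0 + \beta_1\,\mathrm{AT}_i
  + \gamma_{\text{phylum}(i)} + \gamma_{\text{oxygen}(i)} + \gamma_{\text{habitat}(i)}
  + \gamma_{\text{temperature}(i)} + \gamma_{\text{halotolerance}(i)} + \varepsilon_i \\
\log(\mathrm{YR}_i) &= \beta_0 + \beta_1\,\mathrm{AT}_i
  + \gamma_{\text{phylum}(i)} + \gamma_{\text{habitat}(i)}
  + \gamma_{\text{pathogenicity}(i)} + \varepsilon_i
\end{align}
```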

Oxygen requirement, habitat, temperature, phyla and halotolerance were categorical factors. Oxygen requirement had the levels aerobic, anaerobic and facultative. Habitat had the levels host-associated, specialized, terrestrial, multiple, and aquatic. Temperature had the levels psychrophilic, mesophilic and thermophilic. Halotolerance had the levels non-halophilic, mesophilic, halophilic and extreme halophilic. Genome size was excluded from both models since it was not found to be significant.
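The forward fitting reported in the regression tables, where factors are added one at a time and the AIC is tracked, can be sketched in plain Python. This is an illustration under assumptions. It uses ordinary least squares on a log-scale response and stops when no remaining factor lowers the AIC, whereas the paper dropped factors on a significance criterion and used R. The data are synthetic, and all names (`fit_rss`, `forward_select`, the toy factors) are hypothetical.

```python
import math
import random


def fit_rss(columns, y):
    """Least-squares fit of y on an intercept plus columns; returns the residual sum of squares."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(p):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def aic(rss, n, n_coef):
    """Gaussian AIC up to an additive constant (+1 parameter for the error variance)."""
    return n * math.log(rss / n) + 2 * (n_coef + 1)


def forward_select(factors, y):
    """factors maps a name to its design columns; returns [(name, AIC)] in order of entry."""
    n = len(y)
    chosen, columns = [], []
    best = aic(fit_rss([], y), n, 1)  # constant-only model
    remaining = dict(factors)
    while remaining:
        scores = {name: aic(fit_rss(columns + cols, y), n,
                            1 + len(columns) + len(cols))
                  for name, cols in remaining.items()}
        name = min(scores, key=scores.get)
        if scores[name] >= best:  # assumed stopping rule: no AIC improvement
            break
        best = scores[name]
        columns = columns + remaining.pop(name)
        chosen.append((name, best))
    return chosen


# Synthetic example: the response depends on AT content and a 0/1 thermophile dummy.
random.seed(1)
n_obs = 200
at = [random.uniform(0.25, 0.75) for _ in range(n_obs)]
thermo = [float(random.random() < 0.3) for _ in range(n_obs)]
unrelated = [random.gauss(0, 1) for _ in range(n_obs)]
log_rr = [-7.0 + 3.0 * a + 0.5 * t + random.gauss(0, 0.1)
          for a, t in zip(at, thermo)]
order = forward_select(
    {"AT content": [at], "thermophilic": [thermo], "unrelated": [unrelated]},
    log_rr,
)
```

A categorical factor with several levels would contribute one 0/1 column per level except a reference level, so that the design matrix stays non-singular. The difference between successive AIC values in `order` corresponds to the "AIC difference" column of the tables.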

To verify how the different predictors affected the pathogenicity factor, a binomial regression model was fitted with the dichotomous pathogenicity factor as the response and AT content, RR-stretches, YR-stretches and habitat as predictors. The factors representing AT content, RR and YR stretches were numeric, while habitat was a categorical variable. Clustering was performed with respect to phyla to correct for intra-correlations within each phylum. The following model was fitted:
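A binomial model with the response and predictors named above can be sketched as follows. The logit link is an assumption (it is the usual default for binomial regression), and whether the RR and YR fractions entered on the raw or the log scale is not specified here. If the clustering by phylum is read as cluster-robust standard errors, it affects the uncertainty of the estimates rather than the mean structure shown.

```latex
\begin{equation}
\operatorname{logit}\bigl(P(\text{pathogenic}_i = 1)\bigr)
  = \beta_0 + \beta_1\,\mathrm{AT}_i + \beta_2\,\mathrm{RR}_i + \beta_3\,\mathrm{YR}_i
  + \gamma_{\text{habitat}(i)}
\end{equation}
```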

Statistical analyses were performed with the program R

Abbreviations

YR-stretches: alternating pyrimidine/purine stretches of 10 bp or more; RR-stretches: purine stretches of 10 bp or more

Authors' contributions

JB wrote the manuscript, carried out statistical analyses and wrote the computer programs. SH critically drafted and revised the manuscript. DWU conceived of the study, performed analyses and critically drafted and revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors wish to thank the referees for their constructive remarks and many helpful suggestions. In addition, Eystein Skjerve is thanked for help with the statistical analyses. JB is funded by the National Veterinary Institute of Norway and the Norwegian School of Veterinary Science. SH is funded by the Norwegian School of Veterinary Science, and DWU is funded by grants from the Danish Research Council.