Norwegian School of Veterinary Science, P.O. Box 8146 Dep., N-0033 Oslo, Norway
Center for Biological Sequence Analysis, Technical University of Denmark, DK-2800 Lyngby, Denmark
Abstract
Background
Recently there has been an explosion in the availability of bacterial genomic sequences, making it possible to analyze genomic signatures across more than 800 different bacterial chromosomes from a wide variety of environments.
Using genomic signatures, we compared 867 different genomic DNA sequences pairwise, taken from chromosomes and plasmids more than 100,000 base pairs in length. Hierarchical clustering was performed on the outcome of the comparisons before a multinomial regression model was fitted. The regression model included the cluster groups as the response variable with AT content, phyla, growth temperature, selective pressure, habitat, sequence size, oxygen requirement and pathogenicity as predictors.
Results
Many significant factors were associated with the genomic signature, most notably AT content. Phylum was also an important factor, although considerably less so than AT content. Smaller, though still significant, improvements to the regression model were obtained from sequence size, habitat, growth temperature, selective pressure measured as oligonucleotide usage variance, and oxygen requirement.
Conclusion
The statistics obtained using hierarchical clustering and multinomial regression analysis indicate that the genomic signature is shaped by many factors, and this may explain the varying ability to classify prokaryotic organisms below genus level.
Background
Falling sequencing costs are resulting in an exponentially increasing amount of available genetic data.
In the present work we examine the "genomic signature" of an organism that can be found in an arbitrary fraction of genomic DNA using dinucleotide relative abundance patterns.
Genomic signatures are presumed to be shaped by factors such as DNA structure, restriction and transcription systems, base-stacking energies, replication and repair, and more.
The OUV measure calculates the deviance between genomic oligonucleotide frequencies and approximated oligonucleotide frequencies using the considered oligonucleotide's mononucleotide frequencies. This reflects how genomic oligonucleotide usage is biased compared to what is expected from genomic AT content. In effect, since each considered oligonucleotide frequency is approximated by its corresponding mononucleotide frequencies, complete independence is assumed between the nucleotides in the approximated oligonucleotide. Hence, the OUV measure approximates genomic oligonucleotide frequencies using genomic AT content. Large OUV values are therefore indicative of strong bias or selective pressure, while low OUV values are associated with mutagenesis.
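The idea described above can be sketched in Python. This is a minimal illustration, not the authors' program: the exact OUV formula is not reproduced in this passage, so the mean squared deviation used in `ouv_like` is an assumption, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Overlapping empirical frequencies of all k-mers over A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in kmers)  # words with other symbols are ignored
    return {w: counts[w] / total for w in kmers}


def expected_frequencies(seq, k):
    """k-mer frequencies expected if every position were independent,
    i.e. the product of the mononucleotide frequencies of the word."""
    mono = kmer_frequencies(seq, 1)
    expected = {}
    for p in product("ACGT", repeat=k):
        prob = 1.0
        for base in p:
            prob *= mono[base]
        expected["".join(p)] = prob
    return expected


def ouv_like(seq, k):
    """ASSUMED OUV-style score: mean squared deviation between observed
    and independence-based expected k-mer frequencies."""
    observed = kmer_frequencies(seq, k)
    expected = expected_frequencies(seq, k)
    return sum((observed[w] - expected[w]) ** 2 for w in observed) / len(observed)
```

Under this sketch, a strongly patterned sequence such as a long `AT` repeat scores much higher than a random sequence with the same base composition, which matches the interpretation of large values as biased oligonucleotide usage.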
Additionally, we compared the phylogenetic signal of the genomic signature to factors such as AT content, growth temperature, habitat, and chromosome size. To do this, 867 prokaryotic chromosomes and plasmids larger than 100 kb were compared pairwise. The method of choice was hexanucleotide frequency-based genomic signatures, since that particular method has been found to reflect a stronger phylogenetic signal than both di- and tetranucleotide-based genomic signatures.
Results
Bias in oligonucleotide usage
OUV scores were calculated for observed di-, tetra- and hexanucleotide frequencies for all DNA sequences and fitted to regression models as response variables with genomic AT content as the predictor. The equations resulting from the regression models can be found in Table
Regression models of genomic di-, tetra- and hexanucleotide frequencies and AT content
DNA word size
Regression equations
Coefficient of determination
Significance
Dinucleotides
Y_{2} = \exp(6.42 - 8.64 X_{AT} + 6.59 X_{AT}^{2})
Tetranucleotides
Y_{4} = \exp(8.85 - 14.73 X_{AT} + 12.39 X_{AT}^{2})
Hexanucleotides
Y_{6} = \exp(11.74 - 21.94 X_{AT} + 19.40 X_{AT}^{2})
Pairwise comparisons of genomes using genomic signatures
The prokaryotic DNA sequences compared pairwise using hexanucleotide-based genomic signatures were analyzed using cluster and multinomial regression analysis. Figure
Genomic signature-based cluster diagram. JPG file containing 867 labelled prokaryotic DNA sequences compared pairwise using hexanucleotide-based genomic signatures, and clustered using hierarchical clustering.
Cluster diagram of 867 prokaryotic genomic DNA sequences compared pairwise using hexanucleotide-based genomic signatures. Hierarchical clustering was performed on the resulting 867 × 867 correlation matrix using average linkage and Euclidean distance. The cluster diagram was divided into segments, Groups 1-7, based on the cluster tree, which reflected how the prokaryotic DNA sequences compared pairwise. Lighter colors mean higher correlation scores, and thus closer similarity between the compared genomes. The multicolored horizontal bar on top indicates each chromosome's respective phylum, while the vertical red and blue colored bar shows AT/GC content, where red means GC content larger than 50% and blue AT content larger than 50%. Groups 5 and 7 are mainly populated with free-living, GC-rich prokaryotes with diverse metabolic capabilities. Groups 1 and 3 consist predominantly of AT-rich and host-associated archaea and bacteria, while groups 2 and 6 consist mainly of larger host-associated
Average AT scores and OUV content in cluster groups. The graphs show average AT content (left) and OUV scores (right) on the vertical axis, for each group on the horizontal axis. High OUV scores indicate strong bias in genomic hexanucleotide usage, while low scores imply more random DNA composition. Free-living archaea and bacteria (groups 5 and 7) obtain higher average OUV scores than host-associated ones (groups 1 and 3), indicating pronounced differences in mutational pressures in the respective environments. Average AT content was considerably higher in the host-associated groups than in the free-living groups.
The cluster diagram was divided into seven major groups, named groups 1 to 7, based on the cluster diagram in Figure
Groups 2 and 6 contained larger host-associated bacteria predominantly from the
Groups 5 and 7 contained metabolically diverse and free-living Proteobacteria and Actinobacteria. From Actinobacteria we found genera such as
Group 4 was the smallest of the groups discussed, and contained only twelve genomes. Both average AT content and OUV scores were fairly high compared to the other groups. The group obtained, on average, low correlation scores with the other groups and was therefore treated as a separate group. Members of the group included
The model
The different cluster groups were fitted as a categorical variable to a regression model using the factors: genome size, AT content, OUV, phyla, growth temperature, oxygen requirement, and habitat. In Table
Polychotomous regression model with added predictors to the far left
Model components        Log-likelihood   McFadden R^{2}   ΔAIC   AIC
Model 0: constant       -1534            0                0      3080
Model 1: Size           -1475            0.04             95     2985
Model 2: AT content     -796             0.48             1333   1652
Model 3: OUV            -775             0.49             30     1622
Model 4: Phyla          -433             0.72             455    1167
Model 5: Oxygen req.    -414             0.73             15     1152
Model 6: Habitat        -360             0.77             61     1091
Model 7: Temperature    -320             0.79             56     1035
Final model             -320             0.79             -      1035
The table shows a forward fitting of a set of predictors to the response variable representing the cluster groups.
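The relationship between the columns of the table can be checked with a few lines of Python. This is a sketch rather than part of the original analysis. It uses the standard definition of McFadden's R^{2}, one minus the ratio of the model log-likelihood to the constant-only log-likelihood, and it assumes that the log-likelihoods are negative numbers and that ΔAIC is the drop in AIC from the preceding model.

```python
# Log-likelihoods (entered as negative values) and AIC for Models 0 to 7.
loglik = [-1534, -1475, -796, -775, -433, -414, -360, -320]
aic_values = [3080, 2985, 1652, 1622, 1167, 1152, 1091, 1035]


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden pseudo R^2: 1 - lnL(model) / lnL(constant-only model)."""
    return 1.0 - loglik_model / loglik_null


# Pseudo R^2 of each model relative to Model 0 (constant only).
r2 = [round(mcfadden_r2(ll, loglik[0]), 2) for ll in loglik]

# Improvement in AIC when each predictor is added to the previous model.
delta_aic = [prev - cur for prev, cur in zip(aic_values, aic_values[1:])]

print(r2)         # [0.0, 0.04, 0.48, 0.49, 0.72, 0.73, 0.77, 0.79]
print(delta_aic)  # [95, 1333, 30, 455, 15, 61, 56]
```

Both lists reproduce the tabulated McFadden R^{2} and ΔAIC columns, which shows that AT content accounts for by far the largest single improvement in fit, followed by phyla.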
It should be noted that there is some collinearity between the factors in the regression model. The predicted influences of each factor in Table
Discussion
Selection pressure as measured by OUV
The calculation of OUV gives an indication of how random or biased the occurrences of oligonucleotides are in genomes (See Methods section, as well as
Analysis of the model
The multinomial regression model gives a rough prediction of influences determining similarity with respect to the genomic signature discussed here. Figure
It has been observed
The categorical factors included in the model must be considered as rough, giving only inferential knowledge. This is especially noticeable in the factor describing a genome's habitat, where many host-associated genomes may be found in multiple environments and vice versa.
Table
A model was also created with the addition of a pathogenicity factor. This factor was included since it is assumed that pathogenic bacteria exchange DNA with the surroundings more often than nonpathogenic ones
Analysis of the cluster groups
Figure
By clustering bacteria according to codon usage it was found that genomes grouped according to their respective habitat and lifestyle
Figure
The above examples illustrate that prokaryotic DNA composition, expressed using hexanucleotide-based genomic signatures, can be similar regardless of kinship. The similar DNA composition is, according to our results, a consequence of a collection of factors having acted on the genomes. Thus, genomic analyses of organisms undergoing evolutionary transition between different environments may give many important clues concerning how differences in DNA composition may arise in closely related organisms.
Conclusion
Our results, based on hierarchical clustering and multinomial regression, indicate that genomes compared using genomic signatures are primarily grouped according to AT content. In the model presented, AT content was more strongly associated with the clustered groups than taxonomy. Taxonomy was, in turn, found to be more strongly linked to the clustered groups than the other significant factors. The remaining factors found to significantly affect the regression model were, in order of importance, genome size, habitat, temperature, selection bias (OUV) and oxygen requirement. It can therefore be concluded that the genomic signature in prokaryotes is influenced by many factors which may explain the limited phylogenetic scope below genus level.
Methods
All genomic DNA sequences were obtained from the NCBI genome database
Data file. Excel file containing all 867 prokaryotic chromosomes and plasmids larger than 100 kb along with the corresponding list of genomic properties and phyla.
The computer programs used to generate the results were made according to the explanations given below. The following notation will be used throughout:
Let (
gives the overlapping empirical frequency of the oligonucleotide (
This means that:
The hexanucleotide-based relative abundances can then be calculated as follows:
Where 1 ≤
The genomic signature is then found by comparing two genomic DNA sequences with the Pearson correlation formula:
And
The nucleotides
The following formulas
represent the average hexanucleotide relative abundance values.
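A comparison of this kind can be sketched in Python as follows. The formulas themselves are not legible in this passage, so the sketch assumes the common definition of relative abundance (observed overlapping word frequency divided by the product of its mononucleotide frequencies) and the standard Pearson correlation; the function names are hypothetical and the code is not the authors' program.

```python
from itertools import product
from math import sqrt


def relative_abundances(seq, k=6):
    """ASSUMED definition: observed k-mer frequency divided by the product
    of the mononucleotide frequencies of the bases in the k-mer."""
    seq = seq.upper()
    base_counts = {b: seq.count(b) for b in "ACGT"}
    n_bases = sum(base_counts.values())
    mono = {b: c / n_bases for b, c in base_counts.items()}

    counts, windows = {}, 0
    for i in range(len(seq) - k + 1):  # overlapping windows
        word = seq[i:i + k]
        if all(b in mono for b in word):  # skip words with N etc.
            counts[word] = counts.get(word, 0) + 1
            windows += 1

    rho = []
    for p in product("ACGT", repeat=k):  # fixed word order for all genomes
        word = "".join(p)
        expected = 1.0
        for b in word:
            expected *= mono[b]
        observed = counts.get(word, 0) / windows
        rho.append(observed / expected if expected > 0 else 0.0)
    return rho


def pearson(x, y):
    """Pearson correlation between two equally long abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

Calling `pearson(relative_abundances(a), relative_abundances(b))` for every pair of sequences would fill a symmetric correlation matrix of the kind used in the clustering step.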
Hierarchical clustering based on Euclidean distance was performed on the resulting symmetric 867 × 867 correlation matrix. Average linkage was used to put emphasis on the closest matches based on group similarities.
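The clustering step can be illustrated with a small pure-Python sketch of average-linkage agglomerative clustering. It assumes that the Euclidean distance is taken between rows of the correlation matrix, which is one plausible reading of the description above; the function name and the toy matrix are hypothetical, and the naive loop is only meant for small examples, not for an 867 × 867 matrix.

```python
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(rows, n_clusters):
    """Merge the two clusters with the smallest average pairwise distance
    until n_clusters remain. Returns lists of row indices."""
    n = len(rows)
    # Pairwise Euclidean distances between rows of the matrix.
    d = {(i, j): dist(rows[i], rows[j])
         for i in range(n) for j in range(i + 1, n)}

    def linkage(a, b):
        # Average distance over all pairs with one member in each cluster.
        total = sum(d[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy correlation matrix: sequences 0 and 1 are similar, as are 2 and 3.
corr = [[1.0, 0.9, 0.1, 0.2],
        [0.9, 1.0, 0.2, 0.1],
        [0.1, 0.2, 1.0, 0.8],
        [0.2, 0.1, 0.8, 1.0]]
groups = average_linkage(corr, 2)
print(groups)  # [[0, 1], [2, 3]]
```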
Oligonucleotide usage variance (OUV) can be considered as a measure of oligonucleotide frequency bias, or selection pressure on the genomic DNA composition, and was calculated according to the given formula for each chromosome:
The function
The formula implicitly assumes that each nucleotide in the approximated
Linear regression analysis was performed between OUV for di-, tetra-, and hexanucleotide frequencies (response variable) and genomic AT content (predictor variable) using log transformation.
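A regression of this form can be sketched in Python by fitting the logarithm of the response with ordinary least squares. The sketch assumes a model of the form log(Y) = b0 + b1·X_AT + b2·X_AT², matching the shape of the reported equations; the function names and the coefficients in the usage example are hypothetical and are not the fitted values from the paper.

```python
from math import exp, log


def fit_log_quadratic(x, y):
    """Least-squares fit of log(y) = b0 + b1*x + b2*x**2.
    Returns [b0, b1, b2]."""
    rows = [[1.0, xi, xi * xi] for xi in x]
    target = [log(yi) for yi in y]
    # Normal equations (X^T X) b = X^T log(y) as an augmented 3 x 4 matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * t for r, t in zip(rows, target))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def predict(coeffs, x_at):
    """Back-transform to the original scale: exp(b0 + b1*x + b2*x^2)."""
    b0, b1, b2 = coeffs
    return exp(b0 + b1 * x_at + b2 * x_at * x_at)


# Usage with synthetic data generated from hypothetical coefficients.
at_content = [0.25 + 0.05 * i for i in range(11)]
ouv = [exp(-1.0 - 2.0 * x + 3.0 * x * x) for x in at_content]
coeffs = fit_log_quadratic(at_content, ouv)
```

Fitting on the log scale keeps the problem linear in the coefficients, which is why a plain linear regression is sufficient even though the back-transformed curve is exponential.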
A conditional logistic multinomial (polychotomous) regression model was fitted to assess the individual influences of the predictors: genome size, AT content, OUV, phyla, oxygen requirement, habitat, growth temperature and pathogenicity, with the cluster groups as the response variable. The AIC and McFadden
The response variable "Groups" is a categorical variable consisting of the different cluster groups (see Figure
All regression models were statistically significant with the significance level set to
Authors' contributions
JB planned the project, wrote the computer programs and the manuscript. ES contributed to the statistical analyses and critically revised the manuscript. DU drafted and critically revised the manuscript and analyzed the data. All authors read and approved the final manuscript.
Acknowledgements
Peter F. Hallin and Stein Marvold are thanked for help with the computer programs and the cluster diagram.