Leibniz Institute DSMZ – German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
Eberhard-Karls-Universität, Tübingen, Germany
Abstract
Background
For the last 25 years species delimitation in prokaryotes (
Results
Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters.
Conclusions
Despite the high accuracy of
Background
DNA-DNA hybridization (DDH) is a wet-lab method currently still used as the taxonomic gold standard for species delineation in
The DDH technique is currently established in only a few specialized labs (mainly microbial service collections) and, because it is prone to experimental deviation, requires several experimental repetitions to determine the statistical confidence of that experiment. For instance, regarding species delimitation in microbiology, the relevant question is whether or not the DDH value is significantly below or above 70%. This is particularly important in the context of a polyphasic approach, in which the evidence from DDH has to be traded off against other criteria such as phenotypic measurements
The increasing availability of genome sequences thus triggered the development of computational techniques to replace wet-lab DDH
In view of the technical problems and progress, the relation between the wet-lab DDH procedure and the digital estimation of DDH equivalents closely resembles what happened some 30 years ago when DNA:rRNA cross-hybridization melting curves
The Genome Blast Distance Phylogeny approach (
A further use of
The first goal of the present study is to improve DDH estimation from genome-sequence comparisons by using a more comprehensive empirical database and by considering a broader range of numerical data transformations and statistical models. Previous studies were limited to regression models of the untransformed data and thus presupposed a linear relationship between wet-lab DDH and the results of genome-sequence comparisons
The second goal of these examinations is to obtain confidence intervals for in-silico DDH values, an indicator showing taxonomists how uncertain a reported value is, especially if it is close to the 70% boundary. Even though it is safe to assume a priori that digital DDH values display much less variability than wet-lab DDH experiments given the high sequence coverage that can be obtained with state-of-the-art sequencing technology
The third topic of this study is to broaden the range of considered
The results of this study are thus likely to contribute toward progress in using the comprehensive information encoded in entire genomes for the taxonomy of prokaryotes.
Methods
Extended benchmark data set
The DDH benchmark data set was extended relative to previous studies to increase the precision and significance of the ranking of the genome-to-genome distance methods and of the models for the conversion to DDH values. In detail, the data set used here (henceforth called “DS1”) comprised 156 unique genome pairs along with their respective DDH values: 62 from Goris et al.
If several DDH/ANIb/ANIm/Tetra values were present for a single genome pair, they were averaged. A single genome pair showed a DDH value above 100% similarity (i.e., 100.9% between
Empirical data sets. CSV file holding the empirical data sets used in this study. The file can be accessed with spreadsheet programs (e.g., Excel, OpenOffice or LibreOffice) or any given text editor.
To detect significant deviations, if any, between the new and the previous
The GBDP principle, and its technical update
To motivate the upcoming changes such as the addition of support for
The pipeline is primarily subdivided into two phases. First, a genome X is
Overview on the examined GBDP input parameters. PDF file holding a table about all
The resulting matches between both genomes are called high-scoring segment pairs (HSPs) and represent local alignments that are considered statistically significant if the associated expect value (e-value) is sufficiently low
In the second phase, these matches are transformed to a single distance value
The web service at
All distance formulae used by
Each was devised to consider distinct aspects of intergenomic relationships. Formula
However, in practice, at least some HSPs from a
An example of a hypothetical HSP layout between two genomes A and B as produced during the GBDP alignment phase. Subsequences that are part of an HSP in either A or B are labeled with lowercase letters a-g. A special case is represented by segment “c”, where HSP 2 and HSP 3 overlap.
Each vector position which is not covered by
In the case of the coverage vectors n equals the length of the respective genome (i.e., 
Analogously, the number of genome positions covered by HSPs,
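The coverage-vector idea described above can be illustrated with a minimal sketch. It assumes that each HSP is given as a 0-based, half-open interval on one genome; the function names and the example layout are hypothetical and are not taken from the GBDP implementation. The point it shows is that positions shared by overlapping HSPs (such as segment “c”) are counted only once.

```python
from typing import List, Tuple

# Assumption: an HSP on one genome is a 0-based, half-open interval [start, end).
Interval = Tuple[int, int]


def coverage_vector(genome_length: int, hsps: List[Interval]) -> List[int]:
    """Return a 0/1 vector of length n; entry i is 1 if any HSP covers position i."""
    covered = [0] * genome_length
    for start, end in hsps:
        # Clip each interval to the genome boundaries before marking it.
        for pos in range(max(start, 0), min(end, genome_length)):
            covered[pos] = 1
    return covered


def covered_positions(genome_length: int, hsps: List[Interval]) -> int:
    """Number of genome positions covered by at least one HSP."""
    return sum(coverage_vector(genome_length, hsps))


# Hypothetical layout on a genome of length 60: the second and third HSP overlap.
hsps_on_a = [(0, 10), (20, 35), (30, 50)]
n_covered = covered_positions(60, hsps_on_a)        # overlap counted once
summed_lengths = sum(e - s for s, e in hsps_on_a)   # overlap counted twice
```

In this toy layout the summed HSP lengths exceed the number of covered positions, which is the difference a coverage vector is meant to remove.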
Conducting genome comparisons for the correlation analysis
A correlation analysis was conducted to show the overall performance of the
In general, studies of this kind are computationally challenging, because a huge number of input and result files need to be processed. This motivated extending the method so that it can be executed on compute clusters
Analyzing correlations between intergenomic distances and DDH values
According to
For
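As an illustration of the two correlation types reported in this study, the following minimal sketch computes the Pearson coefficient and Kendall's tau for paired distance and DDH values. The tau-b variant (which corrects for ties) is an assumption, as are the function names and the toy values; none of the numbers are taken from the empirical data sets.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall rank correlation (tau-b), computed over all pairs of observations."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )


# Hypothetical toy values: intergenomic distances and wet-lab DDH (in %).
distances = [0.02, 0.05, 0.10, 0.20, 0.40]
ddh = [95.0, 80.0, 68.0, 40.0, 20.0]
r = pearson(distances, ddh)
tau = kendall_tau_b(distances, ddh)
```

Because distances decrease as DDH similarity increases, both coefficients are negative for such data; the rank-based coefficient is insensitive to any monotone transformation of the distances, whereas the Pearson coefficient is not.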
GBDP bootstrapping and jackknifing
To obtain confidence-interval (CI) estimates,
The dependency of the resulting bootstrapping and jackknifing CIs on each genome pair’s original distance (point estimate) was investigated, as well as the effect of the
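The resampling principle can be sketched as follows for the case in which HSPs are the resampled units. This is a simplified illustration under stated assumptions and not the GBDP implementation: the HSP representation, the stand-in distance function (one minus the proportion of identical positions within HSPs), the 50% deletion proportion for jackknifing and the percentile interval are all hypothetical choices.

```python
import random
from typing import List, Tuple

# Assumption: an HSP is summarized as (aligned length, identical positions).
HSP = Tuple[int, int]


def distance(hsps: List[HSP]) -> float:
    """Stand-in distance: 1 - identities / total HSP length (hypothetical)."""
    total_length = sum(length for length, _ in hsps)
    total_identities = sum(identities for _, identities in hsps)
    return 1.0 - total_identities / total_length


def bootstrap_distances(hsps: List[HSP], replicates: int,
                        rng: random.Random) -> List[float]:
    """Resample HSPs with replacement and recompute the distance each time."""
    return [distance([rng.choice(hsps) for _ in hsps]) for _ in range(replicates)]


def jackknife_distances(hsps: List[HSP], replicates: int, rng: random.Random,
                        keep_fraction: float = 0.5) -> List[float]:
    """Resample a subset of HSPs without replacement (deletion proportion assumed)."""
    k = max(1, round(len(hsps) * keep_fraction))
    return [distance(rng.sample(hsps, k)) for _ in range(replicates)]


def percentile_interval(values: List[float],
                        level: float = 0.95) -> Tuple[float, float]:
    """Simple percentile confidence interval from replicate values."""
    ordered = sorted(values)
    lower = ordered[int((1.0 - level) / 2.0 * (len(ordered) - 1))]
    upper = ordered[int((1.0 + level) / 2.0 * (len(ordered) - 1))]
    return lower, upper


rng = random.Random(42)  # fixed seed for reproducibility
example_hsps = [(1000, 950), (500, 430), (2000, 1960), (800, 700), (1200, 1150)]
point_estimate = distance(example_hsps)
boot_ci = percentile_interval(bootstrap_distances(example_hsps, 200, rng))
jack_ci = percentile_interval(jackknife_distances(example_hsps, 200, rng))
```

With few and heterogeneous HSPs the replicate distances vary strongly, which gives an intuition for why resampling HSPs can produce wider intervals than resampling individual genome positions.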
DDH prediction using sophisticated statistical models
The problems caused by linear models (see above) for predicting DDH via intergenomic distances can be solved by more sophisticated statistical models such as generalized linear models (GLMs)
GLMs belong to the parametric modeling techniques and make assumptions about the underlying distribution. For proportional response data, as present here, a binomial distribution is recommended (
To assess whether the fit of the overall model (determined by the model’s residual deviance) could be further improved, a log transformation was applied to the explanatory variable
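A minimal sketch of such a model is given below: a GLM with binomial error distribution fitted by iteratively reweighted least squares, using the log-transformed distance as the single explanatory variable and DDH expressed as a proportion as the response. The logit link, the function names and the toy data are assumptions made for illustration; they are not the fitted models or parameters of this study.

```python
import math
from typing import Sequence, Tuple


def fit_binomial_glm(x: Sequence[float], y: Sequence[float],
                     max_iter: int = 50, tol: float = 1e-10) -> Tuple[float, float]:
    """Fit intercept b0 and slope b1 of a binomial GLM (logit link assumed) by IRLS."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = 1.0 / (1.0 + math.exp(-eta))
            mu = min(max(mu, 1e-10), 1.0 - 1e-10)   # keep weights positive
            w = mu * (1.0 - mu)                      # binomial variance function
            z = eta + (yi - mu) / w                  # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Closed-form solution of the 2x2 weighted least-squares system.
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


def predict_ddh(dist: float, b0: float, b1: float) -> float:
    """Predicted DDH in percent for a given intergenomic distance."""
    return 100.0 / (1.0 + math.exp(-(b0 + b1 * math.log(dist))))


# Hypothetical toy data: distances and wet-lab DDH as proportions.
toy_distances = [0.02, 0.05, 0.10, 0.20, 0.40]
toy_ddh = [0.95, 0.80, 0.68, 0.40, 0.20]
b0, b1 = fit_binomial_glm([math.log(d) for d in toy_distances], toy_ddh)
```

Unlike a linear model, predictions from this model are bounded between 0% and 100% by construction, so no post hoc truncation of predicted DDH values is needed.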
The performance of the model types and data transformations was also assessed by computing error ratios in DDH prediction. For each of the 4350
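For illustration, an error ratio of this kind can be computed as sketched below. The definition used here is an assumption: a prediction counts as an error if the predicted and the wet-lab DDH value fall on opposite sides of the 70% species boundary. The function name and the example values are hypothetical.

```python
from typing import Sequence


def error_ratio(predicted: Sequence[float], observed: Sequence[float],
                threshold: float = 70.0) -> float:
    """Proportion of pairs whose predicted and observed DDH disagree at the threshold."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    mismatches = sum(
        (p >= threshold) != (o >= threshold) for p, o in zip(predicted, observed)
    )
    return mismatches / len(predicted)


# Hypothetical example: two of four genome pairs are assigned to the wrong side.
ratio = error_ratio([72.0, 65.0, 90.0, 40.0], [68.0, 60.0, 95.0, 71.0])
```

Such a measure focuses on the taxonomic decision rather than on the overall fit, which is why it can rank models differently from a correlation coefficient.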
Results
Performance of methods and settings in mimicking wet-lab DDH
Figure
Results of the correlation analyses between GBDP-derived distances and DDH as opposed to the correlations between ANI and DDH. A: The performance of both
Results of the correlation analysis. Spreadsheet in Open Document Format (ODS) that can be accessed via common spreadsheet programs (e.g., Excel, OpenOffice or LibreOffice). The spreadsheet contains several tabs, each one holding the results for the data sets DS1-DS4 (see Materials and Methods).
Juxtaposition of DDH correlation values for best-performing

                  Correlations  Settings
Dataset  Type     Estimate      Alignment tool or method  E-value filter  Algorithm  Formula
DS1      Kendall  0.761         BLAT                      10              Coverage
DS1      Kendall  0.752         BLAST+ (WL46)             10              Coverage
DS1      Kendall  0.677         BLAST+ (WL46)             10              Coverage
DS1      Pearson  0.956         BLAT                      10              Greedy
DS1      Pearson  0.956         BLAT                      10^{−2}         Trimming
DS1      Pearson  0.946         BLAST+ (WL38)             10              Coverage
DS1      Pearson  0.935         BLAST+ (WL46)             10              Coverage
DS2      Kendall  0.763         BLAT                      10              Coverage
DS2      Pearson  0.954         BLAT                      10              Coverage
DS3      Kendall  0.783         BLAST+ (WL38)             any             Coverage
DS3      Kendall  0.717         ANI
DS3      Pearson  0.980         MUMmer (MR20)                             Greedy
DS3      Pearson  0.973         ANI
DS4      Kendall  0.737         BLAT                      10, 10^{−2}     Coverage
DS4      Kendall  0.735         BLAST+ (WL45)             any             Coverage
DS4      Kendall  0.693         Tetra
DS4      Kendall  0.598         ANIb
DS4      Kendall  0.594         ANIm
DS4      Pearson  0.957         BLAT                      10^{−2}         Greedy
DS4      Pearson  0.904         ANIm
DS4      Pearson  0.703         ANIb
DS4      Pearson  0.693         Tetra
The most influential
Input for the multiple linear regression analysis. Spreadsheet in Open Document Format (ODS) that can be accessed via common spreadsheet programs (e.g., Excel, OpenOffice or LibreOffice). The spreadsheet contains the input data required for reproducing the multiple linear regression analysis.
Additional figures. PDF file holding all figures that did not fit in the main manuscript, although these help to further elucidate the study and its results.
Confidence intervals via bootstrapping or jackknifing
The effect of the
Distributions of the median coefficients of variation of intergenomic distances obtained by resampling GBDP. The depicted distributions were determined by grouping the median coefficient of variation (CV) for each setting by either algorithms (left; “greedy”, gr; “greedy-with-trimming”, tr; “coverage”, cov) or formulae (right).
Figure
Juxtaposition of confidence-interval widths for both model-based DDH predictions and those induced by bootstrap replicates. Distances were calculated under the selected well-performing
The relationship between the intergenomic distance and the underlying set of HSPs obtained by comparing the respective pair of genomes is presented in Additional file
Models for DDH prediction and species delineation
Figure
GLM with a binary response variable. The curve depicts the predictions from the model for the selected well-performing
The results for the GLMs using wet-lab DDH values as response variable are shown in Figure
Comparison of generalized linear models and data transformations for DDH prediction. All model fits were based on distances calculated with the selected well-performing
DDH predictions based on intergenomic distances. CSV file holding sample DDH predictions under different statistical models as analyzed in this study. The file can be accessed with spreadsheet programs (e.g., Excel, OpenOffice or LibreOffice) or any given text editor.
In Table
Error ratios under different models and the full empirical data set. The here presented

               Model types
GBDP settings  GLM_{log}  GLM    LM_{AS}  LM
BLAT (         0.045      0.058  0.052    0.052
BLAT (         0.045      0.090  0.097    0.084
BLAST+ (       0.090      0.065  0.071    0.052
BLAST+ (       0.090      0.213  0.187    0.316
BLAST+ (       0.039      0.052  0.052    0.052
Discussion
Bootstrapping and jackknifing GBDP
With only minor differences between bootstrapping and jackknifing, the use of different algorithms had an obvious impact on the CIs of the resulting distances. The full implementation of the “coverage” algorithm allowed its application in connection with distance formulae
That the “greedy” and “greedy-with-trimming” algorithms yielded substantially higher CVs and CIs, as well as an increase of CVs and CIs with decreasing distance (and, thus, increasing DDH similarity), is most likely caused by the fact that here sets of HSPs, not genome positions, are resampled. The observed
For practical purposes this indicates that in conjunction with “coverage”, bootstrapping and jackknifing
Models for DDH prediction
All previous studies
Moreover, the GLMs combined with the log-transformed explanatory variable yielded a higher consistency between the correlation coefficients and the prediction success at the 70% boundary, and the better correlating
The enlarged data set provided a globally increased significance of the inferred results. The comparison of selected
Both theoretical and empirical results thus favor GLMs over standard linear-regression models for obtaining in-silico DDH replacement methods. Their improved DDH prediction capabilities offer
The recommended GBDP method
In principle, multiple optimality criteria could be applied for selecting a
Regarding localalignment programs, only
Beyond pairwise distances
Since the dawn of computer-based approaches to phylogenetics, researchers have tried to devise solutions for assessing the statistical support of the inferred phylogenies
In contrast, distance methods that avoid the construction of a character matrix would need to apply bootstrapping or jackknifing to each pairwise comparison independently. For instance,
Apparently,
For this reason,
Nevertheless, that a single method can be applied to both genome-based species delimitation and phylogenomic inferences at other taxonomic levels, and that it can be coupled with the assessment of statistical significance at one level, already strongly indicates that
Conclusions
This update on the
By introducing (i) bootstrapping and jackknifing to the
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JMK participated in the design of the study, carried out the experiments, performed the (statistical) analysis and wrote the manuscript. AFA contributed software methods to this study and helped carry out the experiments. MG and HPK designed and conceived the study. MG also participated in writing the manuscript. All authors read and approved the final manuscript.
Acknowledgements
Cordial thanks are addressed to Marek Dynowski and Werner Dilling, both Zentrum für Datenverarbeitung, University of Tübingen, for granting access and for their technical support related to the compute clusters of the