Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, Norway

Centre for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark

Abstract

Background

The size of the core- and pan-genome of bacterial species is a topic of increasing interest due to the growing number of sequenced prokaryote genomes, many from the same species. Attempts to estimate these quantities have been made, using regression methods or mixture models. We extend the latter approach by using statistical ideas developed for capture-recapture problems in ecology and epidemiology.

Results

We estimate core- and pan-genome sizes for 16 different bacterial species. The results reveal a complex dependency structure for most species, manifested as heterogeneous detection probabilities. Estimated pan-genome sizes range from small (around 2600 gene families) in

Conclusion

Analyzing pan-genomics data with binomial mixture models is a way to handle dependencies between genomes, which we find are always present. A bottleneck in the estimation procedure is the annotation of rarely occurring genes.

Background

One of the consequences of the explosion in numbers of fully sequenced and annotated microbial genomes is that we are now facing the challenges of comparative pan-genomics

Having a set of fully sequenced and annotated genomes from several strains within a species, one is interested in two sets of genes. The first is the set of core genes,

The true core- and pan-genome sizes, here denoted

One of the implications of early pan-genome estimates is that some bacterial species might have an "infinite" pan-genome

We will, however, extend the good idea of

Results

Algorithm

Gene families

For a given species ** M **= {

Mixture model

The pan-genome size, ** M **we get the number of genomes in which gene family
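The bookkeeping from pan-matrix to per-family genome counts can be sketched in a few lines of Python. This is only an illustration: the layout (one row per genome, one column per gene family) and the function name `tabulate_family_counts` are assumptions made here, not taken from the authors' R implementation.

```python
from collections import Counter

def tabulate_family_counts(pan_matrix):
    """Return (genomes_per_family, y) for a 0/1 pan-matrix.

    Assumed layout: pan_matrix[i][j] == 1 if gene family j is found in genome i.
    y[g] is the number of gene families found in exactly g genomes, g = 1..G.
    """
    n_genomes = len(pan_matrix)
    # Column sums: in how many genomes does each gene family occur?
    genomes_per_family = [sum(column) for column in zip(*pan_matrix)]
    tally = Counter(genomes_per_family)
    y = {g: tally.get(g, 0) for g in range(1, n_genomes + 1)}
    return genomes_per_family, y

# Toy pan-matrix (made-up data): 3 genomes (rows) by 5 gene families (columns).
toy = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
]
per_family, y = tabulate_family_counts(toy)
```

In the toy example one family is found in all three genomes, two families in two genomes and two families in a single genome. Families found in no genome are by construction absent from the matrix, which is why their number has to be predicted.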

In order to predict y_{0} we need a model that relates y_{0} to y_{1},..., y_{G}. Consider ** y **= (

Using _{0 }if we can estimate _{0}. This estimate can be found by assuming some degree of smoothness across the multinomial probabilities. One way of obtaining this is by using a binomial mixture model. This means we assume

where _{k }is the

is a binomial probability mass function with _{k}. Thus, the multinomial probabilities are expressed as a combination of _{g }are related to each other, and more specifically how _{0 }relates to _{g}, _{k }of being detected (probability of "success") in a genome. If _{k }is low, these genes are typically rarely observed in the sequenced genomes, and vice versa. A binomial mixture like this was also used by

**Mixture model example**. An illustration of a three-component binomial mixture model in which the first component has detection probability 1.0, i.e. the core component. In the upper right panel a second component has a binomial PMF (green) with detection probability 0.85, and in the lower left panel a third component (blue) has detection probability 0.05. The lower right panel shows their combination into 11 multinomial probabilities, using mixing proportions 0.2, 0.1 and 0.7 for the first, second and third component, respectively.
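The numbers in this caption can be reproduced with a short Python sketch. The detection probabilities and mixing proportions are those given in the caption. The number of genomes (10) is inferred here from the 11 multinomial probabilities, and the function names are illustrative only.

```python
from math import comb

def binomial_pmf(g, n_genomes, p):
    """Probability that a gene family is found in exactly g of n_genomes genomes."""
    return comb(n_genomes, g) * p**g * (1.0 - p)**(n_genomes - g)

def mixture_probabilities(n_genomes, mixing, detection):
    """Multinomial probabilities for g = 0..n_genomes under a binomial mixture."""
    return [
        sum(w * binomial_pmf(g, n_genomes, p) for w, p in zip(mixing, detection))
        for g in range(n_genomes + 1)
    ]

# Caption values; 10 genomes is an inference from "11 multinomial probabilities".
theta = mixture_probabilities(
    n_genomes=10,
    mixing=[0.2, 0.1, 0.7],
    detection=[1.0, 0.85, 0.05],
)
```

Here `theta[0]` is the probability that a gene family is seen in none of the genomes (about 0.42 for these values), and `theta[10]` is the probability that it is seen in all of them (about 0.22, dominated by the core component).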

It is natural to reserve one of the mixture components for the class of core genes. Core genes are special, since these genes should always be present in all genomes, and it is natural to assign them detection probability 1.0, as was also done by _{1 }= 1.0.

Estimation

The parameters of the binomial mixture model cannot be estimated directly from ** y**, again because

Considering a fixed _{+ }= (_{1},..., _{G}) is also a multinomial, with probability _{g}_{0}) for element

where _{0},..., _{G }depend on ** π **and

The final part of the estimation procedure is to find the proper number of components _{1},..., _{G}). Since our criterion in (4) is a log-likelihood function for the data, we have adopted the Bayesian Information Criterion (BIC) to select the proper model complexity

is minimized, where (2_{1}.
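The two ingredients of this step, a likelihood conditioned on a gene family being observed at least once and a BIC penalty, can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' procedure: the parameter values are hand-picked instead of optimized, the multinomial coefficient is dropped as a constant, the number of observed gene families is used as the sample size in the penalty, and the number of free parameters is passed in explicitly because its exact form is not recoverable from the text above.

```python
from math import comb, log

def mixture_probabilities(n_genomes, mixing, detection):
    """Multinomial probabilities for g = 0..n_genomes under a binomial mixture."""
    return [
        sum(w * comb(n_genomes, g) * p**g * (1.0 - p)**(n_genomes - g)
            for w, p in zip(mixing, detection))
        for g in range(n_genomes + 1)
    ]

def truncated_log_likelihood(y, mixing, detection):
    """Log-likelihood of y = (y_1, ..., y_G), up to an additive constant.

    Each probability is renormalized by 1 - theta[0], since families seen in
    zero genomes cannot be counted. Needs at least one component with a
    detection probability strictly between 0 and 1.
    """
    n_genomes = len(y)
    theta = mixture_probabilities(n_genomes, mixing, detection)
    seen = 1.0 - theta[0]
    return sum(count * log(theta[g] / seen)
               for g, count in enumerate(y, start=1) if count > 0)

def bic(log_lik, n_free_params, n_observed):
    """Generic BIC: smaller is better."""
    return -2.0 * log_lik + n_free_params * log(n_observed)

# Toy counts (made-up): y_1..y_5 for five genomes.
y = [30, 10, 5, 5, 50]
log_lik = truncated_log_likelihood(y, mixing=[0.5, 0.5], detection=[1.0, 0.2])
# Assumption: 2 free parameters (one mixing proportion, one detection probability).
score = bic(log_lik, n_free_params=2, n_observed=sum(y))
```

In practice the likelihood would be maximized numerically for each candidate number of components, and the fitted model with the smallest criterion value would be kept.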

Once we have determined the proper number of components

where

We have observed that the pan-size estimate may be heavily influenced by the chosen number of components, a generic property discussed by

As an alternative to the binomial mixture model estimate, we have also included the Chao lower-bound estimate

Notice that this corresponds to y_{0} being predicted from y_{1} and y_{2} only.
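In its standard form, the Chao lower bound adds to the observed count the squared number of items seen once, divided by twice the number of items seen twice. A minimal Python sketch, with an illustrative function name and made-up counts, is:

```python
def chao_lower_bound(n_observed, y1, y2):
    """Chao lower bound on the total number of gene families.

    n_observed: gene families seen in at least one genome
    y1: gene families seen in exactly one genome
    y2: gene families seen in exactly two genomes
    """
    if y2 <= 0:
        raise ValueError("the basic form of the estimate needs y2 > 0")
    return n_observed + y1 ** 2 / (2.0 * y2)

# Made-up counts for illustration only.
estimate = chao_lower_bound(n_observed=1000, y1=200, y2=100)
```

With these toy counts the predicted number of unseen gene families is 200, so the lower bound is 1200. No other part of the count distribution enters the estimate.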

Implementation

All computations, including parsing the BLAST results, setting up the pan-matrix and performing all estimations, have been implemented in R

Testing

Estimating core- and pan-sizes

We applied our method to data for 16 different bacterial species, all of which have at least 5 different genomes sequenced and annotated at NCBI

**Genomes and their core- and pan-genomes**. Number of genomes refers to completed genomes at NCBI

Figure

**Core- and pan-genome size estimates**. Observations and estimates of core- and pan-genome sizes. The horizontal axis is on log_{2} scale. Solid blue markers represent the observed data; squares are the core genes, circles are the median number of genes for an individual genome, and the triangles are the total number of gene families found in the data set. The red "+" represents the estimated core size, whilst the red "x" is the estimated size of the pan-genome using the binomial mixture model. The red "c" is the Chao lower-bound estimate of pan-size. The bars represent a 90% naive bootstrap confidence interval for the pan-genome, giving a rough indication of uncertainty.
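A naive bootstrap interval of this kind can be sketched in Python as below. Several choices are assumptions made for illustration: gene families are taken as the resampling unit, the interval is read off as percentiles of the resampled estimates, and the Chao lower bound is used as a cheap stand-in estimator, whereas the intervals in the figure refer to the binomial mixture estimate. The data are made up.

```python
import random
from collections import Counter

def chao_lower_bound(genomes_per_family):
    """Chao lower bound from a list holding, per gene family, its genome count."""
    tally = Counter(genomes_per_family)
    n, y1, y2 = len(genomes_per_family), tally[1], tally[2]
    if y2 == 0:
        # Common bias-corrected fallback when no family is seen exactly twice.
        return n + y1 * (y1 - 1) / 2.0
    return n + y1 ** 2 / (2.0 * y2)

def naive_bootstrap_interval(genomes_per_family, estimator,
                             level=0.90, n_boot=200, seed=1):
    """Percentile interval from resampling gene families with replacement."""
    rng = random.Random(seed)
    n = len(genomes_per_family)
    estimates = sorted(
        estimator(rng.choices(genomes_per_family, k=n)) for _ in range(n_boot)
    )
    k = round((1.0 - level) / 2.0 * n_boot)  # resamples cut from each tail
    return estimates[k], estimates[n_boot - 1 - k]

# Made-up data: 150 gene families found in 1, 2, 3 or 5 genomes.
counts = [1] * 60 + [2] * 30 + [3] * 20 + [5] * 40
point = chao_lower_bound(counts)
low, high = naive_bootstrap_interval(counts, chao_lower_bound)
```

Any function that maps the resampled counts to a pan-genome size could be passed as `estimator`, including a full mixture-model fit, at a correspondingly higher computational cost.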

Distribution of gene families

Figure

**Estimated mixture models**. Graphical display of binomial mixture models. Each rectangle corresponds to a component, its width indicates its mixing proportion and its color indicates its detection probability (see color bar). Red areas indicate parts of the pan-genome with a small detection probability, i.e. rarely occurring genes, whilst regions towards the blue end of the scale represent conserved genes – that is, genes shared by most of the genomes.

Effect of growing data set

For one of the species,

**Effect of growing E. coli data set**. Sample (black) and estimated population (red and blue) pan-genome sizes for

Effect of gene prediction

The use of a mixture model makes it apparent that the estimate of pan-genome size must depend on how many gene families we observe in few genomes. Gene families observed in only one genome are likely to be especially important. These genes are often referred to as ORFans. Upon inspection of the data, we found that the annotation "hypothetical protein" is severely over-represented among the ORFans in all 16 species (Fisher exact test p-values less than 10^{-10}). Thus, false positives from the gene prediction,
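The over-representation test mentioned above can be illustrated with a one-sided Fisher exact test on a 2x2 table (ORFan or not, annotated "hypothetical protein" or not). The Python sketch below is self-contained. The function name and the toy table are illustrative, and it is an assumption that the test in the text was one-sided.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X is hypergeometric: the chance of at least `a`
    counts in the top-left cell given the fixed row and column totals.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, row1)

# Made-up table: rows = ORFan / not ORFan, columns = hypothetical / other.
p_value = fisher_exact_greater(8, 2, 2, 8)
```

For this toy table the p-value is 2126/184756, roughly 0.012. Tables built from thousands of gene families, with a strong excess in the top-left cell, give far smaller values.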

In order to quantify this effect, we made a re-analysis of the

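Effect of gene predictions

| Data set | Observed gene families | ORFans | Chao estimate | Binomial mixture estimate |
|---|---|---|---|---|
| Original NCBI | 12599 | 5438 | 26614 | 42640 |
| Reduced 10% | 11273 | 4470 | 22549 | 32528 |
| Reduced 50% | 9336 | 3272 | 17083 | 27456 |
| Easygene | 9211 | 3121 | 17041 | 29818 |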

The number of observed gene families in data set, the number of ORFans (gene families found in 1 genome only), Chao estimates and binomial mixture estimates of pan-genome size for the original

Discussion

The use of a binomial mixture model for estimating the pan-genome size was introduced by

From our results in Figure

A reason for this heterogeneity in detection probabilities may be skewed sampling. If some of the sequenced genomes are sampled in the same "corner" of the population, the genes characteristic for this "corner" will occur more frequently than they should. Another reason may be that some genes are simply frequently occurring in the population, reflecting a divergence from a fairly recent ancestor. In this perspective, it must be expected that there is a large number of true detection probabilities, which is at least partly supported by the fact that the more genomes we consider the more components we estimate (see Figure

The fact that microbial genomic diversity is caused by both vertical mutations and horizontal transfer also makes it plausible to expect heterogeneous detection probabilities.

From Figure

In Figure

This is due to the fact that larger data sets allow more complex models, and more complex models allow more extreme estimates. Uncertainties, as indicated by the rough confidence intervals, also tend to grow when estimates grow, which is reasonable.

In Figure

From the results in Figure

In Table

Our approach assumes a closed pan-genome, i.e.

A recent publication

Conclusion

We have shown how to use binomial mixture models to estimate microbial core- and pan-genome sizes. The vast literature on capture-recapture methods should be exploited further in microbial pan-genomics, as it has been in closely related fields like metagenomics

Authors' contributions

LS proposed the idea of using capture-recapture methods and did all programming and data analysis. TA contributed to the choice of statistical methods and to how they are presented to a broader audience. DWU formulated the problem and supervised the choice of analyses to conduct. LS and DWU drafted the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

We wish to thank Carsten Friis, Centre for Biological Sequence Analysis, Technical University of Denmark, for his assistance in performing the Easygene predictions.