Radically different pan and core genome sizes cannot be estimated from sampled genomes. (A) Two species with vastly different true gene distributions: (i) Species A (blue) with a pan genome of 10^5 genes and a core genome of 10^3 genes; (ii) Species B (green) with a pan genome of 10^7 genes and a core genome of 10 genes. Each genome has 2000 genes randomly chosen from the true gene distribution according to their frequencies. (B) The number of genes (y-axis) observed as a function of the number of sampled genomes (x-axis). Note that despite the differences in the true distributions, the observed gene distributions are statistically indistinguishable given 100 sampled genomes. For example, approximately 2200 genes were found in just 1 of 100 genomes for both Species A and Species B. (C) Observed pan genome size as a function of the number of sampled genomes. The true pan genome size cannot be extrapolated from the observed pan genome curves. (See Additional file 1, Figure S1 for further details.) (D) Observed core genome size as a function of the number of sampled genomes. The true core genome size cannot be extrapolated from the observed core genome curves.
Kislyuk et al. BMC Genomics 2011 12:32 doi:10.1186/1471-2164-12-32
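The observed pan and core genome curves described in panels (C) and (D) can be illustrated with a minimal sketch. It is not the authors' simulation: the function name simulate_curves, the toy sizes, and the 1/rank accessory gene frequencies are all assumptions chosen for illustration, since the caption does not specify the true frequency distribution. The observed pan genome is taken as the union of all sampled genomes, and the observed core genome as their intersection.

```python
import random
from itertools import accumulate


def simulate_curves(n_core, n_accessory, genome_size, n_genomes, seed=0):
    """Return (pan_curve, core_curve): observed sizes after 1..n_genomes samples.

    Hypothetical model: every genome carries all core genes and fills the
    remaining slots with accessory genes drawn in proportion to 1/rank.
    """
    if not n_core <= genome_size <= n_core + n_accessory:
        raise ValueError("genome_size must lie between n_core and the gene pool size")
    rng = random.Random(seed)
    core = list(range(n_core))
    accessory = list(range(n_core, n_core + n_accessory))
    # Assumed accessory frequencies (not from the source): weight 1/rank.
    cum_weights = list(accumulate(1.0 / r for r in range(1, n_accessory + 1)))

    pan, shared = set(), None
    pan_curve, core_curve = [], []
    for _ in range(n_genomes):
        genome = set(core)
        # Draw accessory genes by frequency until the genome is full.
        while len(genome) < genome_size:
            genome.add(rng.choices(accessory, cum_weights=cum_weights, k=1)[0])
        pan |= genome  # observed pan genome: union of sampled genomes
        # Observed core genome: intersection of sampled genomes.
        shared = set(genome) if shared is None else shared & genome
        pan_curve.append(len(pan))
        core_curve.append(len(shared))
    return pan_curve, core_curve


if __name__ == "__main__":
    # Toy sizes, far smaller than the species described in the caption.
    pan_curve, core_curve = simulate_curves(
        n_core=20, n_accessory=2000, genome_size=200, n_genomes=30
    )
    print("observed pan genome sizes: ", pan_curve)
    print("observed core genome sizes:", core_curve)
```

The observed core curve can only approach the true core size from above, and the observed pan curve can only approach the true pan size from below. Running the sketch with different hypothetical values of n_core and n_accessory is one way to explore how similar the early parts of such curves can look.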