Canada Research Chair in Forest and Conservation Genomics and Biotechnology, Canadian Genomics and Conservation Genetics Institute, University of New Brunswick, Faculty of Forestry and Environmental Management, 28 Dineen Drive, Fredericton, NB, E3B 6C2, Canada

Abstract

Background

Sample size is one of the critical factors affecting the accuracy of the estimation of population genetic diversity parameters. Small sample sizes often lead to significant errors in determining the allelic richness, which is one of the most important and commonly used estimators of genetic diversity in populations. Correct estimation of allelic richness in natural populations is challenging since they often do not conform to model assumptions. Here, we introduce a simple and robust approach to estimate the genetic diversity in large natural populations based on the empirical data for finite sample sizes.

Results

We developed a non-linear regression model to infer genetic diversity estimates in large natural populations from finite sample sizes. The allelic richness values predicted by our model were in good agreement with those observed in the simulated data sets and the true allelic richness observed in the source populations. The model has been validated using simulated population genetic data sets with different evolutionary scenarios implied in the simulated populations, as well as large microsatellite and allozyme experimental data sets for four conifer species with contrasting patterns of inherent genetic diversity and mating systems. Our model was a better predictor for allelic richness in natural populations than the widely-used Ewens sampling formula, coalescent approach, and rarefaction algorithm.

Conclusions

Our regression model was capable of accurately estimating allelic richness in natural populations regardless of the species and marker system. This regression modeling approach is free from assumptions and can be widely used for population genetic and conservation applications.

Background

Accurate estimation of genetic diversity parameters in large natural populations using finite sample sizes is one of the central issues in population and conservation genetic studies and applications. Small sample sizes can lead to significant errors in estimating the genetic diversity of the species in question. For effective genetic resource conservation, sufficient allelic richness and a minimum number of carriers for each allele must be present in the conservation population to ensure its self-sufficiency over generations; if these sampling criteria are not met, the entire purpose of the conservation population may be compromised.

Allelic diversity (richness) is one of the most important and commonly used estimators of genetic diversity in populations. It strongly depends on the effective population size and past evolutionary history.

Based on probability theory alone, one can calculate the sample size required to detect alleles with a certain threshold frequency.
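The probability argument above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes that gene copies are sampled independently at random from a very large population, so an allele with frequency `freq` is missed in a sample of `ploidy * n` gene copies with probability `(1 - freq) ** (ploidy * n)`. The function names and the 95% confidence default are hypothetical choices made for this example.

```python
import math


def detection_probability(freq, n_individuals, ploidy=2):
    """Probability that an allele of population frequency `freq` appears
    at least once among `n_individuals` randomly sampled individuals.

    Assumes independent random draws of gene copies (illustrative only).
    """
    genes = ploidy * n_individuals
    return 1.0 - (1.0 - freq) ** genes


def min_sample_size(freq, confidence=0.95, ploidy=2):
    """Smallest number of individuals needed to detect an allele of
    frequency `freq` with probability of at least `confidence`."""
    genes_needed = math.log(1.0 - confidence) / math.log(1.0 - freq)
    return math.ceil(genes_needed / ploidy)


if __name__ == "__main__":
    # An allele at 5% frequency, detected with at least 95% probability.
    print(min_sample_size(0.05, confidence=0.95))
```

Under these assumptions, an allele at 5% frequency needs 30 diploid individuals for a 95% detection probability; the required sample grows steeply as the threshold frequency falls.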

It is rarely possible to know the true number of alleles in a population unless the entire population can be analyzed, in which case the concept of "sample" is not applicable anymore _{e}) found in an ideal population can be approximately described as

where _{e }is the effective population size and _{e }→ ∞, the error in _{e }approaches zero and the parameter _{e}

or, for large

Estimating the parameter _{e}_{e}) can be estimated by the coalescent approach _{e}

Although the Ewens sampling formula and coalescent approach provide theoretical expectations for the allelic richness in a given sample, they normally assume an ideal random mating population of constant size, and without migration and selection. However, natural populations rarely conform to these and other ideal population assumptions. Selection effects may be heterogeneous in time and space, and are extremely difficult to realistically model. Random mating may be hampered by spatial genetic structure and selfing
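For orientation, the standard textbook form of the Ewens sampling formula gives the expected number of distinct alleles in a sample of n gene copies as the sum of θ/(θ + i) for i = 0, ..., n - 1. The sketch below implements that standard form under the ideal-population assumptions listed above; it is not necessarily the exact variant labelled equation (3) in this article, and the value of `theta` used in the example is purely illustrative.

```python
import math


def ewens_expected_alleles(theta, n_genes):
    """Expected number of distinct alleles in a sample of `n_genes` gene
    copies under the standard Ewens sampling formula (ideal, neutral,
    constant-size, randomly mating population)."""
    return sum(theta / (theta + i) for i in range(n_genes))


if __name__ == "__main__":
    theta = 2.0  # illustrative value, not taken from the article
    k100 = ewens_expected_alleles(theta, 100)
    k200 = ewens_expected_alleles(theta, 200)
    # For large samples, doubling n adds roughly theta * ln(2) alleles,
    # which is the logarithmic behaviour discussed in the text.
    print(k200 - k100, theta * math.log(2))
```

The near-constant increment per doubling of the sample size is what makes a logarithmic regression on sample size a natural empirical counterpart.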

Methods

Model development

We investigated several empirical data sets published for a wide variety of plant and animal species to understand the relationship between allelic richness and sample size. From these published data, our own experimental results, and computer simulations, we found that the number of alleles observed in a given sample is approximately proportional to the logarithm of the sample size, and that the logarithm base depends on the species and the marker system used. Based on these observations, we developed a non-linear regression model to predict the observed allelic richness in a given sample. The model could be defined as:

where _{S }depends on the species and the marker set used, and _{n }and _{A }are the regression coefficients for the sample size and allelic richness, respectively, which depend on the species and molecular markers used. As natural logarithms (ln) are commonly used, we replace the logarithm base to

To further simplify (4a), we introduce the variable _{S}, so the equation (4a) can be written as

The coefficients in the regression model (5) can be empirically determined using the modified random resampling procedure and non-linear regression analysis as described below. At large sample sizes, the coefficient _{n }becomes negligible, and the equation (5) can be further simplified as

The empirically derived equation (5a) is similar to the modified Ewens sampling formula in equation (3).
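To make the large-sample behaviour concrete, the sketch below fits a straight line in ln(n) to allelic richness values by ordinary least squares. It rests on an assumption: that the simplified model can be written as richness = intercept + slope × ln(n). The names `slope` and `intercept` are hypothetical stand-ins for the article's coefficients, and the data are synthetic. The full model (5), which includes the sample-size coefficient, requires non-linear regression (the authors used SAS NLIN) and is not reproduced here.

```python
import math


def fit_log_model(sample_sizes, richness):
    """Least-squares fit of richness = intercept + slope * ln(n).

    Returns (slope, intercept). Assumed simplified form, for
    illustration only.
    """
    xs = [math.log(n) for n in sample_sizes]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(richness) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, richness))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_richness(n, slope, intercept):
    """Allelic richness predicted for a sample of size n."""
    return intercept + slope * math.log(n)


if __name__ == "__main__":
    # Synthetic data generated from a known logarithmic curve.
    sizes = [10, 20, 40, 80, 160]
    observed = [1.5 + 2.0 * math.log(n) for n in sizes]
    slope, intercept = fit_log_model(sizes, observed)
    print(slope, intercept, predict_richness(1000, slope, intercept))
```

Because the simplified form is linear in ln(n), no iterative optimisation is needed; extrapolating to a larger sample size is a single function call once the two coefficients are fitted.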

Model validation and comparison with other methods

We tested the regression model (5) using (i) large empirical data sets for four conifer tree species with contrasting population genetic characteristics, and (ii) simulated population genetic data sets created using a Markov chain-based algorithm with different inherent migration and selfing rates.

Empirical data comprised multilocus genotype data sets for four conifer tree species with contrasting mating systems and inherent genetic diversity levels: microsatellite genotype data for eastern white pine -

Allelic richness estimated by regression, coalescent and rarefaction

The columns No. of loci, N and A describe the source data set; the remaining columns give the estimated allelic richness at n = 120.

| Species | ID | No. of loci | N | A | Subsampling (n = 120) | ρ (n = 120) | θ_{Ewens} (n = 120) | θ_{coalescent} (n = 120) | Rarefaction (n = 120) |
|---|---|---|---|---|---|---|---|---|---|
| **Microsatellites** | | | | | | | | | |
| | PR1 | 6 | 180 | 13.00 | 11.06 | 11.04 | 11.98 | 9.23 | 10.68 |
| | PR2 | 6 | 180 | 13.33 | 11.18 | 11.17 | 12.29 | 8.94 | 10.71 |
| | PR3 | 6 | 180 | 15.33 | 12.48 | 12.44 | 14.13 | 11.92 | 12.19 |
| | PR4 | 6 | 180 | 14.83 | 12.48 | 12.44 | 13.67 | 12.13 | 11.92 |
| | PG1 | 6 | 105 | 22.83 | 21.13 | 21.30 | 23.49 | 35.74 | 20.96 |
| | PG2 | 6 | 105 | 22.83 | 20.55 | 20.62 | 23.49 | 51.84 | 20.44 |
| | PS1 | 13 | 102 | 9.77 | 9.03 | 9.13 | 10.11 | 17.57 | 9.03 |
| | PS2 | 13 | 102 | 9.23 | 8.67 | 8.73 | 9.55 | 15.91 | 8.68 |
| | TO1 | 6 | 100 | 7.83 | 7.18 | 7.17 | 8.14 | 12.26 | 7.17 |
| | TO2 | 6 | 100 | 9.67 | 8.95 | 9.00 | 10.05 | 16.28 | 9.09 |
| | TO3 | 6 | 100 | 8.83 | 7.86 | 7.95 | 9.18 | 14.06 | 7.95 |
| **Allozymes** | | | | | | | | | |
| | PS1 | 15 | 95 | 3.20 | 2.97 | 2.98 | 3.38 | 3.34 | 2.93 |
| | PS2 | 15 | 95 | 3.27 | 3.09 | 3.10 | 3.59 | 4.15 | 3.04 |

Subsampling - allelic richness estimated by repeated random subsampling in pseudosimulated population data sets based on the empirical data. θ_{Ewens} - allelic richness predicted by the Ewens sampling formula (3), where θ_{coalescent} - allelic richness predicted by the Ewens sampling formula (3), where

An additional data set for allozyme markers for eastern white pine was also analyzed. Allozymes were extensively used in population and conservation genetic studies before the advent of microsatellite markers. Although other markers, such as RAPD (random amplified polymorphic DNA) and AFLP (amplified fragment length polymorphism), have been used in population genetic studies, they are not well suited for such studies and have fallen out of favour, primarily due to their diallelic and dominant nature. Codominant SNP (single nucleotide polymorphism) markers are also being used in population genetic studies, but most of them suffer from the limitation of being diallelic. Because the objective of the present study was to predict the allelic richness in large populations, we validated our model with microsatellite and allozyme markers, which are codominant and multiallelic.

_{m}) are 0.924, and 0.940, respectively _{m }= 0.595, and 0.635, respectively

The allelic richness estimates predicted by our regression model were compared with the Ewens sampling formula, coalescent approach, and rarefaction algorithm predictions. Since the experimental data sets had only up to 180 individuals per population, pseudo-simulation data sets of ~10,000 individuals per population were created for each of the four conifer species from their empirical genotype data (Table
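The repeated random subsampling used as the empirical benchmark can be sketched as follows. This is a minimal single-locus illustration under stated assumptions: genotypes are represented as tuples of allele labels, individuals are drawn without replacement, and the default of 50 replicates follows the number quoted in the figure captions. The function name and the toy population are hypothetical.

```python
import random


def mean_allelic_richness(genotypes, sample_size, replicates=50, seed=0):
    """Mean number of distinct alleles observed in `replicates` random
    subsamples of `sample_size` individuals drawn without replacement.

    `genotypes` is a list of allele tuples for one locus,
    e.g. ("a", "b") for a heterozygous diploid individual.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        subsample = rng.sample(genotypes, sample_size)
        alleles = {allele for genotype in subsample for allele in genotype}
        total += len(alleles)
    return total / replicates


if __name__ == "__main__":
    # Toy population with one common, one uncommon and one rare allele.
    population = [("a", "a")] * 90 + [("a", "b")] * 9 + [("b", "c")]
    for size in (5, 20, 50, 100):
        print(size, mean_allelic_richness(population, size))
```

Running the subsampling over a range of sample sizes yields the richness-versus-sample-size points to which a regression model can then be fitted.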

The resulting allelic richness values were used to derive the estimates of _{n}, and _{A}) in (5), is provided in the Additional file

**An example of SAS NLIN input and output for estimating the regression coefficients of Equation (5)**.

We also tested the simplified Ewens formula (3) as a predictor for allelic richness. First,

Additionally, to estimate the effects of sample size on the observed genetic diversity and genetic subdivision parameters, we calculated the observed and expected heterozygosity, Shannon information index, and F_{ST }for
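For reference, the diversity measures named above have simple textbook definitions at a single locus. The sketch below computes observed heterozygosity, expected heterozygosity (1 minus the sum of squared allele frequencies, without a small-sample correction) and the Shannon information index (minus the sum of p ln p). It is an assumption-laden illustration for diploid genotype tuples, not the software the authors used, and F_{ST} is omitted because it needs a multi-population framework.

```python
import math
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies at one locus from a list of genotype tuples."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def observed_heterozygosity(genotypes):
    """Proportion of diploid individuals carrying two different alleles."""
    hets = sum(1 for a, b in genotypes if a != b)
    return hets / len(genotypes)


def expected_heterozygosity(freqs):
    """Gene diversity, 1 - sum(p_i^2), without small-sample correction."""
    return 1.0 - sum(p * p for p in freqs.values())


def shannon_index(freqs):
    """Shannon information index, -sum(p_i * ln(p_i))."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)


if __name__ == "__main__":
    sample = [("a", "a"), ("a", "b"), ("b", "b"), ("a", "b")]
    freqs = allele_frequencies(sample)
    print(observed_heterozygosity(sample),
          expected_heterozygosity(freqs),
          shannon_index(freqs))
```

These statistics depend on allele frequencies rather than on the count of distinct alleles, which is why they tend to stabilise at smaller sample sizes than allelic richness does.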

To validate our model, we created 10 artificial data sets each containing 2 populations of 10,000 individuals, with selected combinations of inherent migration and selfing rates, using the Markov chain-based simulation algorithm implemented in the EASYPOP 2.1 program

**Allelic richness estimated by repeated random resampling in simulated population genetic data with various combinations of migration and selfing rates.**

Results and discussion

The allelic richness values estimated by the regression model (5), subsampling of the pseudosimulated data sets, and other methods for four conifer species are provided in Table

**Allelic richness predicted by subsampling and regression modeling for microsatellite data**. PR:

**Allelic richness predicted for one Pinus strobus population (PS1) from allozyme data**. Subsampling - allelic richness estimated by repeated random subsampling of the amplified empirical data set in 50 replicates (95% confidence intervals are provided). Regression - allelic richness predicted by equation (5). Ewens - allelic richness predicted by equation (3),

**Allelic richness predicted for selected populations of four species from microsatellite data**. Subsampling - allelic richness estimated by repeated random subsampling (95% confidence intervals are provided). Regression - allelic richness predicted by equation (5). Ewens - allelic richness predicted by equation (3), **A**: **B**: **C**: **D**:

**Allelic richness predictions for individual populations of all four species based on our regression model (5), Ewens formula and coalescent approach**. The population names are provided in Table

As mentioned above, the empirically derived equation (5a) is similar to the modified Ewens sampling formula (3). Allelic richness estimates predicted by the Ewens formula (3) significantly deviated from the empirical estimates obtained by repeated random subsampling (Figure _{n}, and _{A }would provide correction for possible deviations of the experimental population from the ideal population model.

We also compared allelic richness estimates obtained for the four conifer species using our regression model equation (5) and the rarefaction procedure. Rarefaction estimates were close to the subsampling and regression results obtained for
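The rarefaction estimate referred to here is commonly computed with the standard hypergeometric formula: for a locus with allele counts N_i among N sampled gene copies, the expected number of alleles in a random subsample of g copies is the sum over alleles of 1 - C(N - N_i, g) / C(N, g). The sketch below implements that general formula; whether the software used in the study applies exactly this variant is an assumption, and the function name is hypothetical.

```python
from math import comb


def rarefied_richness(allele_counts, g):
    """Expected number of distinct alleles in a random subsample of `g`
    gene copies, given the observed count of each allele
    (standard hypergeometric rarefaction)."""
    total = sum(allele_counts)
    if g > total:
        raise ValueError("subsample size exceeds the number of gene copies")
    # comb(total - c, g) is 0 when total - c < g, i.e. the allele is
    # certain to be included in the subsample.
    return sum(1.0 - comb(total - c, g) / comb(total, g)
               for c in allele_counts)


if __name__ == "__main__":
    counts = [9, 1]  # one common and one rare allele among 10 gene copies
    print(rarefied_richness(counts, 5))
```

Rarefaction only standardises richness downward to a common, smaller sample size; unlike a fitted regression curve, it cannot extrapolate beyond the number of gene copies actually sampled.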

The proposed regression model developed in the present study has been validated by comparing the allelic richness parameters estimated by using different approaches in large Markov chain simulated populations (Figure

**Allelic richness estimates in the simulated data sets**. Simulation-Real - allelic richness observed in the total simulated data set A created by EASYPOP 2.1. Simulation-Subsampling - allelic richness observed in data set B, created by repeated random subsampling. Regression - allelic richness predicted by equation (5). Coalescent - allelic richness predicted by equation (3),

A valid concern would be that the original sample set used for the subsampling procedure may contain only a fraction of the allelic diversity present in a large natural population. Our results indicate that allelic richness estimates obtained by the model developed here in the amplified data were consistent with that actually observed in the total simulated population. The allelic diversity of various samples drawn from the entire simulated population of 10,000 individuals (data set A) was consistent with that drawn from 50-times pseudo-replicated population of 200 individuals (data set B) (Figure ^{-2}..10^{-4}) alleles in a finite population would require sample sizes close to the entire number of individuals in the population.

It should be noted that existence of spatial genetic structure in a population can affect the observed allelic diversity estimate in a sample. In two of the four studied species, spatial genetic structure up to ~25 meters has been observed (Rajora, unpublished; Pandey and Rajora, submitted). Since the sampling distance normally used for population genetic studies in forest trees (30-50 m) is greater than the observed spatial genetic structure, the latter has little effect on the allelic richness estimates.

The logarithmic nature of the relationship between allelic richness and sample size holds true regardless of the organism and marker system used. In addition to our own data sets for conifer tree species, we observed this relationship in a number of other studies published for various taxa, e.g.

Our approach takes into account possible deviations from the ideal population model that occur in such complex systems as natural forest tree populations, where long-distance gene flow, population bottlenecks, selection, varying mating systems, and overlapping generations are the norm. Another advantage of our model over the coalescent approach is that it does not require substantial computational resources.

The minimum sample size for population genetics and conservation studies has been a hotly debated topic. Although it is usually desired to capture 90-95% of allelic diversity, it is often not feasible, as the true number of alleles in the population is rarely known. A recent study by Gapare and Aitken

For conservation and adaptation studies, rare alleles may be especially important, as they may represent the populations' potential to adapt to changing environmental conditions. Usually, very large sample sizes are suggested for conservation populations

For most population genetic studies, an adequate sample size would be the one that allows for reliable estimation and comparison of genetic diversity and genetic subdivision parameters among populations. The effects of sample size on other observed population genetic parameters (observed and expected heterozygosities, F_{ST}, Shannon diversity index) were illustrated using red spruce (_{ST }(Figure _{ST }values. As discussed above, at large

**Effects of sample size on Shannon diversity index and F_{ST}**. A: Shannon index,

Conclusion

Our non-linear regression model provides a simple and robust approach to estimate the genetic diversity in large natural populations based on empirical data. Since the regression coefficients in our model are derived empirically, and there are no assumptions to violate, it allows for quick and easy estimation of allelic diversity in large natural populations based on finite sample sizes. The model is independent of the marker mutation mode and population history, and works well with both highly selfing and predominantly outcrossing species. It has been validated on simulated data sets, as well as on experimental data for different species and molecular marker systems. In these tests, our model was more accurate, simpler, and more practical than the coalescent or Ewens approach. The proposed method is widely applicable in population genetic studies, and it may provide the missing link for conservation and management decision support.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed equally to the submitted work: SB generated the red spruce source data, developed the equation and drafted the manuscript; MP and OPR provided eastern white cedar, eastern white pine and white spruce empirical data, provided suggestions and revised the manuscript; and OPR is the Principal Investigator of the research program and provided funding and overall guidance and research directions. All authors have read and approved the final manuscript.

Acknowledgements

The research was funded by the Canada Research Chair Program (CRC950-201869) funds and the Natural Sciences and Engineering Research Council of Canada Discovery Grant RGPIN 170651 to O.P. Rajora. S. Bashalkhanov was supported by the University of New Brunswick start up funds provided to O.P. Rajora and a Canadian Forest Service graduate student's supplemental stipend. M. Pandey was financially supported from the Canada Research Chair Program (CRC950-201869) funds to O.P. Rajora. Genotyping for