Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA, USA

Abstract

Background

In many biological contexts, parameter inference relies on computationally intensive techniques. "Approximate Bayesian Computation" (ABC) methods based on summary statistics have become increasingly popular. One flavor of ABC, which uses a linear regression to approximate the posterior distribution of the parameters conditional on the summary statistics, is computationally appealing, yet no standalone tool exists to automate the procedure. Here, I describe a program that implements the method.

Results

The software package ABCreg implements the local linear-regression approach to ABC. The advantages are: 1. The code is standalone, and fully-documented. 2. The program will automatically process multiple data sets, and create unique output files for each (which may be processed immediately in

Examples of applying the software to empirical data from

Conclusion

In practice, the

Background

In many biological applications, parameter inference for models of interest from data is computationally challenging. Ideally, one would like to infer parameters using either maximum likelihood or Bayesian approaches which explicitly calculate the likelihood of the data given the parameters. While such likelihoods can be calculated for data from non-recombining regions

In the last several years, approximate methods based on summary statistics have gained in popularity. These methods come in several flavors:

1. Simulate a grid over the parameter space in order to calculate the likelihood of the observed summaries, given parameters

2. The maximum-likelihood algorithm can be modified to perform Bayesian inference by simulating parameters from prior distributions, calculating summary statistics, and accepting the parameters if they are "close enough" to the observed

3. Decide ahead of time how many random draws to take from a prior distribution, then accept the fraction of draws which generate summary statistics closest to the data, according to some distance metric. This is the rejection-sampling approach of

4. Take the parameters accepted from Method 3, and regress those acceptances onto the distance between the simulated and observed summary statistics

The latter three methods are all forms of "Approximate Bayesian Computation" (ABC), a term that generally applies to inference problems using summary statistics instead of explicit calculations of likelihoods. The three Bayesian schemes described above are the simplest forms of ABC, and the approach has been extended to use Markov Chain Monte Carlo techniques to explore the parameter space
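The rejection-sampling scheme (Method 3) can be illustrated with a minimal Python sketch. The toy model (a normal distribution with unknown mean), the function names, and the Euclidean distance are assumptions chosen for illustration. They are not taken from any of the cited programs, and the summary statistics are not normalised here.

```python
import math
import random


def rejection_abc(observed, simulate, sample_prior, n_draws, tolerance, rng):
    """Keep the fraction `tolerance` of prior draws whose simulated
    summary statistics lie closest to the observed ones."""
    draws = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        summaries = simulate(theta, rng)
        # Euclidean distance between simulated and observed summaries
        # (an assumed metric; other distance metrics are possible).
        draws.append((math.dist(summaries, observed), theta))
    draws.sort(key=lambda pair: pair[0])
    n_accept = max(1, round(tolerance * n_draws))
    return [theta for _, theta in draws[:n_accept]]


# Hypothetical toy model: infer the mean of a unit-variance normal
# distribution from the sample mean of 20 observations.
def sample_prior(rng):
    return rng.uniform(0.0, 10.0)


def simulate(theta, rng):
    sample = [rng.gauss(theta, 1.0) for _ in range(20)]
    return (sum(sample) / len(sample),)


rng = random.Random(1)
accepted = rejection_abc((4.0,), simulate, sample_prior, 5000, 0.02, rng)
posterior_mean = sum(accepted) / len(accepted)
```

The accepted values form a sample from the approximate posterior. Because the number of prior draws is fixed in advance, the number of acceptances is known before any simulation is run.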

Currently, many tools are available for the rapid development and testing of summary-statistic based approaches to inference, including rapid coalescent simulations for both neutral models

Implementation

The software package is called

The algorithm implemented is identical to that of

1. Transformation of the parameters simulated from the prior distribution. Currently, the program implements both the natural-log transformation used in

where

2. Normalisation of the observed summary statistics and summary statistics simulated from the prior

3. The rejection step based on accepting the closest

4. The regression adjustment

5. Back-transformation of regression-adjusted parameter values and output to files. The program generates one output file per data set in the data file. File names are generated automatically, and the prefix of the file names is controlled by the user. The output files contain tab-delimited columns which are the regression-adjusted parameter values (

Use of the software requires two input files. The first file describes the data (either real or simulated), and contains a space-delimited list of the summary statistics. One can analyze multiple data sets by recording the summary statistics for each data set on a different line of the file. The second input file describes the results of simulating from the prior distribution on the model parameter(s). This "prior file" contains a space-delimited list of the parameters, and the corresponding summary statistics (in the same order as in the data file).
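As an illustration of the two file layouts, the following Python sketch writes and reads a small data file and prior file. It assumes that each line of the prior file holds one draw, with the parameters first and the summary statistics after them. The function names and the example values are hypothetical.

```python
import os
import tempfile


def read_table(path, n_columns):
    """Read a whitespace-delimited file of floats with a fixed column count."""
    rows = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != n_columns:
                raise ValueError(
                    f"{path}, line {line_number}: expected {n_columns} "
                    f"columns, found {len(fields)}"
                )
            rows.append([float(field) for field in fields])
    return rows


def read_inputs(data_path, prior_path, n_params, n_stats):
    """Return (data sets, prior parameters, prior summary statistics)."""
    data = read_table(data_path, n_stats)
    prior = read_table(prior_path, n_params + n_stats)
    params = [row[:n_params] for row in prior]
    stats = [row[n_params:] for row in prior]
    return data, params, stats


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data")
    prior_path = os.path.join(tmp, "prior")
    with open(data_path, "w") as handle:
        # Two data sets, two summary statistics each.
        handle.write("0.5 1.5\n0.7 1.2\n")
    with open(prior_path, "w") as handle:
        # Two prior draws: one parameter, then two summary statistics.
        handle.write("2.0 0.4 1.6\n3.0 0.9 1.1\n")
    data, params, stats = read_inputs(data_path, prior_path, n_params=1, n_stats=2)
```

Checking the column count on every line catches the most likely formatting mistake, a prior file whose statistics do not line up with those in the data file.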

Additional features include a complete debugging mode, which helps identify cases where the linear regression may fail. In practice, the analysis of some data sets may return non-finite parameter values. Often, this is due to the predicted mean value of the regression being quite large, such that back-transformation (+/- the residuals from the regression) results in a value that cannot be represented on the machine. In debug mode, such cases immediately exit with an error. When not in debug mode, the program prints warnings to the screen.
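The five steps listed above, and the way a non-finite value can arise on back-transformation, can be sketched in Python as follows. This is a simplified illustration and not the ABCreg source code. It assumes a single parameter, a single summary statistic, the natural-log transformation, normalisation by the mean and standard deviation of the simulated statistics, and an unweighted least-squares fit. The `debug` flag only loosely mirrors the behaviour described above, and all names are hypothetical.

```python
import math
import random
import warnings


def regression_abc(observed, params, stats, tolerance, debug=False):
    """Regression-adjusted ABC for one parameter and one summary statistic."""
    # 1. Transform the parameters (natural log, so they must be positive).
    y = [math.log(p) for p in params]

    # 2. Normalise simulated and observed statistics by the simulated mean and SD.
    mean = sum(stats) / len(stats)
    sd = math.sqrt(sum((s - mean) ** 2 for s in stats) / (len(stats) - 1))
    z = [(s - mean) / sd for s in stats]
    z_obs = (observed - mean) / sd

    # 3. Rejection: keep the fraction `tolerance` of draws closest to the data.
    order = sorted(range(len(z)), key=lambda i: abs(z[i] - z_obs))
    kept = order[: max(2, round(tolerance * len(z)))]

    # 4. Ordinary least squares of y on (z - z_obs), then shift each accepted
    #    draw along the fitted line to where the statistic equals the data.
    x = [z[i] - z_obs for i in kept]
    y_kept = [y[i] for i in kept]
    x_bar = sum(x) / len(x)
    y_bar = sum(y_kept) / len(y_kept)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y_kept)) / sum(
        (a - x_bar) ** 2 for a in x
    )
    adjusted = [b - slope * a for a, b in zip(x, y_kept)]

    # 5. Back-transform; math.exp raises OverflowError for values that
    #    cannot be represented as a float.
    result = []
    for value in adjusted:
        try:
            result.append(math.exp(value))
        except OverflowError:
            if debug:
                raise
            warnings.warn("non-finite parameter value after back-transformation")
            result.append(math.inf)
    return result


# Hypothetical toy model: the statistic is the parameter plus Gaussian noise.
rng = random.Random(7)
prior_params = [rng.uniform(0.1, 10.0) for _ in range(4000)]
prior_stats = [p + rng.gauss(0.0, 0.5) for p in prior_params]
posterior = regression_abc(3.0, prior_params, prior_stats, tolerance=0.05)
posterior_mean = sum(posterior) / len(posterior)
```

With several summary statistics, step 3 would use a distance over all normalised statistics and step 4 would become a multiple regression.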

Results

In this section, I show results from applying the ABCreg software to the inference scheme of _{r}, the time at which the population recovered from the bottleneck, _{r }and _{0 }generations, where _{0 }is the effective population size at the present time, and _{b}/_{0}, the ratio of the bottlenecked size to the current size (0 <_{0}). See ^{st }and 99^{th }quantiles of the resulting posterior distributions were used as the bounds on a new, uniform prior, and the acceptance criteria were made more strict. Three summary statistics were used: the variances across loci of nucleotide diversity (

I repeated the analysis with the local regression approach, using the same data and uniform priors on parameters (see Table one of _{e}^{6 }draws from the prior distribution on the three parameters, and to record the resulting summary statistics. Simulating from the prior took 24 hours on four 2 GHz AMD Opteron processors. The tolerance was set such that 10^{3} acceptances were recorded for the regression. The model has three parameters, and three summary statistics are used. Once the simulations from the prior distribution were complete, the entire ABC analysis was performed with one command:

**reg -P 3 -S 3 -p prior -d data -b data -t 0.0002 -T**,

where the arguments specify the number of parameters (-P), number of summary statistics (-S), names of files containing the prior (-p) and data (-d), the prefix of the output file names (-b), the tolerance (-t), and -T specifies the transformation described in
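The relationship between the tolerance and the number of acceptances can be checked with a short Python sketch. It assumes that the tolerance is the fraction of prior draws accepted, in which case a tolerance of 0.0002 and 10^{3} acceptances imply a prior file of 5 × 10^{6} draws. The helper functions are hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def tolerance_for(n_accept, n_draws):
    """Tolerance, assumed here to be the accepted fraction of prior draws."""
    return n_accept / n_draws


def reg_command(n_params, n_stats, prior, data, prefix, tolerance):
    """Build the command line in the form shown in the text."""
    args = [
        "reg",
        "-P", str(n_params),
        "-S", str(n_stats),
        "-p", prior,
        "-d", data,
        "-b", prefix,
        "-t", format(tolerance, "g"),
        "-T",
    ]
    return shlex.join(args)


tolerance = tolerance_for(1000, 5_000_000)
command = reg_command(3, 3, "prior", "data", "data", tolerance)
```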

Figure

**Estimation of bottleneck parameters for European populations of Drosophila melanogaster**. The data analyzed are described in

Because the method is quite rapid, the performance of the estimator is easily evaluated. Figure ^{3 }random samples from the prior model used for the inference in Figure

**Performance of the regression ABC estimator of bottleneck parameters**. Parameters were estimated from the modes of posterior distributions from one thousand random samples from the prior model used for inference in Figure 1. Because each data set is a random sample from a distribution of parameters, the distribution of each estimator is divided by the true value, such that the distribution of an unbiased estimator would have a mean of one. A vertical line is placed at the mean of each distribution. The parameters are the same as in Figure 1. As in Figure 1, the tolerance was set to accept 10^{3} draws from the prior, and the tangent transformation was used prior to regression.
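The evaluation summarised in this caption can be sketched in Python as follows. How the posterior modes were computed is not stated here, so the histogram-based mode below is an assumption. The simulated "posteriors" are hypothetical stand-ins (Gaussian samples centred on the true value) and are not output from the program.

```python
import random


def histogram_mode(values, n_bins=20):
    """Estimate the mode as the centre of the fullest histogram bin."""
    low, high = min(values), max(values)
    if low == high:
        return low
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value falls on the upper edge; place it in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    fullest = counts.index(max(counts))
    return low + (fullest + 0.5) * width


def relative_estimates(posteriors, true_values):
    """Divide each posterior mode by the true value that generated the data."""
    return [histogram_mode(p) / t for p, t in zip(posteriors, true_values)]


# Hypothetical stand-in for regression-adjusted output: 200 replicates,
# each a posterior sample scattered around its own true parameter value.
rng = random.Random(3)
true_values = [rng.uniform(1.0, 5.0) for _ in range(200)]
posteriors = [[rng.gauss(t, 0.1 * t) for _ in range(500)] for t in true_values]
ratios = relative_estimates(posteriors, true_values)
mean_ratio = sum(ratios) / len(ratios)
```

A mean ratio close to one indicates an approximately unbiased estimator, and the spread of the ratios shows its precision.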

Conclusion

The linear regression approach to ABC analysis

Availability and requirements

The source code is distributed under the terms of the GNU public license and is available from the software section of the author's web site ^{++}

Authors' contributions

The author implemented and tested the code, and wrote the paper.