Computational Genomics, IBM T J Watson Research, Yorktown, USA

Institute of Evolutionary Biology (CSIC-UPF), Dr. Aiguader 88, 08003 Barcelona, Spain

Abstract

Background

Reconstructability of population history, from genetic information of extant individuals, is studied under a simulation setting. We do not address the issue of accuracy of the reconstruction algorithms: we assume the availability of the theoretical best algorithm. On the other hand, we focus on the fraction (1 -

Results

We observe that the higher the rate of recombination, the lower the value of

Conclusions

We present the first framework for measuring the fraction of the relevant genetic history of a population that is mathematically elusive. Our results on the tested demographies suggest that it may be better to aggregate the analyses of smaller chromosomal segments than to analyze fewer large ones. Also, no matter how rich the sampling of a population, at least one-third of the population genetic history is impenetrable. The framework also opens up new lines of investigation: given the characteristics of a population, possibly derived from observed extant individuals, estimate the (1) optimal sample size and (2) optimal sequence length for the most informative analysis.

Background

Every genetic event that is consequential to the genetic landscape of a population is captured in a topological structure called the Ancestral Recombinations Graph (ARG)

In this paper, we simply use the expected number of nodes in the ARG as a measure of the relevant genetic history of the population. While this may not be precise, it is a fair proxy for the amount of the relevant genetic history. Then a well-defined question to ask is: What is the largest fraction,

In this paper, we seek the value of _{1}, _{2}] with _{2 }≥ _{1}. Then the _{d}

Let tARG denote the true ARG for a given data set with

Simulating the populations

We use COSI

- mutation rate: According to different studies ^{-8 }per base pair per generation (bp/gen for short)

- sequence length: When simulating genetic population data, sequence length is one of the most important factors. While it may not be computationally feasible to simulate a whole chromosome, enough polymorphisms are required in order to get meaningful results.

- sample size: The sample size needs to be large enough to capture important population features.

- recombination rate: The mean recombination rate along the genome in

Based on the above, we used different parameter values, and all possible combinations of them, in order to assess their effects on

Experimental set-up

| Parameters | Values |
|---|---|
| Mutation rate (bp/gen × 10^{-8}) | 0.7, 1.5, 3.0 |
| Sequence length (Kb) | 5, 10, 30, 50, 75, 100, 150, 200 |
| Sample size | 5, 10, 30, 60, 120 |
| Recombination rate (cM/Mb) | 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 2.3, 2.5, 2.8, 3.1, 3.5, 3.9, 4.5, 5.1 |

Each population of the COSI demography has been tested independently, as well as the whole human demography (i.e. all populations together). In total, more than 22800 simulation replicates were generated by combining different values of the four simulation parameters described above. This includes ten replicates for each experiment. For the highest value of sequence length used (i.e. 200 Kb), some experiments were terminated after thirty minutes, since no substantial progress was being made towards their completion. Therefore, the results for the 200 Kb sequence length are not reported in the summary plots.
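The size of the parameter grid can be checked with a short sketch. The variable names below are hypothetical; only the values come from the experimental set-up. The full grid run once with ten replicates accounts for 22800 simulations; how the reported total is distributed across the individual demographies is not spelled out here, so the sketch does not attempt to model it.

```python
from itertools import product

# Parameter values taken from the experimental set-up table.
MUTATION_RATES = [0.7, 1.5, 3.0]                 # x 1e-8 per bp per generation
SEQUENCE_LENGTHS_KB = [5, 10, 30, 50, 75, 100, 150, 200]
SAMPLE_SIZES = [5, 10, 30, 60, 120]
RECOMBINATION_RATES = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9,
                       2.1, 2.3, 2.5, 2.8, 3.1, 3.5, 3.9, 4.5, 5.1]  # cM/Mb
REPLICATES = 10                                  # replicates per experiment


def parameter_grid():
    """Return every combination of the four simulation parameters."""
    return list(product(MUTATION_RATES, SEQUENCE_LENGTHS_KB,
                        SAMPLE_SIZES, RECOMBINATION_RATES))


grid = parameter_grid()
n_experiments = len(grid)                # distinct parameter combinations
n_runs = n_experiments * REPLICATES      # simulations for one pass over the grid
print(n_experiments, n_runs)
```

Each tuple in `grid` is one experiment (mutation rate, sequence length, sample size, recombination rate), to be simulated `REPLICATES` times.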

Method

Recall that an ARG is a phylogenetic structure that encodes both duplication events, such as mutations, and genetic exchange events, such as recombinations; it thus captures the (genetic) dynamics of a population evolving over generations. From a topological point of view, an ARG is always a directed acyclic graph in which each edge is directed toward the more recent generation. An edge is annotated with the mutation events that occur along it, possibly several. Some simulators may give edges with empty labels. Recall that the
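The topological description of an ARG as a directed acyclic graph with mutation-annotated edges can be illustrated with a minimal sketch. The class and method names are hypothetical and are not taken from COSI or from the authors' implementation; the sketch also assumes that a node with two or more incoming edges represents a genetic exchange (recombination) event.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class ARG:
    """Toy ARG: edges point from an ancestor toward the more recent generation."""
    nodes: set = field(default_factory=set)
    # children[u] holds (v, mutations) for every edge u -> v.
    children: dict = field(default_factory=lambda: defaultdict(list))

    def add_edge(self, parent, child, mutations=()):
        """Add an edge annotated with zero or more mutation labels."""
        self.nodes.update((parent, child))
        self.children[parent].append((child, tuple(mutations)))

    def in_degree(self):
        """Count incoming edges (number of parents) for every node."""
        degree = {node: 0 for node in self.nodes}
        for edges in self.children.values():
            for child, _ in edges:
                degree[child] += 1
        return degree

    def is_acyclic(self):
        """Kahn's algorithm: the graph is a DAG iff every node gets processed."""
        degree = self.in_degree()
        queue = deque(node for node, d in degree.items() if d == 0)
        processed = 0
        while queue:
            node = queue.popleft()
            processed += 1
            for child, _ in self.children.get(node, []):
                degree[child] -= 1
                if degree[child] == 0:
                    queue.append(child)
        return processed == len(self.nodes)

    def recombination_nodes(self):
        """Assumption: nodes with two or more parents are genetic exchange nodes."""
        return {node for node, d in self.in_degree().items() if d >= 2}


# A small hand-built example with two extant samples, s1 and s2.
arg = ARG()
arg.add_edge("root", "a", ["m1"])
arg.add_edge("root", "b")                 # edge with an empty label
arg.add_edge("a", "r")
arg.add_edge("b", "r")                    # r has two parents
arg.add_edge("r", "s1", ["m2", "m3"])     # multiple mutations on one edge
arg.add_edge("a", "s2")
print(arg.is_acyclic(), arg.recombination_nodes(), len(arg.nodes))
```

Counting `len(arg.nodes)` corresponds to the node-count proxy for the amount of genetic history used in this paper, here applied to a hand-built toy graph rather than simulator output.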

In

Identifying the estimable fraction (mdARG)

This is done by computing a minimal descriptor from the ARG. The input to this process is the ARG _{1}], [_{1}, _{2}], .., [_{M}_{M }_{1 }<_{2 }< .. <_{M }

**The topology of an ARG**. (a) The topology of an ARG

Recall from

**A simple example using the output of COSI**. The horizontal line corresponds to the age or depth of the node that it intersects. Also, the

Results and discussion

Given the genetic landscape of some extant samples, its underlying ARG is a plausible explanation of the observation, since it is the annotated topological structure that captures the genetic history in its totality, that is relevant to the extant samples. It can also be viewed as a generator that faithfully produces the genetic landscape of the different demographies and since it is a random graph ^{-8 }bp/gen in the figures. The interested reader is directed to the Additional File

**Supplementary Material**.

See Figure _{e}

**The values of N**. The values of (a)

See Figure

**Summary plots of f**. Summary plots of

Conclusions and future directions

Reconstructability of common genetic history is a fundamental curiosity in the study of populations. While the population evolution models mature and the algorithms get more sophisticated, what fraction of the common and relevant genetic history of populations continues to be elusive? We present a framework that enables such an exploration. It is based on a random topological structure, the ARG, and a method-independent (information-theoretic) structure called the minimal descriptor. The framework is applied to different demographies in a simulation setting. The most surprising observation is that the sum of the reconstructible history of each of the chromosomal segments, _{1}, _{2}, ..., _{m}

The framework also opens up possible new directions of investigation. Assume that the characteristics of a population can be derived, say from the linkage disequilibrium landscape and other characteristics of observed extant individuals. Then, can such a generator be used to answer the "best-practice" questions about the population: what are the (1) optimal sample size and (2) optimal sequence length for the most informative analysis?

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FU implemented the estimability algorithm. MP and FU carried out the experiments and the analysis. LP designed the study. LP and FU wrote the paper.

Declarations

The publication costs for this article were funded by the corresponding author's institution.

This article has been published as part of

Acknowledgements

MP carried out this work during an internship at IBM T J Watson Research Center.