Computer Science and Artificial Intelligence Laboratory, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA

Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA

Abstract

Background

Modern genetics has been transformed by high-throughput sequencing. New experimental designs in model organisms involve analyzing many individuals, pooled and sequenced in groups for increased efficiency. However, the uncertainty from pooling and the challenge of noisy sequencing data demand advanced computational methods.

Results

We present M

Conclusions

Our increased information sharing and principled inclusion of relevant error sources improve resolution and accuracy when compared to existing methods, localizing associations to single genes in several cases. M

Background

Advances in high-throughput DNA sequencing have created new avenues of attack for classical genetics problems. A robust method for determining the genetic elements that underlie a phenotype is to gather and group individuals of different phenotypes, interrogate the genome sequences of each group, and identify elements that are present in different proportions between the groups. We describe M

Targeted experiments

We focus on model organism experiments where two strains are crossed and the progeny are grouped and pooled according to phenotype. We describe and model experiments for haploid organisms that are hybrids between two strains, but we note that the models we develop should generalize to more sophisticated crosses or diploid organisms. When two strains vary in a phenotype, analyzing progeny with extreme phenotypes should elucidate the genetic basis of the trait. The main idea is that polymorphic loci that do not affect the phenotype will segregate with approximately equal frequency in the progeny (regardless of phenotype), while loci that influence the trait will be enriched in opposite directions in the extreme individuals, according to the effect size of each locus. This approach assumes that the causal loci have sufficiently strong main effects to be detectable via any type of pooled analysis. This pooled study design is also referred to as "bulk segregant analysis"

**Experimental design example**. Strains are crossed and hybrid progeny are collected. The progeny are grouped by phenotype and the pooled DNA of each group is subjected to high-throughput DNA sequencing. Loci that affect the phenotype show an enrichment for one strain in each pool, while other unlinked loci segregate evenly. The bottom two plots show simulated (unobserved) allele frequencies in the pool with blue lines and (observed) allele frequencies computed from simulated 50X sequencing coverage in red.
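The simulation described in the caption can be sketched roughly as follows. This is an illustrative sketch, not the authors' simulation code: the function names, the pool size of 200, the per-locus recombination probability of 0.01, and the selected allele frequency of 0.9 are all assumptions chosen for the example. It draws haploid progeny whose ancestry is fixed at a causal locus and switches between adjacent loci with a small probability, then samples 50 reads per locus to produce the noisy observed frequencies.

```python
import random

def simulate_pool_frequencies(n_individuals, n_loci, recomb_rate,
                              causal_index, causal_freq, rng):
    """Return the true (unobserved) strain-A allele frequency at each locus."""
    counts = [0] * n_loci
    for _ in range(n_individuals):
        geno = [0] * n_loci
        # Selection acts at the causal locus (hypothetical selected frequency).
        geno[causal_index] = 1 if rng.random() < causal_freq else 0
        # Ancestry switches between adjacent loci with probability recomb_rate.
        for i in range(causal_index + 1, n_loci):
            flip = rng.random() < recomb_rate
            geno[i] = 1 - geno[i - 1] if flip else geno[i - 1]
        for i in range(causal_index - 1, -1, -1):
            flip = rng.random() < recomb_rate
            geno[i] = 1 - geno[i + 1] if flip else geno[i + 1]
        for i, g in enumerate(geno):
            counts[i] += g
    return [c / n_individuals for c in counts]

def simulate_read_frequencies(true_freqs, depth, rng):
    """Sample `depth` reads per locus and return observed strain-A fractions."""
    observed = []
    for f in true_freqs:
        reads_a = sum(1 for _ in range(depth) if rng.random() < f)
        observed.append(reads_a / depth)
    return observed

rng = random.Random(0)
true_freqs = simulate_pool_frequencies(200, 100, 0.01, 50, 0.9, rng)
obs_freqs = simulate_read_frequencies(true_freqs, 50, rng)
```

Plotting `true_freqs` against `obs_freqs` should show a smooth enrichment peak near the causal locus with read-sampling noise scattered around it.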

Bulk segregant analysis with high-throughput sequencing has been applied in yeast to study drug resistance in

Pools may be selected from a single phenotypic extreme, opposite extremes, or one extreme and a control sample. Pools may also be obtained by grouping based on binary traits rather than quantitative phenotype extremes. Early studies used microarrays for pooled genotyping

Challenges

Pooled genetic mapping studies using high-throughput sequencing present a number of unique difficulties. The core statistical quantity of interest, the allele frequency in each pool, is observed only indirectly. The strain-specific read counts that are used to estimate the allele frequencies are corrupted by sampling noise at most reasonable sequencing depths, read mapping errors

However, the unbiased nature of genotyping via high-throughput sequencing results in nearly saturated marker coverage where almost all polymorphisms are queried. This avoids the laborious process of marker discovery and assay design required by earlier genotyping technologies. The dense marker coverage also allows for a high degree of information sharing, which motivates the methods underlying M

Previous statistical methods

Previous statistical approaches to analyzing pooled genotyping data have focused on alternate regimes where genetic markers are relatively sparse and measurements are relatively accurate. Often, only single loci are tested for association, necessarily ignoring data from nearby markers. Additionally, single-locus methods encounter difficulties with missing data, such as regions that are difficult to sequence or map or have very few polymorphisms.

Earlier work applied hidden Markov models (HMMs) to fine mapping within small regions with fewer markers

Approach

M

• A model-based framework that allows for information sharing across genomic loci and incorporation of experiment-specific noise sources. These methods improve on previous approaches that rely on heuristic techniques to select sliding window sizes, which may sacrifice resolution.

• Statistical tests using an information-sharing dynamic Bayesian network (DBN) that report robust location estimates and confidence intervals. The multi-locus methods allow for principled inference even in regions without strain-specific markers and reduce experimental noise when many markers are available.

• Extensions of our method to any number of replicates and multiple experimental designs, within the same principled statistical framework.

Methods

We develop inference methods for the pool allele frequency at a particular genome position, given the pooled read samples. First, we propose generative models that describe the experimental process. Next, we use these models to construct likelihood-based statistics that assess the significance of associations in multiple experimental designs.

Obtaining allele frequency measurements

All sequencing reads from a particular pooling experiment are aligned to one strain's reference genome using the short read aligner

Multi-locus model

M

Model specification

The pool is composed of _{i}, _{i }are obtained from the mapped sequencing reads. We also define _{i}, the total informative reads at each locus. This quantity is determined by the local sequencing depth and number of mappable polymorphisms. The recombination frequency

**Graphical model showing multi-locus dependencies**. Dynamic Bayesian network used by M_{2 }and its value is determined by sampling

Emission probabilities

The probability of observing a set of sequencing reads conditioned on the pool fraction at the locus and a total informative read count _{i }can be calculated using the binomial distribution:

This formulation models the read count proportion exactly with a discrete model. An approximation, applicable to high read counts, can be obtained with a Gaussian distribution:

Technical pooling variance that increases the local measurement noise, such as allele-specific PCR amplification bias, could be assumed to act in a locus-independent manner and be modeled with increased variance in this expression.
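The two emission models in this section can be sketched as below. The function and parameter names are hypothetical, and the Gaussian form (mean equal to the pool fraction, variance equal to the binomial sampling variance plus an optional locus-independent term) is an assumption about how the approximation is parameterized, not a transcription of the paper's expression.

```python
import math

def binomial_log_emission(reads_a, total_reads, pool_fraction):
    """Exact log-probability of seeing reads_a strain-A reads out of total_reads."""
    if pool_fraction <= 0.0:
        return 0.0 if reads_a == 0 else -math.inf
    if pool_fraction >= 1.0:
        return 0.0 if reads_a == total_reads else -math.inf
    log_choose = (math.lgamma(total_reads + 1)
                  - math.lgamma(reads_a + 1)
                  - math.lgamma(total_reads - reads_a + 1))
    return (log_choose
            + reads_a * math.log(pool_fraction)
            + (total_reads - reads_a) * math.log(1.0 - pool_fraction))

def gaussian_log_emission(reads_a, total_reads, pool_fraction, extra_variance=0.0):
    """Log-density of the observed read fraction under a normal approximation.

    extra_variance is a hypothetical knob for locus-independent technical
    noise such as amplification bias.
    """
    variance = pool_fraction * (1.0 - pool_fraction) / total_reads + extra_variance
    variance = max(variance, 1e-12)  # guard against a degenerate variance at 0 or 1
    observed = reads_a / total_reads
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - (observed - pool_fraction) ** 2 / (2.0 * variance))

# Compare the two models at high depth. The Gaussian value is a density over
# the read fraction, so divide by the depth to put it on the count scale.
exact_prob = math.exp(binomial_log_emission(500, 1000, 0.5))
approx_prob = math.exp(gaussian_log_emission(500, 1000, 0.5)) / 1000
```

At a depth of 1000 the two values agree closely, which is the regime where the continuous approximation is intended to apply.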

Transition probabilities

In practice, the genome segments are chosen to be small enough so that _{i }to _{i+1 }by considering the _{i}, _{i }individuals with the first strain's ancestry at locus _{i}),

Employing normal approximations for the binomial distributions and dividing by

This formulation shows that the latent allele frequencies form a first-order autoregressive Gaussian process with mean
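A small numerical check of the autoregressive behavior is sketched below. It assumes a specific reading of the transition: each individual's ancestry switches between adjacent loci independently with probability equal to the recombination frequency, so the next count is the sum of two binomial draws. Under that assumption the next allele frequency has mean `f * (1 - 2r) + r` and variance `r * (1 - r) / N`; the symbols `f`, `r`, and `N` and the function names are labels chosen for this example.

```python
import random

def transition_moments(freq, pool_size, recomb_rate):
    """Mean and variance of the next-locus allele frequency (assumed model)."""
    mean = freq * (1.0 - 2.0 * recomb_rate) + recomb_rate
    variance = recomb_rate * (1.0 - recomb_rate) / pool_size
    return mean, variance

def sample_next_frequency(freq, pool_size, recomb_rate, rng):
    """Draw the next-locus frequency as the sum of two binomial counts."""
    carriers = round(freq * pool_size)
    # Strain-A individuals that keep their ancestry at the next locus.
    stay = sum(1 for _ in range(carriers) if rng.random() >= recomb_rate)
    # Other-strain individuals that switch to strain A.
    switch = sum(1 for _ in range(pool_size - carriers)
                 if rng.random() < recomb_rate)
    return (stay + switch) / pool_size

rng = random.Random(1)
samples = [sample_next_frequency(0.8, 1000, 0.05, rng) for _ in range(500)]
empirical_mean = sum(samples) / len(samples)
analytic_mean, analytic_var = transition_moments(0.8, 1000, 0.05)
```

The pull toward 0.5 in the mean is what makes the enrichment signal decay with distance from a causal locus.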

Initial probabilities

The causal locus node induces a particular distribution over hidden states, depending on the selected population allele frequency

The normal approximation is:

Inference: discrete model

Inference of the hidden state values can proceed outwards from the causal locus, using the conditional independence structure of the model. We describe the algorithms in terms of standard HMM techniques, but note that a more general treatment in terms of message passing is also possible.

The observed data likelihood _{c }(model structure) and population allele frequency

The first term in the sum operates on an HMM with rightwards arrows in its graph, while the second term operates on an HMM with leftwards arrows (see Figure _{c }use the same graphical structure over the latent states _{c}: two chains with all rightwards arrows, separated by the conditioned node _{c}. Using this fact, we can compute the desired likelihoods with intermediate computations from a single graphical model.

We compute the product of the first three terms in the sum, _{c }computed using an HMM with no causal locus

Running the forward-backward algorithm requires considering all transitions in each chromosome block, leading to a runtime quadratic in the size of the pool: ^{2}
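To make the quadratic cost concrete, here is a minimal sketch of a scaled forward pass over the discrete pool states for a chain with no causal locus. It is an illustration under stated assumptions rather than the authors' implementation: the hidden state is the count of strain-A individuals, transitions are the convolution of two binomials, the starting distribution is binomial with frequency 0.5, and all names are hypothetical. Each locus update touches every pair of states, which is where the quadratic term comes from.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass, zero outside the support."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def transition_matrix(pool_size, recomb_rate):
    """T[x][x2] = P(next count is x2 | current count is x)."""
    size = pool_size + 1
    T = [[0.0] * size for _ in range(size)]
    for x in range(size):
        for stay in range(x + 1):
            p_stay = binom_pmf(stay, x, 1.0 - recomb_rate)
            for switch in range(pool_size - x + 1):
                T[x][stay + switch] += p_stay * binom_pmf(
                    switch, pool_size - x, recomb_rate)
    return T

def forward_log_likelihood(reads_a, depths, pool_size, recomb_rate):
    """Log-likelihood of the read counts under the no-causal-locus chain."""
    size = pool_size + 1
    T = transition_matrix(pool_size, recomb_rate)
    alpha = [binom_pmf(x, pool_size, 0.5) for x in range(size)]
    log_lik = 0.0
    for i, (y, d) in enumerate(zip(reads_a, depths)):
        if i > 0:
            # Quadratic step: every state pair (x, x2) contributes.
            alpha = [sum(alpha[x] * T[x][x2] for x in range(size))
                     for x2 in range(size)]
        alpha = [a * binom_pmf(y, d, x / pool_size) for x, a in enumerate(alpha)]
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

balanced = forward_log_likelihood([5, 5, 5], [10, 10, 10], 20, 0.01)
skewed = forward_log_likelihood([10, 10, 10], [10, 10, 10], 20, 0.01)
```

With a pool of 20, balanced read counts are far more likely under this null chain than fully skewed ones, as expected for a region with no selection.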

Inference: continuous approximation

The previous inference procedure applied to discrete hidden states where the pool composition is modeled exactly, but yielded inference algorithms that require time quadratic in the size of the pool. For large pools, we can relax this requirement and avoid the quadratic burden by modeling the allele frequency as a continuous value. The graphical model is linear-Gaussian since the transitions and observations are linear functions of the latent variables, subject to Gaussian noise. In a linear dynamical systems formulation, the model is:

Where:

The per-locus observation noise _{i }can be approximated with the sample variance from the observed _{i}, depending on _{i }and _{i}, or upper bounded by

Where:

The recursions begin with the stationary distribution parameters:

The Kalman smoothing equations use the filtered results (forward estimates) to create estimates of the hidden state using the entire observation sequence, recursing backwards:

Where:

As in the discrete section, the posterior distributions of the latent states under a null model can be used to compute the desired data likelihoods for all possible causal models. Required integrals are computed numerically using a fixed number of points. Specifically:

Since the probability distributions during inference are represented with a constant number of parameters instead of a full vector (as in the discrete case), inference is more efficient. Specifically, computing the required quantities
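A scalar Kalman filter with a backward smoothing pass for a model of this kind can be sketched as follows. The parameterization is an assumption made for illustration: latent frequency `x[i+1] = a * x[i] + b + noise` with `a = 1 - 2r`, `b = r`, process variance `r * (1 - r) / N`, observations equal to the latent frequency plus per-locus noise, and a stationary starting distribution with mean 0.5 and variance `1 / (4N)`. Function and variable names are hypothetical. Loci without informative reads are passed as `None` and simply skip the update step.

```python
def kalman_smooth(obs, obs_var, pool_size, recomb_rate):
    """Return smoothed means and variances of the latent allele frequency."""
    a = 1.0 - 2.0 * recomb_rate
    b = recomb_rate
    q = recomb_rate * (1.0 - recomb_rate) / pool_size
    mean_pred, var_pred = 0.5, 0.25 / pool_size  # assumed stationary start
    pred_means, pred_vars, filt_means, filt_vars = [], [], [], []
    for y, s in zip(obs, obs_var):
        pred_means.append(mean_pred)
        pred_vars.append(var_pred)
        if y is None:
            # No informative reads at this locus: keep the prediction.
            m, v = mean_pred, var_pred
        else:
            gain = var_pred / (var_pred + s)
            m = mean_pred + gain * (y - mean_pred)
            v = (1.0 - gain) * var_pred
        filt_means.append(m)
        filt_vars.append(v)
        mean_pred = a * m + b
        var_pred = a * a * v + q
    # Backward (Rauch-Tung-Striebel) pass over the filtered estimates.
    smooth_means = filt_means[:]
    smooth_vars = filt_vars[:]
    for i in range(len(obs) - 2, -1, -1):
        c = a * filt_vars[i] / pred_vars[i + 1]
        smooth_means[i] = filt_means[i] + c * (smooth_means[i + 1] - pred_means[i + 1])
        smooth_vars[i] = filt_vars[i] + c * c * (smooth_vars[i + 1] - pred_vars[i + 1])
    return smooth_means, smooth_vars

means, variances = kalman_smooth([0.8, None, 0.8], [0.001, 0.001, 0.001], 1000, 0.001)
```

Each locus costs a constant amount of work, so the pass is linear in the number of loci and independent of the pool size.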

Statistical tests

With these computations in place, we can compare all values of the causal locus and the trait association, measured by

The simplification occurs because the likelihood under the noncausal hypothesis at any locus is the same, namely

We perform the maximization over

Multiple experiments

We can analyze replicate experiments by forming a coupled dynamic Bayesian network. This analysis presents two replicates, but the methods generalize to any number of coupled experiments. In this situation, the same sampling distribution is induced at the shared causal locus in two coupled chains. The joint data likelihood factors since the chains are conditionally independent given the selection node

The maximization over

The numerator is the product of two single-experiment maximizations, while the denominator is the coupled model likelihood that was presented for replicate analysis.
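The two multi-experiment statistics can be illustrated at a single candidate locus as below. This sketch assumes each experiment's log10 likelihood has already been evaluated on a shared grid of candidate allele frequencies; the grid values, null likelihoods, and function names are made up for the example. Because the likelihood factors across experiments, products become sums on the log scale.

```python
def replicate_lod(loglik_a, loglik_b, null_a, null_b):
    """Coupled statistic: both replicates share one allele frequency."""
    coupled = max(la + lb for la, lb in zip(loglik_a, loglik_b))
    return coupled - (null_a + null_b)

def contrast_lod(loglik_a, loglik_b):
    """Two-pool contrast: separate frequencies versus one shared frequency."""
    separate = max(loglik_a) + max(loglik_b)
    coupled = max(la + lb for la, lb in zip(loglik_a, loglik_b))
    return separate - coupled

# Hypothetical log10 likelihoods on a three-point allele frequency grid.
pool_a = [-3.0, -1.0, -2.0]
pool_b = [-2.0, -1.5, -0.5]
rep_score = replicate_lod(pool_a, pool_b, null_a=-4.0, null_b=-3.0)
diff_score = contrast_lod(pool_a, pool_b)
```

The contrast statistic is zero when both pools prefer the same allele frequency and grows as their preferred frequencies diverge.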

Using these results, _{10 }likelihood ratios (LOD scores in the genetics community), maximum-likelihood estimates (MLE) of the causal locus location, and approximate credible intervals for the location of the causal locus. Assuming a uniform prior over causal locus locations,
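The reported summaries can be sketched from a per-locus likelihood scan as follows. The inputs are assumed to be log10 likelihoods already maximized over the allele frequency at each candidate locus, plus the single null log10 likelihood. The interval construction, which grows a contiguous window around the MLE until it holds the requested posterior mass, is one plausible choice and not necessarily the one used by the authors. All names and numbers are illustrative.

```python
def summarize_locus_scan(positions, log10_lik, null_log10_lik, level=0.9):
    """Return per-locus LOD scores, the MLE position, and a credible interval."""
    lod = [ll - null_log10_lik for ll in log10_lik]
    best = max(range(len(lod)), key=lod.__getitem__)
    # Uniform prior over loci: posterior is proportional to the likelihood.
    peak = log10_lik[best]
    weights = [10.0 ** (ll - peak) for ll in log10_lik]
    total = sum(weights)
    posterior = [w / total for w in weights]
    lo = hi = best
    mass = posterior[best]
    last = len(posterior) - 1
    while mass < level and (lo > 0 or hi < last):
        left = posterior[lo - 1] if lo > 0 else -1.0
        right = posterior[hi + 1] if hi < last else -1.0
        if left >= right:
            lo -= 1
            mass += left
        else:
            hi += 1
            mass += right
    return lod, positions[best], (positions[lo], positions[hi])

positions = [0, 1000, 2000, 3000, 4000]
lod, mle, interval = summarize_locus_scan(
    positions, [-10.0, -5.0, -2.0, -5.0, -10.0], null_log10_lik=-12.0)
```

A sharply peaked scan like this one yields an interval that collapses onto the MLE, while a flat scan produces a wide interval.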

Results

Simulation results

In order to understand the benefit of

**Mapping accuracy in simulated datasets**. Mapping accuracy is shown as root mean square error (RMSE) in kilobases (kb) from the known location. The coverage reports the average sequencing depth (reads per marker) in the experiment. Each point is calculated using 100 simulated experiments. The DBN points show the accuracy of the MLE using the M

Experimental results

We also analyze pooled sequencing data recently generated by two groups

Analyzed experiments

| **Name** | **Read length** | **Pool size** | **Coverage (rep. 1)** | **Coverage (rep. 2)** | **Source (ref.)** |
|---|---|---|---|---|---|
| 4-NQO viable | 76 | ≈10000 | 67.7 | 85.0 | |
| Control | 76 | ≈10000 | 36.1 | 79.5 | |
| Heat tolerant | 76 (paired) | ≈10000 | 152.4 | 84.8 | |
| Control | 76 (paired) | ≈10000 | 79.0 | 75.2 | |

Each condition was assayed with two replicates. Coverage is the average reads per marker. Due to the protocols used, precise quantification of the pool size is difficult. We used the listed values as conservative choices since the reported ranges are larger in most cases.

Single-locus comparisons

In cases where the associated region is localized to a single gene, we compare the LOD scores from

Large pool results

The first set of large pools was used to characterize the genetic basis of resistance to the DNA-damaging agent 4-NQO. The genes

Localization of known associated genes in large drug-selected pools

| **Dataset** | **Target** | **DBN dist.** | **1kb window dist.** | **10kb window dist.** |
|---|---|---|---|---|
| 4-NQO viable rep. 1 | | 5305 | 18355 | 14605 |
| 4-NQO viable rep. 2 | | 745 | 6195 | 3145 |
| Combined | | 805 | 755 | 3145 |
| 4-NQO viable rep. 1 | | 3223 | 15127 | 1673 |
| 4-NQO viable rep. 2 | | 5223 | 15127 | 5423 |
| Combined | | 4323 | 15127 | 5423 |

Distances are reported in bases from the MLE to the center of the target gene.

**Localization of ****using 4-NQO selected replicate 2**. The red line and shaded region show the inferred allele frequencies in the pool using M

The second set of large pools was constructed to study the genetics of heat tolerance, using repeated crosses to reduce linkage disequilibrium (increased

Localization of known associated genes in large heat-selected pools

| **Dataset** | **Target** | **DBN dist.** | **1kb window dist.** | **10kb window dist.** |
|---|---|---|---|---|
| Heat tol. rep. 1A | | 10589 | 16739 | 14739 |
| Heat tol. rep. 1B | | 10889 | 20689 | 6389 |
| Heat tol. rep. 2 | | 8889 | 2589 | 17289 |
| Heat tol. rep. 1A | | 311 | 3240 | 511 |
| Heat tol. rep. 1B | | 961 | 17670 | 1661 |
| Heat tol. rep. 2 | | 340 | 4190 | 2390 |

Distances are reported in bases from the MLE to the center of the target gene. The results for

**Localization of IRA1 using heat tolerant replicates 1 and 2**. The red and blue lines and shaded regions show the inferred allele frequencies in the two replicates using the DBN method, and the pluses plot the observed allele frequencies. The green line shows the LOD scores calculated using the DBN two-pool method. The gray box shows the position of

Conclusion

We presented

Future work could replace our uniform prior over possible causal locus locations with an informative prior that uses conservation data, functional information, or other relevant data types (as in

Abbreviations

DBN: dynamic Bayesian network; HMM: hidden Markov model; LOD: base 10 logarithm of odds; MLE: maximum likelihood estimate; QTL: quantitative trait locus; RMSE: root mean square error.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MDE and DKG conceived and designed the research. MDE performed the research. MDE and DKG wrote the paper.

Acknowledgements

We thank Ian Ehrenreich for sharing sequencing data and Shaun Mahony for helpful comments on the manuscript. M.D.E. was supported by an NSF Graduate Research Fellowship under grant no. 0645960.

This article has been published as part of