Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

This article is part of the supplement: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013)

Open Access Proceedings

Joint genotype inference with germline and somatic mutations

Eric Bareke1, Virginie Saillour1, Jean-François Spinella1, Ramon Vidal1, Jasmine Healy1, Daniel Sinnett12 and Miklós Csűrös3*

Author Affiliations

1 Division of Hematology-Oncology, Sainte-Justine UHC Research Centre, Montréal, QC, Canada

2 Department of Pediatrics, Faculty of Medicine, University of Montréal, QC, Canada

3 Department of Computer Science and Operations Research, University of Montréal, QC, Canada

For all author emails, please log on.

BMC Bioinformatics 2013, 14(Suppl 5):S3  doi:10.1186/1471-2105-14-S5-S3


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/14/S5/S3


Published:10 April 2013

© 2013 Bareke et al.; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The joint sequencing of related genomes has become an important means to discover rare variants. Normal-tumor genome pairs are routinely sequenced together to find somatic mutations and their associations with different cancers. Parental and sibling genomes reveal de novo germline mutations and inheritance patterns related to Mendelian diseases.

Acute lymphoblastic leukemia (ALL) is the most common paediatric cancer and the leading cause of cancer-related death among children. With the aim of uncovering the full spectrum of germline and somatic genetic alterations in childhood ALL genomes, we conducted whole-exome re-sequencing on a unique cohort of over 120 exomes of childhood ALL quartets, each comprising a patient's tumor and matched-normal material, and DNA from both parents. We developed a general probabilistic model for such quartet sequencing reads mapped to the reference human genome. The model is used to infer joint genotypes at homologous loci across a normal-tumor genome pair and two parental genomes.

We describe the algorithms and data structures for genotype inference, model parameter training. We implemented the methods in an open-source software package (QUADGT) that uses the standard file formats of the 1000 Genomes Project. Our method's utility is illustrated on quartets from the ALL cohort.

Background

Acute lymphoblastic leukemia (ALL) is the most common paediatric cancer and the leading cause of cancer-related death among children. Advances in the understanding of the pathobiology of ALL have led to risk-targeted treatment regimes and increased survival rates, but treatment is still far from optimal. Childhood ALL arises after the acquisition of a series of DNA sequence abnormalities. These initiating events, or so-called driver mutations, ultimately confer a selective growth advantage, and are causally implicated in cancer development. A central goal of cancer genome analysis is the identification of cancer genes that, by definition, carry driver mutations.

Next-generation sequencing (NGS) technologies [1] have enabled the genome-wide identification of human disease-related variants. Analysis pipelines have been established for the large-scale sequencing of individual tumor genomes [2]. Briefly, short sequencing reads are collected from the tumor sample, mapped to the reference genome assembly, and the set of aligned reads are used to infer variations across the genome at homologous loci covered with multiple reads. The sequence variants in the tumor genome may be the result of somatic mutations, or constitutional variants preserved in the somatic lineage. In order to distinguish somatic mutations from conserved variants, it is necessary to sequence normal and tumor samples side by side. The Cancer Genome Atlas Network [3] catalogues somatic mutations in different cancers using such normal-tumor pairs.

In general, genetic relationships (like normal-tumor pairs) can be efficiently exploited in genotype inference [4,5]. Inherited and de novo mutations can be traced through jointly sequenced family relatives [6]. Here, we consider variant detection in normal-tumor pairs coupled with parental samples. Such quartet data are used to categorize variants in the tumor and normal genomes by their origin: see Figure 1. One can readily classify inherited variants and de novo germline mutations by comparing the genotypes in the trio of normal and parental genomes. Likewise, somatic mutations correspond to differences between the normal and tumor genomes.

thumbnailFigure 1. Alignments for the normal, tumor, and parental reads samples highlight putative somatic and germline mutations. The illustration shows a region along chromosome 12, displayed in the Integrative Genomics Viewer [7]. Mismatched bases are highlighted.

While sequencing an entire human genome is still too expensive for the average research laboratory, various target-enrichment techniques [8] are available for sequencing only regions of interest. In particular, sequencing the so-called exome covering all gene-coding regions, has been a routine step in medical applications [9]. Through our ongoing paediatric oncogenomics study, we conducted whole exome deep re-sequencing of a unique cohort of over 120 exomes of childhood ALL quartets, consisting of the patient's tumor and matched-normal material as well as DNA from both parents.

Existing software tools

Various bioinformatics tools have been developed for genotyping individual genomes from sequencing data, including SNVMix [10], VarScan [11], and The Genome Analysis Toolkit GATK [12,13]. A couple of methods have been developed for the purpose of joint genotyping of paired normal-tumor samples, including SomaticSniper [14], MutationSeq [15], and JointSNVMix [16]. SomaticSniper and MutationSeq employ machine-learning techniques for variant classification; JointSNVMix is based on a full Bayesian model incorporating prior genotype distributions, somatic mutations, and sequencing base call errors. The Strelka software package [17] infers joint tumor-normal genotypes in a Bayesian model that also considers tumor sampling impurity: DNA collected from the tumor sample is usually "contaminated" to some degree with the normal tissue, and therefore the sequencing reads come from a mixture of normal and tumor genomes. To our knowledge, no existing variant caller incorporates somatic and germline mutation models simultaneously to handle quartet data as in our data sets.

Our contribution

We infer the four genotypes jointly in a framework that respects the rules of inheritance in the germline and somatic lineages. Aside from assigning belief to de novo and somatic mutations, we hypothesized, constrained patterns in one lineage have an indirect beneficial effect on the inference in other lineages. In particular, the "triangulation" of the normal genome by related genomes means that genotypes and lineage-specific mutations can be resolved more reliably: information from the parental genotypes reinforce the inference of somatic mutations, and tumor sequencing reads help to recognize constitutional mutations. We present a Bayesian framework that incorporates prior parental genotypes, inherited, de novo and somatic mutations, as well as tumor-sampling impurity and sequencing errors. All model parameters are estimated in an expectation-maximization algorithm [18].

Methods

Probabilistic model

Figure 2 illustrates our model of single-nucleotide polymorphisms at homologous loci across four genomes linked by inheritance and somatic mutations. The model quantifies the descent-by-modification relationships between the unknown genotypes via three sets of parameters. First, a genotype frequency model is assumed for the parental genotypes. Second, we assume a standard DNA substitution model for the frequency of germline mutations. The parental diploid genotypes determine the child's normal genotype by Mendelian inheritance. (For simplicity, we discuss only diploid genotypes: our implementation considers sex chromosomes in an analogous manner, but using the appropriate inheritance model.) Finally, another DNA substitution model with its own parameters determines mutations in the tumor genome.

thumbnailFigure 2. Probabilistic model for base calls covering a single locus, with dependencies among genotypes and sequencing reads. See Equation (5) for the corresponding likelihood formula.

Base calls are assumed to be independent between different loci. The input for genotype inference at a single locus consists of nucleotide base calls made with their accompanying sequencing error probabilities.

Parental genotype priors

Let π[x] denote the frequency of each allele x ∈ {A; C; G; T} at a given locus. The prior allele frequencies are computed by using a standard DNA substitution model quantifying the divergence from the reference genome assembly. Assuming the simple Jukes-Cantor model with a reference nucleotide genotype <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M56">View MathML</a> for <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M68">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M69">View MathML</a>, where v is the parental genome's divergence from the reference. More general divergence models and known SNP frequencies can be accommodated by using <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M1">View MathML</a>, where <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M77">View MathML</a> is the model's substitution probability from <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M70">View MathML</a> to <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M71">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M72">View MathML</a> is a known allele frequency.

For a diploid locus, let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M73">View MathML</a> denote the frequency of homozygous <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M74">View MathML</a> and heterozygous <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M75">View MathML</a> genotypes. Allele frequencies immediately determine diploid allele frequencies in standard Hardy-Weinberg equilibrium: <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M57">View MathML</a> for a homozygous locus <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M74">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M58">View MathML</a> for a heterozygous <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M75">View MathML</a>. In order to accommodate different homozygous-heterozygous ratios, we employ an additional parameter <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M76">View MathML</a>.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M2">View MathML</a>

(1)

(In other words, γ is numerically analogous to the so-called inbreeding coefficient, or F-statistic.) It is easy to verify that haploid allele frequencies are the same at any γ setting.

Mutations and inheritance

The child's normal genotype is determined by Mendelian inheritance and de novo point mutations with probabilities vxy that occur within the parental germlines. For simplicity, germline mutations in a parental lineage (gF, gM for father and mother, respectively) are conceived of as mutations that result in a diploid genotype<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M3">View MathML</a>, which then determine the child's normal genotype by Mendel's laws.

Germline mutations follow standard molecular evolution model for substitutions in DNA. Let X denote the parent's normal allele at a locus, and X' denote the same allele at the end of the germline before gametogenesis. The mutation model specifies the probabilities that apply to every locus <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M4">View MathML</a>. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M5">View MathML</a> denote the probability of normal genotype gN given the mutated parental genotypes <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M6">View MathML</a>. Then χ may be 0, 1, 1/2 or 1/4, depending on the common alleles between the three genotypes.

Tumor genotype

The tumor genome undergoes mutations following the same type of molecular evolution model as the one used for germline mutations, but has its own parameters.

Sequencing errors

The alignment at the locus is represented as a set of basecall-error probability pairs (zk, k) for k = 1, ..., m. The k-th read calls allele zk with an error probability of 0 <k ≤ 1. Typically, the aligned sequencing reads give nucleotide base calls and accompanying error probabilities on a logarithmic integer scale [19,20], called the Phred scale. (In principle, the input SAM- or BAM-format file gives the Sanger-encoded sequencing error in the QUAL column or the OQ tag.) Namely,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M7">View MathML</a>

where qk is the reported quality score, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M8">View MathML</a>. Let Z denote the base call, and Y denote the true nucleotide. We assume that errors are unbiased in the sense that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M9">View MathML</a>

Allele sampling and sample impurity

Aligned sequencing reads randomly sample the haploid alleles at a given locus. Let yk be the true allele for base call zk. The locus' diploid genotype determines the frequency <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M10">View MathML</a> for each possible allele y. At a homozygous locus xx, ρ[y] = 1 and ρ[x'] = 0 for all x' x. At a heterozygous locus xx', ρ[y] = 1/2, ρ[x'] = 1/2 and ρ[x"] = 0 for all x" x, x'.

Impure tumor samples have a mixed distribution, which is the linear combination of the normal and tumor genotype distributions. For tumor reads, ρ[y] comes from a mixture (0 ≤ ω ≤ 1; ω = 1 for pure tumor sampling) between the normal and tumor genotypes:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M11">View MathML</a>

(3)

Likelihood for aligned reads given the genotypes

Suppose we are given the set of basecall-error probability pairs (zk, k) for k = 1, ..., m, representing the alignment at a locus. Let Zk be the random variable for base call in read k at a fixed error rate k, and let Yk be the random variable for the true sampled allele. Define the base call probability

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M12">View MathML</a>

Hence, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M13">View MathML</a>.

The read likelihood for a given allele distribution ρ is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M14">View MathML</a>

Since base calls are independent across reads when conditioned on the allele mixture in the sample,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M15">View MathML</a>

(4)

Complete likelihood

Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M16">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M17">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M18">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M19">View MathML</a>denote the diploid genotypes for father, mother, normal, and tumor samples, respectively. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M20">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M21">View MathML</a> denote the parental genotypes after germline mutations. These six random variables constitute the hidden variables in our probabilistic model. The input is a set of aligned sequenced bases from each of the four samples:<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M22">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M23">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M24">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M25">View MathML</a>. Every set <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M26">View MathML</a> consists of base call and error pairs (zk, k). The likelihood for the aligned base calls is then

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M27">View MathML</a>

(5)

The <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M59">View MathML</a> factor is the likelihood for the reads, given the genotypes. By the independence of the sequencing runs,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M28">View MathML</a>

(6)

The four factors are defined by (4), via the allele frequencies ρ that are determined by genotypes and tumor sampling purity, as discussed above (see Allele sampling and sample impurity).

The <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M60">View MathML</a> factor in (5) covers all mutation and inheritance events, as well as the parental genotypes. By the dependencies depicted in Figure 2,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M29">View MathML</a>

(7)

Algorithmic techniques and data structures

Our algorithmic solutions address the efficient calculation of the likelihood formula of (5), and its use in an Expectation-Maximization (EM) framework for model parameter setting. First, we examine a straightforward decomposition of the likelihood formula dictated by the assumed probabilistic graphical model.

For the EM algorithm, we need to recompute likelihoods and posterior probabilities in a number of iterations, which can be directly achieved by storing all sequencing reads in memory, but such an approach may be costly. We scrutinize the computation of read likelihoods, in order to arrive at an economical data structure, also discussed in some detail, that eliminates the need to store all base calls in memory.

Likelihood decomposition

The summation formula for the full likelihood in Equation (5) is rearranged for efficiency, using the independencies apparent in (6) and (7). In addition, the germline mutations can be combined with Mendelian inheritance: define <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M30">View MathML</a> as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M31">View MathML</a>

Then

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M32">View MathML</a>

(8)

If there are G possible diploid genotypes (G = 10 for DNA with four alleles), Equation (8) shows that the likelihood can be computed in O(G3) time, instead of O(G6) suggested by the definition of (5). In particular, the likelihood computation proceeds by calculating the following values.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M33">View MathML</a>

Read likelihoods

Equation (4) is conveniently rearranged by different base calls:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M34">View MathML</a>

(9)

Each subproduct is calculated separately as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M35">View MathML</a>

(10)

by Equation (2). Note that only the called base's frequency β = ρ[z] appears in the formula. Define

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M36">View MathML</a>

If no reads call z, then Tβ[z] = 1. Equation (10) becomes Lz(ρ) = Tρ[z][z]. For pure diploid samples, ρ[z] may be 0, 1 or 1/2, corresponding to the possible subproducts for diploid samples E[y] = T0[y] (sequencing error), C[y] = T1[y] (homozygote yy), and H[y] = T1/2[y] (heterozygote with y). Likelihood formulas become even more economical with normalized subproducts C'[y] = C[y]/E[y], H'[y] = H[y]/E[y], and scaling factor <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M37">View MathML</a>

Homozygous sample. For a pure homozygous sample with genotype yy (ρ[y] = 1 and ρ[z] = 0 for z y),

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M38">View MathML</a>

(11)

Heterozygous sample. For a pure heterozygous sample yy' (y y'; ρ[y] = ρ[y'] = 1/2 and ρ[z] = 0 for z y, y'), the likelihood becomes

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M39">View MathML</a>

(12)

Impure tumor sample. The pure tumor <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M61">View MathML</a> and normal genotypes <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M62">View MathML</a> are proper diploid genotypes. Tumor sequencing reads come from an impure sample: they sample the tumor genotype with probability ω, and the normal genotype with probability (1 - ω). By Eq. (3), identical genotypes correspond to identical allele frequencies ρ, no matter what the purity level ω is. Suppose, however, that the locus has a mixture of divergent normal and tumor genotypes. Figure 3 shows that there are up to four correct base calls appearing in the reads, depending on the tumor mutation pattern <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M63">View MathML</a>. There are 6 possible queried allele frequencies β ≠ 0, 1, 1/2 (Figure 3):

thumbnailFigure 3. Allele frequencies with impure normal-tumor genotypes that differ from each other. Reads sample the tumor with probability ω, and the normal genotype with 1 - ω.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M40">View MathML</a>

(13)

If the normal and tumor genotypes are identical, then the purity is immaterial. The formulas with somatic mutations are:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M41">View MathML</a>

(14a)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M42">View MathML</a>

(14b)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M43">View MathML</a>

(14c)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M44">View MathML</a>

(14d)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M45">View MathML</a>

(14e)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M46">View MathML</a>

(14f)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M47">View MathML</a>

(14g)

Data structure for storing base calls

Storing individual base calls at each locus is costly, because the sequencing coverage may be large across the four samples. It suffices, however, to store the partial likelihood factors appearing in Equations (11), (12) and (14). In particular, for each locus with mapped base calls, it is enough to store the sample-specific H'[x], C'[x], and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M48">View MathML</a> values for each called base x and the six possible values of β, in addition to a single scaling value E. For a sample with m base calls at the locus, (1 + 8m) variables thus suffice, independently of read coverage. The stored partial likelihoods are reused throughout the iterations optimizing the model parameters at a fixed tumor purity level ω.

Recurrent base calls

In our experience, loci with identical sets of base calls reoccur at an appreciable frequency, especially at lower coverages (less than about 15×) that characterize exon boundaries in exome sequencing. We exploit recurrent patterns in the piled-up base calls to achieve even better memory usage and speed. Namely, we sort the base calls <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M49">View MathML</a> at a given locus for a given sample (by allele and quality score), and use run-length encoding to achieve a compact characterization <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M50">View MathML</a>. The encoding is not used for higher coverages or widely varying quality scores, where h would take too many bits. Information-theoretic considerations [21] suggest that compactly encoded <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M51">View MathML</a> occur more often (<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M64">View MathML</a> is our proxy for Kolmogorov complexity). Short codes <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M65">View MathML</a> are placed in a small hash table to find recurrent calls (in our experiments with 20-30× total coverage by AB SOLiD sequencing reads, 20-30% savings can be achieved this way).

Parameter optimization and genotype inference

The genotypes, germline and tumor mutations are inferred by carrying out the summation of (5) for a restricted set of genotypes in order to calculate posterior probabilities. For example, in order to infer the child's normal genotype, calculate first

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M52">View MathML</a>

Then the child's normal genotype has posterior probabilities <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M53">View MathML</a>.

Model parameters are optimized using the EM algorithm [18]. In one iteration, likelihoods and various posterior probabilities are computed across all loci in the so-called E-step, which are then used to set the model parameters for the next iteration in the so-called M-step. The iterations continue until convergence is achieved. Among the optimized model parameters, the parental genotype priors, the germline and tumor mutation parameters are optimized through multiple iterations using the same set of precalculated partial likelihoods.

Setting the single parameter of the Jukes-Cantor model for the germline and tumor mutations is fairly straightforward by using posterior probabilities. For example, the tumor mutation parameter <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M66">View MathML</a> is set by summing the posterior probabilities for allele substitutions across all loci:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M54">View MathML</a>

(15)

where N is the number of loci, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M67">View MathML</a> is the posterior probability for a normal-tumor genotype pair at locus j, and α(g, g') is the expected number of substitutions for the two alleles given the diploid genotypes.

In order to set the tumor purity ω, the partial likelihoods need to be recomputed (for different Tβ values) by reading the input read-mapping files at each iteration. At the same time, we compute a calibrated map <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S5/S3/mathml/M55">View MathML</a> from reported base-calling qualities to sequencing errors in the same framework. Note that both μ and ω have well-estimated initial values (μ starts with the canonical Phred-scaled values, and ω is estimated experimentally).

Decomposing zygosity and divergence

For the purposes of parameter inference, consider the following machine realizing the formulas for the parental genotype priors of (1). Upon receiving a heterozygous genotype xy, it flips either allele to output homozygous xx or yy with γ /2 probability each. Otherwise, with probability (1 - γ), the output is the same heterozygote xy. Homozygous genotypes are output without any change. Clearly, if the input genotype distribution is for Hardy-Weinberg, then the machine's output is distributed by the probabilities of (1). Accordingly, the divergence and heterozygosity parameters for the parental genotype prior ϕ are inferred by treating the machine's input genotype as a hidden variable. Expected frequencies for divergent input genotypes are used to estimate the divergence parameter, and expected frequencies for heterozygous → homozygous "flips" are used to estimate γ.

Sequencing data

Exome sequencing

We conducted validation experiments using exome-sequencing reads for two sets of quartets (A and B) generated on the Child Health Genomics Platform of the Sainte-Justine UHC Research Center. Sequencing reads were produced on Applied Biosystem's SOLiD sequencer and mapped with the accompanying software. Table 1 summarizes statistics on the mapped sequencing reads.

Table 1. Coverage statistics for two quartets used in validation experiments.

Whole-genome sequencing

Tumor and normal DNA samples from Quartet B were submitted to Illumina, Inc, for deep whole-genome sequencing using the standard operating procedures of the HiSeq 2000 sequencing platform. Table 1 summarizes coverage statistics and tumor impurity. The whole-genome data was further analyzed for somatic mutations with CASAVA and Strelka [17] by Illumina, Inc.

Implementation

We incorporated the presented methods into an open-source Java software package called QUADGT, using the standard file formats of the 1000 Genomes Project (SAM v1.4 [20] for input and VCF v4.1 [22] for output). Any of the input files may be missing, which makes QUADGT suitable to analyze sets with just normal-tumor samples, or just parental-offspring trios.

The probabilistic framework enabled us to couple the inference with confidence measures in the form of quality scores computed from posterior probabilities. The quality scores accompany sample-specific genotype calls (VCF's GQ field), as well as the joint genotyping for the four samples (VCF's QUAL column). The posterior probabilities for germline and somatic mutations are calculated, as well, by summing across all pertinent genotype assignments. The program specifically introduces ambiguity in the genotyping calls to meet prescribed quality scores: a definite x/y base call in a sample is replaced by x/. or ./y, and then by ./. in a greedy manner, in order to achieve high specificity.

Parallelization

The E-step of the optimization, where sums of posterior probabilities are calculated across loci, is well-suited to parallelization. In our implementation, we use a hash key computed at each locus to assign the calculations to different computing threads running in parallel.

Availability

The QUADGT software package is publicly available at http://www.iro.umontreal.ca/~csuros/quadgt/ webcite.

Results and discussion

We used two quartet data sets (A and B) to compare independent and joint variant detection. Figure 1 summarizes the coverage statistics for the quartets. Exome-sequencing reads at 5-6× coverage per sample were mapped to the human reference, and we inferred genome variants using our software package QUADGT. The entire analysis pipeline for one quartet set, including model training and genotyping, took about 12 hours (wall-clock time) on standard multi-core computer workstations with 16 Gbytes of memory.

Exome-sequencing data from Quartet A was used to assess the concordance of genotyping calls by QUADGT and a well-established variant caller, The Genome Analysis Toolkit [12]. We used independently produced whole-genome (WG) sequencing reads for the normal-tumor pair in Quartet B (with 124× total coverage, see Table 1) to gauge the two variant callers' sensitivity.

Concordance experiments

Table 2 compares individual genotyping calls made by the Genome Analysis Toolkit [12] and QUADGT on a small example consisting of calls on chromosome 12 (parameters were set to result in a comparable number of genotyping calls). The table illustrates that most calls are made in agreement between the two programs. The known relationships between the samples ensure the consistency of calls made by QUADGT, resulting in only 4 putative de novo germline mutations. GATK, ignorant of the relations, has 327 cases where a normal allele does not appear at either parent, which is by at least two magnitudes higher than what one would expect based on human intergeneration mutation rates [6]. GATK genotypes imply a large number (520) of somatic mutations, as well. As with de novo mutations, the joint calls by QUADGT are more conservative: only 14 somatic mutation calls are made.

Table 2. Comparison of calls made by the Genome Analysis Toolkit and QUADGT on Quartet A.

Sensitivity assessment

The normal-tumor pair in Quartet B was submitted to Illumina, Inc. for deep whole-genome (WG) sequencing and somatic mutation calling. Table 3 tallies the WG somatic calls with coverage by exome data. Based on the WG sequencing data, 1817 loci have somatic mutations, of which 40 are covered by exome reads to sufficient depth (Table 3, b). Some of the 40 WG somatic calls have no or weak support (lines c and e) in exome reads, since with at most one exception, all normal and tumor base calls are identical with the reference. The remaining 24 WG somatic calls (line f) with sufficient exome read coverage and non-reference base calls are not beyond the reach of the variant callers to discover somatic mutations. On closer inspection, 5 WG somatic calls fall into a 150 bp region (line g), which, judged by largely divergent base calls, is likely to be misaligned; 4 WG somatic calls (line h) may even be erroneous. For instance, at chr8:10078796, which has a fairly low WG somatic quality score, the parental exome reads suggest that all four samples contain a heterozygous SNP that is in fact found in dbSNP (rs112078536). The remaining 15 WG somatic calls (line i) have support in the exome reads, and QUADGT discovers them all at some quality threshold cutoffs.

Table 3. Whole-genome somatic loci and exome genotyping on Quartet B.

Table 4 compares the sensitivity of joint and independent genotyping using the 15 WG somatic calls with support in exome base calls (Table 3, line i). First, it is notable that QUADGT's tumor purity estimation (by Expectation-Maximization) is close to the Illumina's estimate from whole-genome data (see Table 1).

Table 4. Exome somatic calls supported by whole-genome data.

Table 4 suggests that the joint variant-calling in QUADGT leads to better sensitivity than GATK, which does not consider the relations between the genomes. In particular, 9 out of 276 (3%) SOMATIC calls by QUADGT with quality score at least 30 have support in the WG data, whereas only 7 of GATK's 667 (1%) divergent genotypes of same quality threshold are validated. At a quality score cutoff of 20, 12 out of 555 (2%) and 7 out of 1312 (0.5%) are validated QUADGT and GATK predictions.

Conclusions

Sequencing multiple genomes with known pedigrees or clonal relationships has a great promise for understanding the development of particular diseases. Our experiments with sequenced quartets of parents and normal-tumor pairs illustrate that the joint calls improve the reliability of inferred de novo and somatic mutations. The constraints imposed by the known relationships greatly improve the consistency of calls between different samples, and ultimately help to delineate the single nucleotide polymorphisms that can be associated with the disease.

A future release of the software is now under development that incorporates more nuanced substitution models with variable rates, transition-transversion ratios and nucleotide composition, as well as site-specific priors relying on public variant databases and gene annotations.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The authors are indebted to the patients and their parents for participating in this study. This study was supported by research funds provided by the Terry Fox Research Institute, the Canadian Institutes for Health Research, and Canada's National Sciences and Engineering Research Council. JFS is the recipient of a Cole Foundation studentship. RV is the recipient of a post-doctoral research fellowship from the Government of Canada through the "Foreign Affairs and International Trade Canada." DS holds the François-Karl-Viau Research Chair in Pediatric Oncogenomics.

Declarations

Funding for the publication of this article was provided by a grant from the National Sciences and Engineering Research Council.

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.

References

  1. Shendure J, Li H: Next-generation DNA sequencing.

    Nature Biotechnology 2008, 26(10):1135-1145. PubMed Abstract | Publisher Full Text OpenURL

  2. Wood LD, Parsons DW, Jones S, Lin J, Sjöblum T, et al.: The genomic landscapes of human breast and colorectal cancers.

    Science 2007, 318(5853):1108-1113. PubMed Abstract | Publisher Full Text OpenURL

  3. The Cancer Genome Atlas Research Network: Comprehensive genomic characterization defines human glioblastoma genes and core pathways.

    Nature 2008, 455:1061-1068. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Le SQ, Durbin R: SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples.

    Genome Research 2011, 21(6):952-960. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Li B, Chen W, Zhan X, Busonero F, Sanna S, Sidore C, Cucca F, Kang HM, Abecasis GR: A likelihood-based framework for variant calling and De Novo mutation detection in families.

    PLoS Genetics 2012, 8(10):e1002944. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Roach JC, Glusman G, Smit AFA, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ: Analysis of genetic inheritance in a family quartet by whole-genome sequencing.

    Science 2010, 328(5978):636-639. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Intergrative genomics viewer.

    Nature Biotechnology 2011, 29:24-26. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ: Target-enrichment strategies for next-generation sequencing.

    Nature Methods 2010, 7:111-118. PubMed Abstract | Publisher Full Text OpenURL

  9. Teer JK, Mullikin JC: Exome sequencing: the sweet spot before whole genomes.

    Human Molecular Genetics 2010, 19:R145-R151. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Goya R, Sun MG, Morin RD, Leung G, Ha G, Wiegand KC, Senz J, Crisan A, Marra MA, Hirst M, Huntsman D, Murphy KP, Aparicio S, Shah SP: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors.

    Bioinformatics 2010, 26(6):730-736. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, Ding L: VarScan: variant detection in massively parallel sequencing of individual and pooled samples.

    Bioinformatics 2009, 25(17):2283-2285. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data.

    Genome Research 2010, 20(9):1297-1303. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. DePristo MA, et al.: A framework for variation discovery and genotyping using next-generation DNA sequencing data.

    Nature Genetics 2011, 43:491-498. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, Ley TJ, Mardis ER, Wilson RK, Ding L: SomaticSniper: identification of somatic point mutations in whole genome sequencing data.

    Bioinformatics 2012, 28(3):311-317. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Ding J, Bashashati A, Roth A, Oloumi A, Tse K, Zeng T, Haffari G, Hirst M, Marra MA, Condon A, Aparicio S, Shah SP: Feature based classifiers for somatic mutation detection in tumour-normal paired sequencing data. [http:/ / bioinformatics.oxfordjournals.org/ content/ early/ 2011/ 11/ 13/ bioinformatics.btr629.abstract] webcite

    Bioinformatics 2011. OpenURL

  16. Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A, Hirst M, Turashvili G, Oloumi A, Marra MA, Aparicio S, Shah SP: JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data.

    Bioinformatics 2012, 28(7):907-913. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Saunders CT, Wong WSW, Swamy S, Becq J, Murray LJ, Cheetham RK: Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs.

    Bioinformatics 2012, 28(14):1811-1817. PubMed Abstract | Publisher Full Text OpenURL

  18. Dempster AP, Laird NM, Rubin DP: Maximum likelihood from incomplete data via the EM algorithm.

    Journal of the Royal Statistical Society Series B 1977, 39:1-38. OpenURL

  19. Ewing B, Green P: Base-calling of automated sequencer traces using Phred: II. error probabilities.

    Genome Research 1998, 8:186-194. PubMed Abstract | Publisher Full Text OpenURL

  20. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The Sequence Alignment/Map format and SAMtools.

    Bioinformatics 2009, 25(16):2078-2079. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Li M, Vitányi P: An Introduction to Kolmogorov Complexity and Its Applications. 3rd edition. Springer Science+Business Media; 2008. OpenURL

  22. Danecek P, Auton A, et al.: The variant call format and VCFTools.

    Bioinformatics 2011, 27(15):2156-2158. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL