Open Access Highly Accessed Methodology article

Population-based rare variant detection via pooled exome or custom hybridization capture with or without individual indexing

Enrique Ramos1, Benjamin T Levinson1, Sara Chasnoff1, Andrew Hughes1, Andrew L Young1, Katherine Thornton1, Allie Li2, Francesco LM Vallania1, Michael Province1 and Todd E Druley134*

Author Affiliations

1 Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 660 S. Euclid Ave., St. Louis, MO, 63108, USA

2 Department of Internal Medicine, Division of Cardiology, Washington University School of Medicine, 660 S. Euclid Ave., St. Louis, MO, 63108, USA

3 Department of Pediatrics, Division of Hematology and Oncology, Washington University School of Medicine, 660 S. Euclid Ave., St. Louis, MO, 63108, USA

4 Present address: Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4444 Forest Park Avenue, Box 8510, St. Louis, MO, 63108, U.S.A

For all author emails, please log on.

BMC Genomics 2012, 13:683  doi:10.1186/1471-2164-13-683

Published: 6 December 2012

Additional files

Additional file 1:

Additional Table 1. A summary of the individuals, multiplexing, DNA input / person, hybridization reagents, sequencing platform, and use of indexing for the experiments described in this report. LLFS = Long Life Family Study participants. Additional Table 2. Individual and pooled exome sequencing metrics. Each of the five individual exomes was sequenced on 2 separate lanes of the Illumina Genome Analyzer (IIx) platform. Raw reads were aligned by Novoalign against the human NCBI reference sequence hg18 and variants were called by SAMtools using a ≥5-fold coverage threshold. As shown here, the amount of raw data and percentage of aligned reads was highly uniform for each independent sample. The pooled sample was sequenced on the Illumina HiSeq 2000 platform and aligned using Novoalign with only the exome reference target sequences with 76 bp flanks for each target. Given the difference in reference sequence, more raw data was discarded which artificially reduces the expected fold-enrichment. However, as evidenced by the mean target coverage of 1,389-fold (278-fold/person), we had more than ample data for variant calling at each target sequence. Additional Figure 1. SPLINTER error models generated in custom pooled sequencing analysis. These graphs model the percentage of sequencing errors (Y-axis) for each sequencing cycle (X-axis) for every type of substitution in the forward read (panel A) and the reverse read (panel B). The black line is the sum of all individual error rates. Similar plots were generated from 392 non-variant bases in the FABP1 locus for the pooled exome experiment. Because this error modeling is done with each experiment, SPLINTER’s accuracy should remain consistent regardless of the Illumina sequencing platform used. Pr = probability (e.g. Pr (T∣C) means the probability of seeing an erroneous T given that the wild type base at that position is a C). Additional Figure 2. Using the NCBI CCDS exome definitions (http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi), there are a maximum of 7,087 positions in common between the Agilent 38 Mb exome and Affymetrix 6.0 GWA. Due to “no-calls” on either platform, there was an average of 6,725 positions/individual used to calculate sensitivity and specificity. Sequencing calls were made using a coverage threshold of 5-fold/chromosome. For pooled exome sequencing, we generated 1.44x109 total paired-end reads, of which 66.82% aligned to the annotated reference. At 76 bp per read, this equates to an average of 192-fold coverage of every base in the 38 Mb capture per allele in our 10 allele pool. Aligned reads in “Native” format from Novoalign were converted to a format compatible with SPLINTER for SNV calling using a custom Perl script (found online at http://druleylab.wustl.edu). A) Sensitivity for individual and pooled exome variant calling as compared to an average of 2,937 true variant positions detected by Affymetrix 6.0 array. To be considered accurate, copy number must match (i.e. heterozygous vs. homozygous) between the array and the sequencing. The average sensitivity (defined as identifying the same genotype) is 97% for single exomes and 99.4% for the pooled sample. The pooled sequencing compares an aggregate of minor allele frequencies from the array data against the SPLINTER calls for these variants. SPLINTER called 2,937 variant positions (of 6,725). B) Specificity was determined by comparing the remaining wild-type array positions, an average of 4,337 for each sample, to sequencing calls. The average specificity across all five individual exomes was 99.8%. Specificity for the pooled sample (99.7%) is calculated by dividing the number of true negative base calls (by array and sequencing; N = 3,788) by the sum of the number of true negatives and false positives (wild type by array, but variant by sequencing). Additional Table 3. Individually genotyped variant positions from pooled exome sequencing. The asterisk (*) denotes that in this pool of 4 females and 1 male, we assumed nine X-chromosomes and the minor allele frequencies are rounded accordingly. There were a total of 127 variant positions not-included on the genome-wide array that were individually validated by individual genotyping using Sequenom MassArray. Six of these positions were novel variants not found in dbSNP 132. NC = no call made by the SPLINTER algorithm. Additional Figure 3. Discordance between single and pooled exome minor allele frequency estimates is correlated to a lack of sequencing coverage. The 2,937 variant positions analyzed in Figure 1 were graphed against the relative coverage per variant position (vertical axis). The broader base of the plot (dark blue cubes) demonstrates that outliers are most likely due to a lack of sequencing coverage in either the individual or pooled sequencing rather than the introduction of systemic false positive artifacts. Additional Table 4. Candidate genes chosen for resequencing from the Long Life Family Study (LLFS). These 464 genes were chosen by LLFS investigators for resequencing based on published data relevant to the study aims. Additional Figure 4. Concordance in INDEL identification between pooled custom capture sequencing analysis by SPLINTER and SAMtools from the same indexed individuals. Of the 110 INDELS called with ≥20-fold coverage in individual analysis, we identified 99 (90%) with SPLINTER. For rare INDEL calling, 54 of the 110 were found at a frequency of <5%, and SPLINTER detected 45 (83.3%). Additional Table 5. Blockers for hybridization. Additional Figure 5. Raw and aligned read uniformity. The raw and aligned, unique read counts per individual in four hybridizations of 22–24 individuals (Additional Table 1, rows 7–10) and two hybridizations of 48 individuals (Additional Table 1, rows 14–15). Individuals are ranked from lowest total read count to highest total read count by total reads per person (black and dark blue symbols). Dark and light blue symbols: the open squares, circles, triangles, and closed circles indicate the four different captures of 22–24 individuals. Black and gray symbols: the open squares and circles indicate the two captures of 48 individuals each. The distribution of symbols across each dataset demonstrates that individual results did not cluster within captures. The table lists the metrics for each lane of sequencing, the fold difference between individuals within each lane and the percentage of indexes identifiable within each raw data set. Additional Figure 6. PCR duplicates and capture efficiency for multiplexes of 22–24 (Additional Table 1, rows 7–10). The graph shows the individual percent of: PCR duplicates (red symbols), aligned data on target (green symbols) and aligned data on and near targeted intervals (blue symbols). The two samples with very low read counts are not shown and the rest of the samples (n=90) are ranked from lowest total read count to highest total read count. Different shapes and filling indicate the four independent captures that were combined for sequencing. “On and near target” is defined as bases that align on or within 250 bp of target intervals after removing duplicates. “On target” is defined as bases that align directly on targeted intervals after removing duplicates. The “Individual samples” numbers correspond to the ones used in Additional Figure 3. Additional Figure 7. Targeted intervals ranked by coverage achieved between different samples and captures. One sample from each of the four multiplexes of 22–24 was chosen for this plot. Observed indexes ACAGATA, ATCAGCA, TGCTGGG, TTACGAT where chosen, with read counts ranging from (1,580,586 - 1,599,797). Of the 6,966 distinct targeted intervals, the 6,604 that achieved greater than 0x average coverage in all 4 of these samples were considered for this plot. All intervals were ranked by average coverage within each sample with 1 being the highest and 6,604 being the lowest and plotted. Between all four samples, the average pairwise R2 = 0.936, showing a high level of uniformity between captures. To determine if candidate regions were enriched uniformly across the four multiplexes, we ranked the 6,966 intervals that were covered by baits by coverage in one sample from each of the four captures. Comparing the ranks of all 6,604 intervals with over >0X coverage in each of the four samples, the average pairwise R2 = 0.936 (Additional Figure 4). Thus, our method performs uniformly with respect to both aggregate metrics of an entire capture as well as bait-to-bait capture efficiency. Additional Figure 8. PCR duplicates and capture efficiency for multiplexes of 48 (Additional Table 1, rows 14–15). The graph shows the individual percent of: PCR duplicates (red symbols), aligned data on target (green symbols) and aligned data on and near targeted intervals (blue symbols). The two samples with very low read counts are not shown and the rest of the samples (n=90) are ranked from lowest total read count to highest total read count. Different shapes indicate the two independent captures that were combined for sequencing. “On and near target” is defined as bases that align on or within 250 bp of target intervals after removing duplicates. “On target” is defined as bases that align directly on targeted intervals after removing duplicates. The “Individual samples” numbers correspond to the ones used in Additional Figure 3. We see an improvement in overall on/near target percentage compared to our smaller multiplexing experiments due to improving our blocking strategy by adding 1 uL of the 58 bp post-hybridization PCR primer Illumina PE 1.0 (the longer oligo listed under Post-hybridization PCR primers on Additional Table 5) to the Hybridization Capture step outlined in the Supplemental Methods. Additional Figure 9. Comparing genome-wide array position coverage to sequencing coverage. This plot was generated using the observed index TAGTATT from the set of 22–24 multiplexed captures, which achieved 1,404,000 million reads, closest to the average for all 92 samples. Blue circles: percent of targeted bases at specified coverage threshold. Red circles: percent of array positions at specified coverage thresholds. Squares: median coverage achieved when considering only positions at 3-fold coverage or greater. Triangles: mean coverage achieved when considering only positions at 3-fold coverage or greater. The close overlap of the mean and median indicate the array-based positions accurately reflect the sequencing positions with respect to the overall coverage achieved. Additional Figure 10A. Coverage achieved depends on GC content of designed baits. This plot was generated using the observed index TAGTATT from the set of 22–24 multiplexed captures, which achieved 1,404,000 million reads, closest to the average for all 92 samples. The average coverage for each bait was plotted against the% GC content of the bait with the boxplot function in R. Additional Figure 10B. Coverage achieved depends on target interval size. This plot was also generated using the observed index TAGTATT. The average coverage for each bait was plotted against the target interval size that the bait belonged to with the boxplot function in R. Higher coverage in targeted intervals ≥240 bp was observed relative to 120 bp and 180 bp intervals. While partially due to having a greater number of bases targeted by two baits in these intervals due to specifying 2X tiling frequency, we feel this is more largely a result of flanking sequence from a fragment for one bait yielding coverage for an adjacent bait. Additional Figure 11. Bioanalyzer traces of subsequent steps during indexed library preparation. Red line: individual sonicated and purified sample. Blue line: individual end-repaired and purified sample. Green line: individual adapter ligated and purified sample. The three different populations beyond 100bp may indicate fragments with 0, 1, and 2 adapters ligated. The bimodal peaks <100 bp correspond to unligated adapters. Cyan line: pre-capture, purified PCR product. Magenta line: post-capture, purified PCR product. The fact that the majority of our fragment sizes were >200 bp supports our previous conclusion that our fragments of ≥240 bp demonstrated higher total coverage due to flanking sequence from one bait contributing to total coverage for an adjacent bait (see Additional Figure 10B). Additional Figure 12. Strand bias is not adversely affecting sequencing calls. To explore whether strand bias in sequencing output was adversely affecting variant calls in the custom capture sequencing from the multiplexes of 48 individuals (Additional Table 1, rows 14–15), we summed all variant calls from all 96 individuals and compared the total number of positions (Y-axis) against the percentage of raw sequencing reads generated from the forward strand (X-axis). Each column represents a “bin” of a given number of variants with a given percentage of raw reads generated from the forward strand. We performed this comparison for all called variants, known and novel, at all MAFs (Panel A; n = 148,972), against the gold standard dataset of all positions called homozygous wild-type by GWA (Panel B; n = 308,842). The two profiles are very similar and the central peak indicates that the majority of positions had similar amounts of raw data from each strand, suggesting that strand bias is not adversely affecting sequencing calls for all positions not called homozygous wild-type by GWA. Additional Table 6. The percentage of misattributed indexes is improved by an additional purification of unligated adapter sequences. To test whether unligated adapters were the source of misattributed indexes in the multiplexes of 48 (Additional Table 1, rows 14–15), we implemented an additional purification with 9.45% PEG and 1.25M NaCl after pooling the adapter-ligated samples and prior to the subsequent PCR enrichment to more thoroughly remove any unligated adapters in the multiplexes of 30–32 (Additional Table 1, rows 11–13). We focused on positions with a MAF around 50% (range 39-59%), which provides more positions to query (215,580 reads mapped to wild type alleles, 3,828 reads mapped to variant alleles) where multiple individuals should possess one of three distinct genotypes (AA, AB, BB). Of the 96 individuals captured in two multiplexes of 48, 94 had valid array genotyping. All positions with 188 valid allele calls were binned according to the number of GWA-called variant alleles in the cohort. Of the 92 individuals captured in three multiplexes of 30–32, all 92 had valid array genotyping. Analysis was identical to the multiplexes of 48, with two samples excluded due to low coverage and one sample excluded due to having a very high mismatch rate, possibly due to genotyping performing poorly or a sample handling mishap not indicative of the method as a whole. Introducing the extra purification step improved the percentage of misattributed indexes. With the additional purification in place, we find an average of 1.8% of variant reads that are attributed to an individual who was homozygous wild type by array. At base positions where no variant alleles were called by array, 0.13% of reads erroneously contain a sequence variant, likely due to sequencing error or alignment artifacts. Subtracting this background from the 1.8% of misattributed indexes at positions with a validated variant in approximately half of all alleles yields 1.67%. We assume that this index switching is a random and stochastic process, suggesting that we only “see” half of reads with an inappropriate index because the other half would adjoin with a read having a matching genotype (e.g. instead of an index switching from a wild type read to a variant read, it switches from a wild type read to another wild type read). Thus, we conclude that with the additional purification a total of ~3.4% of reads contain misattributed indexes. Additional Figure 13. Schematic diagram demonstrating how indexed adapters may become affixed to a different source molecule. A) Primers, Y-shaped adapter-ligated fragments, and unindexed ligated adapters are mixed together prior to PCR amplification. B) After denaturation during the first PCR cycle, the reverse primer sits down on its complementary site during the annealing step. The non-indexed strand of the unligated Y-shaped adapter is not shown. C) During the extension phase of the first PCR cycle, the reverse complement strand of the template is synthesized. D) After denaturation during the second PCR cycle, the primers sit down on their complementary site during the annealing step. Additionally, any unligated adapter can sit down and serve as a primer. An indexing strand with a different index is shown. E) During the extension phase of the second PCR cycle, any individual-specific sequencing alterations can be attributed to incorrect indexes. F) After the second PCR cycle, the strand attributed to the wrong index can be amplified exponentially. Supplemental Methods. We are starting with 70-375ng of purified genomic DNA per person in the protocol, suspended in 15ul of TE. [Note: For multiplexes of 48, we have started with 350 ng, 280 ng, 210 ng, 140 ng, and 70 ng of DNA/person and achieved equivalent results in terms of percent duplication, data on target, coverage achieved, sensitivity, and specificity (data not shown).]

Format: PDF Size: 1.7MB Download file

This file can be viewed with: Adobe Acrobat Reader

Open Data