Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data
Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, Section of Molecular Genetics and Microbiology, University of Texas at Austin, Austin, TX 78712, USA
BMC Genetics 2012, 13:46 doi:10.1186/1471-2156-13-46
Published: 5 September 2012
Additional file 1:
Figure S1. Diagram of SNP discovery pipeline.
Figure S2. Numbers of Pilot 2 SNPs rediscovered correlate with ChIP-seq coverage. For each trio cell line, the percentage of Pilot 2 SNPs rediscovered with ChIP-seq data is plotted together with the percentage of the genome with at least 5X coverage from ChIP-seq.
Figure S3. Validation of de novo discovered SNPs by genomic sequencing. The top row shows examples of SNPs discovered de novo from ChIP-seq data that were also genotyped in that individual by the 1000 Genomes Pilot 2 Project. The remainder are examples of SNPs discovered de novo from ChIP-seq data but missed in the 1000 Genomes Pilot 2 set in that individual (GM cell lines) or found in ungenotyped lines (HUVEC, Progeria). The top of each panel shows the genomic DNA sequence, with the SNP at the center in bold. Chromosomal coordinates, transcription factor/histone modification, and cell line are listed below the chromatogram.
Figure S4. SNP calling in low coverage regions. (A) Location overlap and genotype overlap between CTCF ChIP-seq SNPs and Pilot 2 SNPs. Location overlap means that the SNP location and alleles match, although sometimes only one allele of a heterozygous genotype is observed in the other set. Genotype overlap refers to an exact genotype match. (B) Percent heterozygosity for CTCF ChIP-seq discovery SNPs and Pilot 2 SNPs. (C) Read number filtering increases discovery SNP heterozygosity and genotype overlap with Pilot 2 SNPs. SNPs covered by fewer than the indicated number of reads were filtered out. Blue bars represent the number of SNPs passing the filter. Red squares represent SNP heterozygosity and green triangles represent the percent genotype overlap with Pilot 2 SNPs, both on the secondary Y axis on the right.
Figure S5. Individual distribution of SNPs. G1000 SNPs and novel SNPs discovered in the indicated GM cell lines. CTCF ChIP-seq samples were categorized according to their individual distribution. ‘1’ represents SNPs found in only one of the six individuals, ‘2’ represents SNPs found in two people, and so on.
Figure S6. Pilot 2 SNP distribution around (A) CTCF and (B) RNAPII ChIP peak centers and (C) transcription start sites. (D) Conservation scores around transcription start sites. All distances are in bp.
Figure S7. CTCF allelic binding bias at Pilot 2 SNPs, plotted as in Fig. 5. The inset tables show the Spearman correlation coefficients (top) and Spearman P values (bottom).
Figure S8. CTCF allelic binding bias at discovered SNPs in Progeria and FB8470 (normal) fibroblast cells.
Table S1. Description of apparent errors. This table lists all 6 discrepancies that we observed between genotypes called from ChIP-seq data and our genomic Sanger sequencing validation (127 out of 133 were exactly correct). For errors 1 and 3, the ChIP-seq data recovered the alternate allele and called it homozygous, but the reference allele was apparently not observed at sufficient coverage. Errors 2 and 4 are discrepant between the ChIP-seq and Sanger genotyping, but our ChIP-seq call matched the 1000 Genomes Pilot 2 genotype. For errors 5 and 6, the ChIP-seq data called the site heterozygous and Sanger sequencing reported homozygous (similar to errors 2 and 4), but the two alleles reported by ChIP-seq correspond to the two alleles known to occur at that position (in other individuals) according to dbSNP 129.
Table S2. Indels called from ChIP-seq data overlap with 1000 Genomes Project indel calls.
Table S3. Novel SNPs found by ChIP-seq overlap with SNPs found in other individuals in the same population in the 1000 Genomes Project low coverage data.
Table S4. Overlap between biased (that is, allele-specific) SNPs discovered from ChIP-seq data and biased Pilot 2 SNPs.
Table S5. Significantly biased allele-specific CTCF binding sites within 500 bp of a GWAS SNP locus. P-val refers to the significance of the allele-specific binding bias at a heterozygous SNP.
Table S6. SNP calling from H3K4me3 and/or H3K27me3 ChIP-seq data in 17 additional cell lines (ENCODE data) as well as from RNA-seq data in GM12891 (from Toung et al., Genome Res. (2011) 21:991-8).
Format: PDF Size: 822KB
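The read number filtering described for Figure S4 (C) can be illustrated with a minimal sketch. It is not the authors' pipeline: the tuple layout, the function name `summarize_after_depth_filter`, and the choice to compute genotype overlap only over sites present in the reference set are all assumptions made for illustration.

```python
def summarize_after_depth_filter(calls, reference, min_depth):
    """Filter SNP calls by read depth, then summarize the survivors.

    calls: list of (chrom, pos, genotype, depth) tuples, where genotype is a
           two-letter string such as "AG" (heterozygous) or "AA" (homozygous).
           This layout is hypothetical, not a format used by the paper.
    reference: dict mapping (chrom, pos) -> genotype string, standing in for
           an external call set such as the Pilot 2 SNPs.
    min_depth: calls covered by fewer reads than this are dropped.
    """
    kept = [c for c in calls if c[3] >= min_depth]
    if not kept:
        return {"n": 0, "pct_het": None, "pct_genotype_overlap": None}

    # A call is heterozygous when its two alleles differ.
    n_het = sum(1 for _, _, gt, _ in kept if gt[0] != gt[1])

    # Assumption: genotype overlap is measured over sites shared with the
    # reference set, and allele order within a genotype is ignored.
    shared = [c for c in kept if (c[0], c[1]) in reference]
    n_match = sum(
        1 for chrom, pos, gt, _ in shared
        if sorted(gt) == sorted(reference[(chrom, pos)])
    )
    pct_overlap = 100.0 * n_match / len(shared) if shared else None

    return {
        "n": len(kept),
        "pct_het": 100.0 * n_het / len(kept),
        "pct_genotype_overlap": pct_overlap,
    }


# Toy data, invented purely to exercise the function.
toy_calls = [
    ("chr1", 100, "AG", 12),
    ("chr1", 200, "AA", 3),   # low depth: one allele of a het site missed
    ("chr1", 300, "CT", 8),
    ("chr2", 50, "GG", 20),
]
toy_reference = {
    ("chr1", 100): "AG",
    ("chr1", 200): "AG",
    ("chr1", 300): "CC",
    ("chr2", 50): "GG",
}
```

Calling the function at increasing `min_depth` values mimics the sweep along the X axis of panel (C): the number of SNPs passing the filter falls while the summary percentages are recomputed on the remaining calls.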
Additional file 2:
CTCF allele-specific binding at SNPs discovered de novo from CTCF ChIP-seq. Only SNPs with an FDR-corrected bias P value of less than 0.05 are included. Each tab contains information for one individual.
Format: XLSX Size: 137KB
Additional file 3:
CTCF allele-specific binding at Pilot 2 SNPs. Only SNPs with an FDR-corrected bias P value of less than 0.05 are included. Each tab contains information for one individual.
Format: XLSX Size: 204KB
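The inclusion criterion used for Additional files 2 and 3 (an FDR-corrected bias P value below 0.05) can be sketched as follows. This listing does not say which statistical test or FDR procedure was used, so the sketch assumes a two-sided exact binomial test of reference versus alternate read counts against an expected ratio of 0.5, followed by Benjamini-Hochberg adjustment. The function names and the input layout are hypothetical.

```python
from math import comb


def binomial_bias_p(ref_reads, alt_reads):
    """Two-sided exact binomial P value against a 50:50 allele ratio.

    Assumption: the null hypothesis is equal read support for both alleles.
    With p = 0.5 the distribution is symmetric, so the two-sided P value is
    twice the smaller tail, capped at 1.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    k = min(ref_reads, alt_reads)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def biased_snps(allele_counts, alpha=0.05):
    """Return IDs of SNPs whose adjusted bias P value is below alpha.

    allele_counts: dict mapping a SNP identifier to (ref_reads, alt_reads).
    """
    ids = list(allele_counts)
    raw = [binomial_bias_p(*allele_counts[s]) for s in ids]
    adj = benjamini_hochberg(raw)
    return [s for s, q in zip(ids, adj) if q < alpha]


# Toy read counts, invented for illustration only.
toy_counts = {"snpA": (30, 2), "snpB": (10, 10), "snpC": (9, 11)}
```

With the toy counts, `biased_snps(toy_counts)` keeps only the strongly skewed site. Applying the correction per individual would mirror the one-tab-per-individual layout of the spreadsheets, although whether the correction was actually applied per individual is not stated here.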