Open Access Methodology article

Discovery of novel variants in genotyping arrays improves genotype retention and reduces ascertainment bias

John P Didion1,2,3, Hyuna Yang4, Keith Sheppard5, Chen-Ping Fu6, Leonard McMillan6, Fernando Pardo-Manuel de Villena1,2,3* and Gary A Churchill5*

Author Affiliations

1 Department of Genetics, University of North Carolina at Chapel Hill, CB 7264, Chapel Hill, North Carolina, 27599-7264, USA

2 Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, CB 7295, Chapel Hill, North Carolina, 27599-7295, USA

3 Carolina Center for Genome Science, University of North Carolina at Chapel Hill, CB 7264, Chapel Hill, North Carolina, 27599-7264, USA

4 Department of Biostatistics and Bioinformatics, Duke University Medical Center, Box 2721, Durham, NC, 27710, USA

5 Center for Genome Dynamics, The Jackson Laboratory, 600 Main Street, Bar Harbor, Maine, 04609, USA

6 Department of Computer Science, University of North Carolina at Chapel Hill, CB 3175, Chapel Hill, North Carolina, 27599-3175, USA

BMC Genomics 2012, 13:34  doi:10.1186/1471-2164-13-34

Published: 19 January 2012

Additional files

Additional file 1:

Description of 351 mouse samples. Inbred samples used in the VINO analysis are identified in column E. CEL ID corresponds to the name of the Affymetrix CEL file containing raw intensity data, which are available for download [11]. Genetic distance is calculated as the fraction of non-reference allele calls out of all genotype calls.

Format: XLS Size: 116KB

Open Data
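
The genetic distance defined for additional file 1 can be illustrated with a short sketch. It assumes calls are encoded with the letters A, B, H and N used elsewhere in these descriptions, that B and H both count as non-reference, and that N (no call) is excluded from the denominator. None of these choices is confirmed by the description itself, and the function name is hypothetical.

```python
from collections import Counter


def genetic_distance(calls, non_reference=("B", "H"), no_call="N"):
    """Fraction of non-reference calls among all genotype calls.

    Assumptions (not stated in the source): B and H are treated as
    non-reference, and N (no call) is left out of the denominator.
    """
    counts = Counter(calls)
    # Denominator: every call that is not a no-call.
    called = sum(n for call, n in counts.items() if call != no_call)
    if called == 0:
        raise ValueError("sample has no genotype calls")
    # Numerator: calls that differ from the reference (A) allele.
    non_ref = sum(counts[call] for call in non_reference)
    return non_ref / called


# Toy sample with two A calls, one B, one H and one no-call.
distance = genetic_distance(["A", "A", "B", "H", "N"])
```

Counting N calls in the denominator instead would lower the distance for samples with many no-calls, so the choice matters most for the divergent strains.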

Additional file 2:

Non-homozygous genotype call rates increase with divergence from the reference genome. A) Genetic distance from the mouse reference genome for 143 laboratory inbred strains (additional file 1). Each strain is shown as a vertical tick mark. Strains are grouped according to their origin and arranged left-to-right in increasing order of genetic distance from the reference. Genetic distance is computed as the fraction of non-reference (non-A allele) genotype calls. B) Genotype calls for each strain. For each strain, the number of SNP probe sets assigned to each of the four possible calls (A, B, H or N) is shown as four points of different colors that sum to 526363 SNP probe sets.

Format: PDF Size: 489KB

Additional file 3:

Summary of sequencing of predicted VINOs. A) Sequencing results for 15 SNPs with samples having predicted OTVs. Forward and reverse strands are shown aligned and the target base is shown in dark black. Each SNP has a different color that corresponds to the mismatches shown in the V1, V2 and V3 columns. B) VINO prediction accuracy. An unrecognized SNP is a probe with an OTV that was not predicted to be a VINO. C) Samples sequenced for each SNP. Colors indicate concordant prediction (red, blue and green), incorrect VINO prediction (yellow) or unrecognized SNP.

Format: PDF Size: 131KB

Additional file 4:

Overall concordance of MouseDivGeno calls with events observed in the Sanger data. MouseDivGeno genotypes for 14 Sanger strains, classified by the type of event(s) observed in the Sanger data underlying the probe sets. All hybridizations: Both strands affected by the event, or only one strand was affected and the other strand was excluded due to non-alignment; One strand only: Both strands included, but only one strand affected; central OTV: Off-target variant in the center 15-19 bp; edge OTV: Off-target variant in the three bp at either edge of the probe; Inaccessible: SNP falls within an inaccessible region of the Sanger sequence; RFLP 1-1.5K: An RFLP that increases the minimum fragment size to between 1 kb and 1.5 kb; RFLP > 1.5 k: An RFLP that increases the minimum fragment size to greater than 1.5 kb; Cut in Probe: An RFLP that introduces a cut site within the probe sequence.

Format: PDF Size: 46KB

Additional file 5:

Genotype calls by ALCHEMY and BRLMM-P 2D for probes called VINO by MouseDivGeno despite lack of evidence in the Sanger data. ALCHEMY and BRLMM-P 2D call correct genotypes at a much-reduced rate for the 7073 probe sets for which MouseDivGeno called a VINO with no corresponding evidence in the Sanger data.

Format: PDF Size: 39KB

Additional file 6:

Per-strain concordance of MouseDivGeno calls with events observed in the Sanger data. MouseDivGeno genotypes for 14 Sanger strains, classified by the type of event(s) observed in the Sanger data underlying the probe sets. Event descriptions are the same as for additional file 4.

Format: XLS Size: 85KB

Additional file 7:

Observed vs. predicted genotype calls in (C57BL/6JxCAST/EiJ)F1, grouped by OTV position. Genotype calls in (C57BL/6JxCAST/EiJ)F1 are categorized by whether they are concordant (first panel) or discordant (remaining panels), the observed vs. expected genotypes, and the position of the OTV (if any) within the probe set. F1 genotypes are predicted based on CAST/EiJ genotypes, as C57BL/6J is always expected to be AA homozygous.

Format: PDF Size: 43KB

Additional file 8:

MouseDivGeno identifies population-specific VINOs in human samples. Contrast plots of 9 VINOs identified in HapMap 3 data. Samples in low-intensity clusters are colored by population [20]. Most VINOs are specific to one population or a small number of related populations.

Format: PDF Size: 335KB

Additional file 9:

Concordance between MouseDivGeno calls and 1000 Genomes Project data. A) Concordance of MouseDivGeno and 1000 Genomes Project sequencing calls for 54 SNPs. B) Breakdown of genotype calls vs. genotypes observed from sequencing data. C) MouseDivGeno VINO calling rate.

Format: PDF Size: 98KB

Additional file 10:

Fraction of VINO calls in each HapMap population. Each human SNP analyzed in this study is divided into population groups, and the fraction of VINOs called by MouseDivGeno is shown. CEU: Caucasians of European descent from Utah; CHB: Han Chinese from Beijing; JPT: Japanese from Tokyo; YRI: Yoruba in Ibadan, Nigeria.

Format: PDF Size: 17KB

Additional file 11:

The distance between consecutive SNPs follows a geometric distribution. Histogram of distance between consecutive SNPs in 14 Sanger strains using a bin size of 12 bp. Distances greater than 300 bp are combined in the right-most bin.

Format: PDF Size: 33KB
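
The binning described for additional file 11 can be sketched as follows. This is only an illustration under stated assumptions: bins are taken to cover 1-12 bp, 13-24 bp, and so on up to 289-300 bp, with one overflow bin for distances above 300 bp, and the function names are hypothetical. The second function gives the bin probabilities expected if distances were exactly geometric with a per-base SNP probability `p`, a parameter introduced here for illustration.

```python
def bin_snp_distances(positions, bin_size=12, max_distance=300):
    """Histogram of distances between consecutive SNP positions.

    Returns max_distance // bin_size regular bins plus one overflow bin
    for distances greater than max_distance (assumed bin boundaries).
    """
    positions = sorted(positions)
    n_bins = max_distance // bin_size
    counts = [0] * (n_bins + 1)
    for prev, cur in zip(positions, positions[1:]):
        distance = cur - prev
        if distance <= 0:
            continue  # skip duplicate positions
        if distance > max_distance:
            counts[n_bins] += 1  # right-most (overflow) bin
        else:
            counts[(distance - 1) // bin_size] += 1
    return counts


def geometric_bin_probabilities(p, bin_size=12, max_distance=300):
    """Expected bin probabilities when P(D = d) = (1 - p)**(d - 1) * p."""
    q = 1.0 - p
    n_bins = max_distance // bin_size
    # P(lo <= D <= hi) = q**(lo - 1) - q**hi, with lo = i * bin_size + 1.
    probs = [q ** (i * bin_size) - q ** ((i + 1) * bin_size)
             for i in range(n_bins)]
    probs.append(q ** max_distance)  # P(D > max_distance)
    return probs


# Toy positions giving consecutive distances of 5, 12, 13 and 370 bp.
counts = bin_snp_distances([100, 105, 117, 130, 500])
expected = geometric_bin_probabilities(0.01)
```

Comparing the observed bin fractions with `expected` for a fitted `p` is one simple way to check how closely the data follow a geometric distribution.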

Additional file 12:

Genotype may be resolved for the target position in some VINOs. Two examples of SNP probe sets (from the set of VINOs verified by direct sequencing, see additional file 3) for which there are two different low-intensity clusters (red circles) differentiated by the genotype at the target position. A) SNP JAX00258870, for which the low-intensity cluster V1 (RBF/DnJ, TIRANO/EiJ, ZALENDE/EiJ) is homozygous for the G allele at its target SNP, and the low-intensity cluster V2 (BXSB/MpJ and SB/LeJ) is homozygous for the A allele. B) SNP JAX00442587, for which the low-intensity cluster V3 (JF1/Ms, MSM/Ms) is homozygous for the G allele at its target SNP, and the low-intensity cluster V4 (DIK) is homozygous for the A allele.

Format: PDF Size: 43KB

Additional file 13:

VINOs can be used to identify structural variation. A region of chromosome 12 (approx. 90.847-90.949 Mb) containing a deletion in strain BALB/cJ. Center: sequencing coverage map created from the Sanger data. Each red tick represents a SNP on the Mouse Diversity Array. Top and bottom: contrast plots of intensities for consecutive SNPs. BALB/cJ is highlighted as a red circle, and is located in the low-intensity cluster for the range corresponding to low/no coverage in the Sanger data.

Format: PDF Size: 325KB

Additional file 14:

Summary of unaligned probe sets. Probe-set sequences were aligned to the imputed genomes for each of 14 Sanger strains using BWA. The fraction of non-aligning probe sets is shown. Well-performing probe sets are those included in the present study, while excluded probe sets were removed due to poor performance across the 351 samples in this study. Excluded probe sets are an order of magnitude more likely to be non-aligning.

Format: PDF Size: 39KB