This article is part of the supplement: Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism

Open Access Proceedings

Investigation of altering single-nucleotide polymorphism density on the power to detect trait loci and frequency of false positive in nonparametric linkage analyses of qualitative traits

Alison P Klein1,2*, Ya-Yu Tsai3, Priya Duggal1, Elizabeth M Gillanders1, Michael Barnhart3, Rasika A Mathias1, Ian P Dusenberry1, Amy Turiff1, Peter S Chines4, Janet Goldstein3, Robert Wojciechowski1, Wayne Hening5, Elizabeth W Pugh3 and Joan E Bailey-Wilson1

Author Affiliations

1 Inherited Disease Research Branch, NHGRI/NIH, Baltimore, MD, USA

2 Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA

3 CIDR, Johns Hopkins Medical School, Baltimore, MD, USA

4 Genome Technology Branch, NHGRI/NIH, Bethesda, MD, USA

5 Department of Neurology, UMDNJ-RW Johnson Medical School, New Brunswick, NJ, USA

BMC Genetics 2005, 6(Suppl 1):S20  doi:10.1186/1471-2156-6-S1-S20

Published: 30 December 2005

© 2005 Klein et al; licensee BioMed Central Ltd

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Genome-wide linkage analysis using microsatellite markers has been successful in the identification of numerous Mendelian and complex disease loci. The recent availability of high-density single-nucleotide polymorphism (SNP) maps provides a potentially more powerful option. Using the simulated and Collaborative Study on the Genetics of Alcoholism (COGA) datasets from the Genetic Analysis Workshop 14 (GAW14), we examined how altering the density of SNP marker sets impacted the overall information content, the power to detect trait loci, and the number of false positive results. For the simulated data we used SNP maps with densities of 0.3 cM, 1 cM, 2 cM, and 3 cM. For the COGA data we combined the marker sets from Illumina and Affymetrix to create a map with an average density of 0.25 cM and then, using a sub-sample of these markers, created maps with densities of 0.3 cM, 0.6 cM, 1 cM, 2 cM, and 3 cM. For each marker set, multipoint linkage analysis using MERLIN was performed for both dominant and recessive traits derived from marker loci. Our results showed that information content increased with increased map density. For the homogeneous, completely penetrant traits we created, there was only a modest difference in ability to detect trait loci. Additionally, as map density increased there was only a slight increase in the number of false positive results when there was linkage disequilibrium (LD) between markers. The presence of LD between markers may have led to an increased number of false positive regions, but no clear relationship between regions of high LD and locations of false positive linkage signals was observed.

Background

Genome-wide linkage analysis using microsatellite markers has been successful in the identification of numerous Mendelian and complex disease loci. Recently available high-density single-nucleotide polymorphism (SNP) maps theoretically provide greater information content (IC), which should help to both identify and narrow linkage regions. This is supported by a few published reports comparing genome-wide linkage analysis using microsatellites to studies of the same dataset using dense SNP maps [1,2]. Yet questions remain about the optimal density of SNP marker sets for linkage studies. Additionally, current algorithms for linkage analysis assume that adjacent markers are in linkage equilibrium. However, there may be significant linkage disequilibrium (LD) between adjacent markers in dense SNP marker sets, which can lead to false positive results [3]. To explore these issues we used the simulated and Collaborative Study on the Genetics of Alcoholism (COGA) datasets to examine how altering the SNP density impacted the overall IC, the power to detect trait loci, and the number of false positive results. We compared these results to analyses performed using microsatellite markers.

Methods

Simulated data

Analyses were performed (separately for each population and replicate) using all replicates of the Aiputo, Danaca, and Karanga populations. The full marker sets for both the MS (7.5 cM) and SNP (3 cM) maps were used. Additional fine mapping markers were purchased for chromosomes 8 and 9 (packets 400–406 and 416–419) to increase the SNP density to 0.3 cM. We knew the simulation answers when we performed these analyses.

Trait definition (simulated)

Dominant or recessive traits were created using these marker loci: B08T8044, B08T8045, B08T8050, and B08T8051. Affection status for a dominant trait was defined as individuals with ≥ 1 copy of allele 1 at the marker and for a recessive trait as individuals with 2 copies of allele 1.
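The trait definition above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function name `affection_status` and the representation of a genotype as a pair of allele codes are assumptions made for the example.

```python
def affection_status(genotype, model, risk_allele=1):
    """Return True if an individual is affected under the given model.

    genotype    -- pair of allele codes at the trait-defining marker, e.g. (1, 2)
    model       -- "dominant" (>= 1 copy of the risk allele) or
                   "recessive" (2 copies of the risk allele)
    risk_allele -- allele code that defines the trait (allele 1 in the text)
    """
    copies = sum(1 for allele in genotype if allele == risk_allele)
    if model == "dominant":
        return copies >= 1
    if model == "recessive":
        return copies == 2
    raise ValueError(f"unknown model: {model!r}")


# A heterozygote is affected under the dominant model only.
print(affection_status((1, 2), "dominant"))
print(affection_status((1, 2), "recessive"))
```

Because the traits are built directly from marker genotypes, every carrier of the required genotype is affected and no non-carrier is, which is what makes them homogeneous and completely penetrant.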

COGA data

Using a perl script, we created an interpolated genetic map that used MS markers from the deCode map and SNPs from both Illumina and Affymetrix. For each SNP, 2 MS markers from the deCode map were identified that flanked the SNP using the physical positions of these markers obtained from sequence build 34. From the physical and genetic position of the 2 flanking microsatellites and assuming a linear interpolation between the markers, the genetic position of the SNP was determined. Any MS or SNP without a physical position was removed. If SNP markers mapped to the same genetic location, the SNP with the largest physical location was kept.
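The interpolation step can be illustrated with a short sketch. The original work used a perl script that is not shown here; the Python below is a hypothetical re-expression of the described logic, and the function names, tuple layouts, and example coordinates are all assumptions.

```python
def interpolate_cm(snp_bp, left, right):
    """Linearly interpolate the genetic position (cM) of a SNP.

    snp_bp -- physical position of the SNP in base pairs
    left   -- (physical_bp, genetic_cM) of the flanking microsatellite below the SNP
    right  -- (physical_bp, genetic_cM) of the flanking microsatellite above the SNP
    """
    left_bp, left_cm = left
    right_bp, right_cm = right
    if not (left_bp < right_bp and left_bp <= snp_bp <= right_bp):
        raise ValueError("SNP must lie between two distinct flanking markers")
    fraction = (snp_bp - left_bp) / (right_bp - left_bp)
    return left_cm + fraction * (right_cm - left_cm)


def drop_duplicate_positions(snps):
    """Keep one SNP per genetic position: the one with the largest physical position.

    snps -- iterable of (name, physical_bp, genetic_cM) tuples
    """
    best = {}
    for name, bp, cm in snps:
        if cm not in best or bp > best[cm][1]:
            best[cm] = (name, bp, cm)
    return sorted(best.values(), key=lambda s: s[2])


# A SNP halfway (physically) between microsatellites at 10 cM and 12 cM.
print(interpolate_cm(1_500_000, (1_000_000, 10.0), (2_000_000, 12.0)))
```

Linear interpolation assumes a constant recombination rate between the two flanking microsatellites, so the inferred SNP positions are only as good as that assumption and the underlying physical map.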

Trait definition (COGA)

The following markers (and risk alleles) were used to create a dominant and/or a recessive trait: rs0041510 (allele 2), tsc2832191 (allele 1), tsc0061481 (allele 1). To avoid errors due to differences in allele frequencies between ethnic groups, analysis was limited to the white/non-Hispanic families, which comprised the largest ethnic subgroup.

Creation of SNP maps

Using a perl script, we selected a subset of the SNP markers to create maps that were less dense. Our goal was to select markers with desired inter-marker distances. To avoid tight clusters of markers, we moved at least the desired distance minus 10% of that distance before another marker was selected. If there were multiple markers within ± 10% of the desired distance, the marker with the major allele frequency (MAF) closest to 0.5 was selected. For example, for the 0.3-cM map, markers were forced to be at least 0.27 cM apart, and if there were multiple markers located between 0.27 cM and 0.33 cM from the last marker, the marker with the MAF closest to 0.5 was selected.
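A minimal Python sketch of this selection rule follows. It is not the authors' perl script. Two behaviours are assumptions made for the example because the text does not specify them: the scan starts from the first marker on the chromosome, and when no marker falls inside the ±10% window the next marker beyond the minimum distance is taken. The name `thin_markers` and the tuple layout are likewise hypothetical.

```python
def thin_markers(markers, spacing, tolerance=0.10):
    """Select a sparser marker subset with roughly `spacing` cM between markers.

    markers   -- list of (name, position_cM, maf) tuples sorted by position
    spacing   -- desired inter-marker distance in cM (e.g. 0.3)
    tolerance -- window half-width as a fraction of spacing (10% in the text)
    """
    if not markers:
        return []
    min_gap = spacing * (1 - tolerance)
    max_gap = spacing * (1 + tolerance)
    selected = [markers[0]]  # assumption: start from the first marker
    i = 1
    while i < len(markers):
        last_cm = selected[-1][1]
        # Skip markers that are closer than the minimum allowed distance.
        while i < len(markers) and markers[i][1] - last_cm < min_gap:
            i += 1
        if i == len(markers):
            break
        # Collect candidates inside the +/- tolerance window.
        j = i
        while j < len(markers) and markers[j][1] - last_cm <= max_gap:
            j += 1
        if j > i:
            # Several candidates: prefer the allele frequency closest to 0.5.
            pick = min(range(i, j), key=lambda k: abs(markers[k][2] - 0.5))
        else:
            # Assumption: no candidate in the window, take the next marker.
            pick = i
        selected.append(markers[pick])
        i = pick + 1
    return selected


example = [("A", 0.00, 0.60), ("B", 0.10, 0.50), ("C", 0.28, 0.90),
           ("D", 0.31, 0.55), ("E", 0.40, 0.50), ("F", 0.60, 0.70),
           ("G", 0.95, 0.80)]
print([name for name, _, _ in thin_markers(example, spacing=0.3)])
```

In the example, markers C and D both fall in the 0.27 to 0.33 cM window after marker A, and D is chosen because its allele frequency is closer to 0.5.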

Statistical analysis

We used the analysis program MERLIN for all linkage analyses [4]. Allele frequencies were estimated from all founders. Kong and Cox LOD scores [5] and the associated p-values for Whittemore and Halpern's NPLAll [6] statistic were used for the analysis of qualitative traits. Entropy was used as the measure of IC. Multipoint evaluation was performed at each of the marker loci (between-marker evaluations were not performed). For the evaluation of power and type I error we used 4 standard p-value thresholds (0.05, 0.01, 0.001, and 0.0001) and 2 Lander-Kruglyak [7] genome-wide significance levels. We calculated power as the number of replicates with a p-value less than the threshold within a 20-cM region (10 cM in either direction) of the trait loci. To assess the frequency of false positive results, we counted the number of regions where a p-value less than the above-mentioned cut-off occurred on chromosomes not containing the trait loci. To ensure that adjacent markers with p-values below the given level were not counted as multiple false positive results, a region with a p-value greater than or equal to 0.2 was required to occur between two false positive regions.
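The two counting rules described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function names and the input layouts (p-values ordered by map position along one chromosome, and one list of (position, p-value) pairs per replicate) are assumptions.

```python
def count_false_positive_regions(p_values, threshold, reset=0.2):
    """Count distinct regions with p < threshold along one unlinked chromosome.

    p_values  -- p-values at successive marker positions, in map order
    threshold -- significance level being evaluated (e.g. 0.05)
    reset     -- a p-value >= reset must occur before a new region is counted
    """
    regions = 0
    in_region = False
    for p in p_values:
        if p < threshold:
            if not in_region:
                regions += 1
                in_region = True
        elif p >= reset:
            in_region = False
        # threshold <= p < reset: stay in the current state
    return regions


def replicates_detecting(replicates, trait_cm, threshold, half_width=10.0):
    """Count replicates with p < threshold within half_width cM of the trait locus.

    replicates -- list of scans, each a list of (position_cM, p_value) pairs
    """
    hits = 0
    for scan in replicates:
        if any(abs(cm - trait_cm) <= half_width and p < threshold
               for cm, p in scan):
            hits += 1
    return hits


# Two signals separated only by p = 0.1 count once; p = 0.3 starts a new region.
print(count_false_positive_regions([0.5, 0.03, 0.1, 0.02, 0.3, 0.04], 0.05))
```

Dividing the count returned by `replicates_detecting` by the number of replicates gives power as a proportion.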

Results

Table 1 presents the results of our comparison of the IC for the various map densities. In the simulated data, the average IC of the MS map was 0.934. The 3-cM SNP map carried less information (0.833) than the MS map. Conversely, the mean IC was highest in the very dense (0.3 cM) SNP map (0.986), a modest increase over the MS map. In the COGA dataset IC increased with increasing map density and was lowest in the MS marker set. The overall IC was somewhat lower in the COGA data; this could be due in part to the presence of missing data in the COGA dataset or to overall marker heterozygosity. Note that the MS map in the COGA dataset (13.6 cM) is less dense than the MS map in the simulated dataset (7.5 cM).

Table 1. Information content

There was a modest increase in power with increasing SNP map density in the simulated data (Table 2). Power was greatest for the 0.3-cM density. Power for the MS map seemed to fall between that of the 1-cM and 3-cM SNP maps. Overall power was quite low when we used a genome-wide significance level of 0.000049. However, in the COGA dataset (Table 3) there were less consistent trends in the ability to detect the trait loci as map density increased. In fact, the denser maps sometimes gave smaller LOD scores than the less dense maps (e.g., Drs0041510). This could be due to errors in marker order or inter-marker distance for the denser map sets. It is important to note that our created traits were homogeneous and had complete penetrance, and thus overall power was very high, possibly masking any true variations in power due to differences in map density. For all map sets, disease frequencies had a large impact on power. Additionally, because we only performed analysis at the marker loci and not between marker loci, we cannot evaluate whether denser maps yielded smaller confidence intervals for the linkage peaks, because 1-LOD confidence intervals are dependent upon the density of analytic evaluations.

Table 2. Power in simulated data

Table 3. Power in COGA data

The number of false positive linkages (p-value below a given level in a region unlinked to the trait loci) for the simulated data is in Tables 4 and 5. When we compare the results for the 3-cM SNP map to the MS map or the 0.3 cM to the 1-cM SNP map, the number of false positive results remains similar. Although the 0.3-cM map has a slight increase in the number of false positive results compared to the 1-cM map, it is hard to interpret this because such a dense map was only available in one 18-cM region. We also examined the number of false positive regions for each of the traits in the COGA dataset (Table 6) by tabulating significant linkages on 18 unlinked chromosomes. Overall, the number of false positive regions at the 0.05 level was greater in the combined 0.25-cM SNP map than it was in the less dense maps. At the more stringent p-value levels there were only a few false positive results, and no false positives were observed for any of the traits at genome-wide significant p-values (0.000049) [7].

Table 4. Type I error count in simulated data for full dataset

Table 5. Type I error count in densely mapped simulated data

Table 6. Type I error in COGA data

Conclusion

Overall, IC was higher for the dense SNP maps than for the less dense SNP and MS maps. In the simulated data, there was a modest increase in power with increasing SNP map density. However, in the COGA data, no consistent trends were observed in our ability to detect trait loci with increasing map density. There was variation in the LOD scores across maps, with denser maps sometimes yielding lower LOD scores. This could be due to errors in map order and supports the need for precise genetic maps when using dense SNP maps for linkage. Unsurprisingly, power was dependent on disease prevalence for these homogeneous, completely penetrant traits.

In the simulated data, in which there was no significant LD between markers, the number of false positives did not increase with increasing map density. In the COGA data, more false positives were observed for the densest map set, 0.25 cM, in which there was significant intermarker LD. Huang et al. [3] reported that the presence of intermarker LD caused an increase in false positives, particularly when there is missing parental data. This is of particular concern because others have reported that SNPs are more powerful than microsatellites when there is missing parental data. To examine this, we calculated the LD between all SNPs up to 500 kb apart. Twenty-one percent of all pairwise SNPs had a D' > 0.70 (high LD). Of those SNPs with a D' > 0.70, 89% were <200 kb apart, 9% were 200–400 kb apart and 2% >400 kb apart. The LD between SNPs diminished as distance increased, suggesting maps with an average marker distance >200 kb would have limited intermarker LD. Comprehensive review of the locations of all type I errors observed for two of these traits (created from marker tsc006148 on chromosome 13) showed that while 90% of these regions contained markers exhibiting LD, the LD patterns in these regions did not differ markedly from the LD on the remainder of the chromosomes. Interestingly, 20% of the false positives occurred at the telomeres of chromosomes. While some of the increases in numbers of type I errors could be due to increased intermarker LD in the densest maps, they could also be caused by the fact that more evaluations of linkage were performed for the dense maps, since we evaluated linkage at each marker location and did not perform any intermarker evaluations. Thus, the densest map had the largest number of linkage tests performed (see Table 1), so increased type I errors could be due to LD or to increased tests.
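For reference, the D' statistic used above to classify SNP pairs as high LD (D' > 0.70) can be computed from haplotype and allele frequencies with the standard normalisation of Lewontin's D. The sketch below assumes the two-locus haplotype frequency is already known; the text does not say how haplotype frequencies were estimated in this study, and the function name and arguments are hypothetical.

```python
def d_prime(p_ab, p_a, p_b):
    """Absolute D' between two biallelic markers.

    p_ab -- frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a  -- frequency of allele A at SNP 1
    p_b  -- frequency of allele B at SNP 2
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("D' is undefined when a marker is monomorphic")
    return abs(d) / d_max


# Complete LD: allele A always travels with allele B.
print(d_prime(p_ab=0.5, p_a=0.5, p_b=0.5))
```

Under this definition a pair would be counted as high LD whenever `d_prime(...) > 0.70`.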

Abbreviations

COGA: Collaborative Study on the Genetics of Alcoholism

GAW14: Genetic Analysis Workshop 14

IC: Information content

LD: Linkage disequilibrium

MAF: Major allele frequency

MS: Microsatellite

SNP: Single-nucleotide polymorphism

Acknowledgements

We thank Robert Nussbaum for helpful discussions concerning this study.

References

  1. John S, Shephard N, Liu G, Zeggini E, Cao M, Chen W, Vasavda N, Mills T, Barton A, Hinks A, Eyre S, Jones KW, Ollier W, Silman A, Gibson N, Worthington J, Kennedy GC: Whole-genome scan in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites.

    Am J Hum Genet 2004, 75:54-64.

  2. Schaid DJ, Guenther JC, Christensen GB, Hebbring S, Rosenow C, Hilker CA, McDonnell SK, Cunningham JM, Slager SL, Blute ML, Thibodeau SN: Comparison of microsatellites versus single-nucleotide polymorphisms in a genome linkage screen for prostate cancer-susceptibility loci.

    Am J Hum Genet 2004, 75:948-965.

  3. Huang Q, Shete S, Amos C: Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis.

    Am J Hum Genet 2004, 75:1106-1112.

  4. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: MERLIN – rapid analysis of dense genetic maps using sparse gene flow trees.

    Nat Genet 2002, 30:97-101.

  5. Kong A, Cox NJ: Allele-sharing models: LOD scores and accurate linkage tests.

    Am J Hum Genet 1997, 61:1179-1188.

  6. Whittemore AS, Halpern J: A class of tests for linkage using affected pedigree members.

    Biometrics 1994, 50:118-127.

  7. Lander E, Kruglyak L: Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results.

    Nat Genet 1995, 11:241-247.