Influence of genotyping error in linkage mapping for complex traits – an analytic study
Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Postzone S-05-P, PO Box 9600, 2300 RC Leiden, The Netherlands
BMC Genetics 2008, 9:57 doi:10.1186/1471-2156-9-57
The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2156/9/57
© 2008 Lebrec et al; licensee BioMed Central Ltd.

Abstract

Background

Despite the current trend towards large epidemiological studies of unrelated individuals, linkage studies in families are still widely used as tools for disease gene mapping. The use of single-nucleotide polymorphism (SNP) array technology in genotyping of family data has the potential to provide more informative linkage data. Nevertheless, SNP array data are not immune to genotyping error which, as has been suggested in the past, could dramatically affect the evidence for linkage, especially in selective designs such as affected sib pair (ASP) designs. The influence of genotyping error on selective designs for continuous traits has not yet been assessed.

Results

We use the identity-by-descent (IBD) regression-based paradigm for linkage testing to analytically quantify the effect of simple genotyping error models under specific selection schemes for sibling pairs. We show, for example, that in extremely concordant (EC) designs, genotyping error leads to decreased power, whereas it leads to increased type I error in extremely discordant (ED) designs. Perhaps surprisingly, the effect of genotyping error on inference is most severe in designs where selection is least extreme. We suggest a genomic control for genotyping errors via a simple modification of the intercept in the regression for linkage.

Conclusion

This study extends earlier findings: genotyping error can substantially affect type I error and power in selective designs for continuous traits. Designs involving both EC and ED sib pairs are fairly immune to genotyping error. When those designs are not feasible, the simple genomic control strategy that we suggest offers the potential to deliver more robust inference, especially if genotyping is carried out by SNP array technology.

Background

Linkage analysis of family data has been extensively used in the past in the search for genetic determinants.
Nowadays, investigators favor large epidemiological studies of unrelated individuals; however, several family datasets are currently being re-analyzed and/or pooled (e.g. [1]). The persistence of interest in linkage is partly driven by the advent of single-nucleotide polymorphism (SNP) array genotyping technology in the field; indeed, SNP arrays hold the promise of more reliable linkage maps [2,3]. Although less prone to genotyping error than microsatellites when viewed as singlepoint markers, SNP arrays rely heavily on multipoint algorithms for accurate determination of the identical by descent (IBD) status of alleles. The gain in singlepoint reliability might therefore be cancelled out by the propagation of errors across the many SNPs required to infer IBD status.

In the search for genetic determinants of complex traits by linkage, the use of selective designs appears to be an efficient way to gain adequate power for detection of typically small gene effects. A few authors have shown by simulation that the impact of genotyping error on evidence for linkage could be particularly severe in affected sib-pair (ASP) designs [4-6], virtually masking most of the evidence for linkage. The impact of error on quantitative traits appears to be less dramatic in random samples; however, it is unclear whether the dramatic power losses seen in ASP designs also occur in selected samples for quantitative traits.

A method of choice is now emerging for the analysis of quantitative traits arising from selected sib pairs. This method is essentially a regression through the origin of excess identical by descent (IBD) sharing on a function of the trait value, whose slope is an estimate of the linkage parameter. It was first proposed by Sham et al. [7] and turns out to be equivalent to a score test [8]. In a numerical comparison of methods for selected samples, Szatkiewicz et al. [9] and Cuenco et al.
[10] showed that this method had good properties in finite samples for extreme proband ascertained sib-pair and discordant sib-pair designs.

By use of simple genotyping error models (population frequency error model and false homozygosity model), we show analytically what effects such error-generating processes (occurring at rate ϵ per sib pair) induce for an idealized fully informative marker. It is shown that they result in a reduction of the slope estimate (i.e. of the estimated linkage parameter) by a factor 1 - ϵ/2 [...]

Results

Genotyping error models

We consider two mechanisms for the generation of errors in marker data, namely the population frequency error model and the false homozygosity model. In both models, we consider a single marker with m alleles and further assume that at most one allelic error per sib pair can be made and that this happens with probability ϵ. This restriction to 'one error per sib pair' is just a first order approximation, for small ϵ, of a process where all four alleles would be allowed to be independently erroneous, and it does not restrict the generalizability of our results.

The population frequency error model re-assigns the erroneous allele (chosen at random among the four forming the sib-pair genotype) to one of the m possible alleles with probability equal to its population allele frequency. One mathematical advantage of this model is that the marginal distribution of alleles and genotypes is unaltered. The false homozygosity model keeps homozygotes unchanged but re-assigns heterozygotes to homozygotes, with the retained allele being one of the two original alleles, chosen with probabilities proportional to population allele frequencies. To our knowledge, false homozygosity is a common type of error: fairly rare alleles go unreported in samples. The population frequency error model provides an approximation to a process whereby alleles are misread.
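The two error models can be made concrete by enumerating, for a given sib-pair genotype, every possible single error and the allele sharing that would then be observed. The sketch below is illustrative only and is not the authors' code. It assumes an idealized fully informative marker, so that counting shared allele labels recovers IBD sharing and a misread allele is always a previously unseen one (the label `"new"` is hypothetical). For the false homozygosity model it further assumes that the affected sib is chosen uniformly and that the two original alleles are retained with equal weight. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def ibs_sharing(sib1, sib2):
    """Proportion of alleles shared by state (equals IBD for an idealized marker)."""
    shared = sum((Counter(sib1) & Counter(sib2)).values())
    return Fraction(shared, 2)


def population_frequency_outcomes(sib1, sib2):
    """One of the four alleles, chosen uniformly, is re-read as an unseen allele."""
    alleles = list(sib1) + list(sib2)
    outcomes = []
    for pos in range(4):
        changed = alleles.copy()
        changed[pos] = "new"  # assumption: never coincides with an existing label
        outcomes.append((Fraction(1, 4), tuple(changed[:2]), tuple(changed[2:])))
    return outcomes


def false_homozygosity_outcomes(sib1, sib2):
    """One heterozygous sib is read as homozygous for one of its own alleles."""
    outcomes = []
    for which, (a, b) in enumerate((sib1, sib2)):
        for kept in (a, b):  # assumption: equal weight for the two alleles
            homozygote = (kept, kept)
            pair = (homozygote, sib2) if which == 0 else (sib1, homozygote)
            outcomes.append((Fraction(1, 4), *pair))
    return outcomes


def transition(true_pair, outcomes_fn, eps):
    """Distribution of observed sharing given the true sib-pair genotype."""
    dist = Counter()
    dist[ibs_sharing(*true_pair)] += 1 - eps  # no error occurs
    for weight, s1, s2 in outcomes_fn(*true_pair):
        dist[ibs_sharing(s1, s2)] += eps * weight  # exactly one error occurs
    return dict(dist)


# Representative genotypes for true sharing 0, 1/2 and 1.
TRUE_PAIRS = {
    Fraction(0): (("i", "j"), ("k", "l")),
    Fraction(1, 2): (("i", "j"), ("i", "k")),
    Fraction(1): (("i", "j"), ("i", "j")),
}

eps = Fraction(1, 100)
for pi, pair in TRUE_PAIRS.items():
    for fn in (population_frequency_outcomes, false_homozygosity_outcomes):
        print(pi, fn.__name__, transition(pair, fn, eps))
```

Exact fractions are used so that the resulting probabilities can be compared with closed-form expressions in ϵ without rounding noise.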
Errors at the two alleles of a marker's genotype might be correlated; we do not consider this type of process in detail here, although the effect on linkage will be qualitatively the same as in the two other models. We refer the reader to Sobel et al. [11] for a detailed exposé on genotyping error mechanisms. Note that the two models that we have chosen have been used in the past in order to identify potential genotyping errors [4,11].

Impact on IBD sharing

Let us denote by π the proportion of alleles shared identical by descent (IBD) at a certain locus by two siblings. Tests for linkage are based on the IBD sharing distribution and, although errors as described earlier are made at the genotype level (G is read as Gϵ), the effect of errors on linkage is entirely mediated by the distortion of the IBD distribution (the true IBD status π of two siblings may be incorrectly inferred as πϵ). We are therefore interested in deriving the probability distribution P(πϵ|π); this is done by conditioning on both the true and observed genotypes as follows: [...]

Let us consider the case of complete information. This can be conceptualized by means of an idealized marker whose number of alleles is infinite; in particular, identity by state (IBS) status is then equivalent to IBD status. The unordered genotypes of a sib pair can be partitioned into seven exclusive classes denoted ii/ii, ii/ij, ii/jj, ii/jk, ij/ij, ij/ik and ij/kl, depending on the number of homozygous sibs in the pair and the number of distinct alleles in the sib-pair genotype. Sharing 0 alleles IBD corresponds to a sib-pair genotype of the ij/kl class. Should an error occur according to the population frequency error model, one of the four alleles would be transformed into yet another type (since the number of alleles is infinite, the probability that the new allele is read as one of i, j, k or l tends to 0); therefore the sib-pair genotype will remain in the ij/kl class and the observed IBD status πϵ will still be 0.
For the same starting genotype, an error according to the false homozygosity model produces an ii/jk class and πϵ also equals 0; therefore P(πϵ = 0|π = 0) = 1 whichever of the two genotyping error mechanisms is considered. The same line of reasoning leads to P(πϵ = 0.5|π = 0.5) = 1 - ϵ/2 and P(πϵ = 1|π = 1) = 1 - ϵ. The overall effect of genotyping error is thus to reduce the observed IBD sharing; indeed, E(πϵ|π) = (1 - ϵ/2)π and E(πϵ) = 1/2 - ϵ/4.

Impact on linkage testing

Regression-based linkage testing

We assume that the sib pair phenotypic data x = (x1, x2)' have been adjusted for any relevant covariates (e.g. sex, age, country, ...) and have been standardized so that the (known) population mean, variance and sib-sib correlation are 0, 1 and ρ respectively. Under the additive variance components model, x given IBD information π follows a bivariate normal distribution with zero mean and variance-covariance matrix given by [...] where γ ≥ 0 denotes the proportion of total variance explained by the putative locus. Under this model, an optimal testing strategy first advocated in [7] (and sometimes referred to as the optimal Haseman-Elston regression) is to regress (through the origin) excess IBD sharing π - 1/2 on a function of the trait values [...] This test turns out to be a score test for the linkage parameter γ [8] and is based upon the following approximate relation, which is valid for small locus effects [12]: [...]

Impact of genotyping error on regression

By conditioning on the true IBD sharing values, we can compute P(πϵ|x, γ, ϵ) = ∑π P(πϵ|π) P(π|x, γ), using the transition probabilities P(πϵ|π) derived earlier, while the P(π|x, γ)'s are given in [12]. This permits computation of the new regression line in the presence of genotyping error as [...] As mentioned earlier, the corresponding variance under the null hypothesis is only slightly altered. The effect of genotyping error is thus to shrink the regression line by a factor 1 - ϵ/2 [...] Note that taking γ = 0 in this formula gives the type I error rate.
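The relation P(πϵ|x) = ∑π P(πϵ|π) P(π|x) can be checked numerically. The sketch below is illustrative and not taken from the paper. It uses the transition probabilities for a fully informative marker (the π = 1 row is the one implied by E(πϵ|π) = (1 - ϵ/2)π under the one-error-per-pair restriction) and verifies that, for any distribution of true sharing, the observed excess sharing equals (1 - ϵ/2) times the true excess minus ϵ/4. The `linked` distribution is a hypothetical stand-in for P(π|x, γ), whose actual form is given in [12]; all names are assumptions.

```python
from fractions import Fraction

PI_VALUES = (Fraction(0), Fraction(1, 2), Fraction(1))


def transition_matrix(eps):
    """Rows: true pi (0, 1/2, 1); columns: observed pi in the same order."""
    return (
        (Fraction(1), Fraction(0), Fraction(0)),
        (eps / 2, 1 - eps / 2, Fraction(0)),
        (Fraction(0), eps, 1 - eps),
    )


def observed_distribution(p_true, eps):
    """P(pi_eps) = sum over pi of P(pi_eps | pi) * P(pi)."""
    t = transition_matrix(eps)
    return tuple(sum(p_true[i] * t[i][j] for i in range(3)) for j in range(3))


def excess_sharing(p):
    """Mean IBD sharing minus its null value of 1/2."""
    return sum(v * w for v, w in zip(PI_VALUES, p)) - Fraction(1, 2)


eps = Fraction(1, 100)  # exact fraction so the identity holds without rounding
null = (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))
linked = (Fraction(1, 5), Fraction(1, 2), Fraction(3, 10))  # hypothetical

for name, p in (("null", null), ("linked", linked)):
    before = excess_sharing(p)
    after = excess_sharing(observed_distribution(p, eps))
    # Slope shrinks by (1 - eps/2); the origin moves by -eps/4.
    assert after == (1 - eps / 2) * before - eps / 4
    print(name, before, after)
```

Because the identity is linear in the distribution of π, it holds for every trait value x, which is why the error appears as a rescaled slope plus a constant shift of the regression origin.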
Since [...]

Bias and impact on power and type I error

Since [...] In the left-hand side of Table 1, we have computed the values of A and [...]

Table 1. Bias in selective designs

In Table 2, we report the power and type I error for realistic genotyping error rates [14] equal to 0.005 and 0.01 for the same designs as in Table 1. The equivalent sample size used corresponds to samples with Fisher's information equal to 2500, which provides 90% power to detect a QTL explaining 10% of the total variance in the absence of genotyping error (pointwise nominal error rate = 10⁻⁴). The most visible impact is on the type I error rate in ED designs, which is up to 7 times its nominal value. The [...]

Table 2. Impact of genotyping error (rate = ϵ) on type I error and power

Genomic control for genotyping error

As we have seen in previous sections, the main effect of genotyping error is to modify the intercept in the regression used to test for linkage. Although an unconstrained regression would correct most of the bias due to genotyping error, the inefficiency of this strategy makes it impractical. In order to obtain efficient and robust inference, it therefore seems natural to try to constrain the regression through its correct origin a. In this section, we propose a completely data-driven strategy for doing this.

At any position, the sample mean IBD sharing has variance 1/(8n), where n is the number of sib pairs available. If we knew that the position is unlinked, or if the sample of sib pairs was random, then the deviation of this mean from [...] Unfortunately, detection of a position-specific intercept corresponding to typical error rates would require a sample size of order 10⁴, a number that is almost never reached in linkage studies. In order to obtain an intercept estimate [...]

Let us assume that the proportion of alleles shared IBD, π, is computed at a series of approximately regular positions indexed by t across the whole genome. Let yt be the sample mean (among families) excess IBD at position t, i.e.
In random samples or in any sample where [...]

Discussion

Under two basic error models, we were able to predict quantitatively the consequences of genotyping error on inference in linkage analysis. In the idealized situation of complete IBD information, both error models have the same impact on linkage analysis. As we have seen, the effect is due to a decrease in IBD sharing. Conversely, an error process that increased IBD sharing would produce opposite results. The true error processes involved in practice are complicated mixtures of the models alluded to here. In our experience, however, it seems that processes which lower IBD sharing are predominant.

Because genotyping error tends to decrease the estimated number of alleles shared IBD, the effect on evidence for linkage is opposite in EC (reduced power) and ED (increased type I error) designs. It can be dramatic in typical designs and is, paradoxically, less severe for more extreme ascertainment schemes. By analogy, for a dichotomous trait, this means that the effect of genotyping error is less severe in ASP designs for rare diseases than for common diseases. Remarkably, in designs combining both ED and EC pairs like the [...]

Our study used an idealized model where IBD information is assumed to be complete. In practice, IBD is uncertain and is inferred using marker data and multipoint algorithms as implemented in publicly available software [16,17]; the general effect is to shrink the IBD estimate [...]

The genomic-control strategy that we have proposed, although triggered by the specific issue of genotyping error, potentially offers a general robust method for carrying out linkage analysis. It is nonetheless important to recognize its limitations. Firstly, if the trait is highly polygenic with contributing genes scattered across the genome, the high correlation between linkage positions will make it impossible to estimate the IBD sharing at null positions.
The genomic control strategy should therefore only be considered with oligogenic traits. Secondly, the concept of genomic control relies on the assumption that the genotyping error rates are similar across markers. For markers with a similar degree of polymorphism (number of alleles and frequencies), this assumption might be acceptable. In a multipoint setting, an additional assumption required to ensure the validity of a genomic control strategy is that inter-marker distances be approximately equal. With microsatellite markers, both these assumptions might fail, resulting in differences in the IBD sharing reduction across markers. The 'regression-based linkage testing' view allows one to qualitatively assess how deviation from these assumptions will impact linkage testing. For example, in ASP or EC designs, wrongly assuming that IBD is uniformly reduced across markers will result in inflated type I error at marker positions with a low genotyping error rate compared to other markers. The advent of SNP chips in linkage mapping holds the promise of regular marker maps with less variable information content than in classical microsatellite maps [2,3]. The many SNPs used are likely to be subject to similar genotyping error processes; this makes the critical assumption of the genomic control strategy all the more plausible.

Alternatives to this genomic-control strategy are possible. They also consist in constraining the linkage regression through a new origin, as in the ad-hoc method, and the estimation procedure can be adapted to suit particular circumstances. Firstly, in random samples, the assumption regarding exchangeability of positions might be relaxed. Indeed, the reduction in IBD sharing at each position may be used as an estimate of the position-specific intercept (a study sufficiently powered to detect linkage in random samples should have a huge sample size, which would ensure sufficient precision of the position-specific intercepts).
However, it must be stressed that the advantage of using a genomic control in random samples is limited because the impact of genotyping error is small in such designs. Secondly, one could use previous lab data to estimate by how much IBD sharing deviates from its expected value; this could also be done at each position separately, provided sufficient data are available. In practice, such data might not be available or they might not faithfully reflect current error mechanisms.

Elston et al. [18] have pointed out that the implicit assumption made in ASP designs, that randomly sampled sib pairs share half of their alleles IBD, might not hold in practice and have argued for including discordant pairs in such studies. The genomic control approach suggested here may be an alternative solution to this issue. Finally, we note that, although we have only considered designs involving sib pairs, the approach naturally extends to other types of relative pairs.

Conclusion

Under realistic genotyping error scenarios, power losses observed in extremely concordant designs are modest but the effect on type I error in extremely discordant designs can be dramatic. Our analytic approach provides some understanding of the differences in influence of genotyping errors across study designs. The advent of SNP arrays does not eliminate the impact of genotyping errors, but it makes genomic control a feasible option with the potential to deliver more robust inference in linkage analysis data subject to genotyping errors or other mechanisms distorting the IBD signal.

Abbreviations

ASP: affected sib pair; EC: extremely concordant; ED: extremely discordant; EDAC: extremely concordant and extremely discordant; IBD: identical-by-descent; QTL: quantitative trait locus; SNP: single-nucleotide-polymorphism.

Authors' contributions

JJPL participated in the method development, carried out the simulations summarized in Table 1, drafted and finalized the manuscript.
HP participated in method development and in drafting the manuscript. JJH-D and HCvH both participated in method development. All authors read and approved the final manuscript.

Acknowledgements

This paper originates from the GENOMEUTWIN project which is supported by the European Union Contract No. QLG2-CT-2002-01254. We are grateful to Dr. Bas Heijmans from the section Molecular Epidemiology, Dept. of Medical Statistics and Bioinformatics, Leiden University Medical Center for discussions on genotyping error mechanisms.