Email updates

Keep up to date with the latest news and content from BMC Genetics and BioMed Central.

Open Access Methodology article

Interval estimation of genetic susceptibility for retrospective case-control studies

Dmitri V Zaykin1*, Zhaoling Meng12 and Sujit K Ghosh2

Author Affiliations

1 Department of Population Genetics, GlaxoSmithKline Inc., Five Moore Drive, P.O. Box 13398, Research Triangle Park NC 27709, USA

2 Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA

For all author emails, please log on.

BMC Genetics 2004, 5:9  doi:10.1186/1471-2156-5-9

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2156/5/9


Received:13 August 2003
Accepted:11 May 2004
Published:11 May 2004

© 2004 Zaykin et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Abstract

Background

This article describes classical and Bayesian interval estimation of genetic susceptibility based on random samples with pre-specified numbers of unrelated cases and controls.

Results

Frequencies of genotypes in cases and controls can be estimated directly from retrospective case-control data. On the other hand, genetic susceptibility defined as the expected proportion of cases among individuals with a particular genotype depends on the population proportion of cases (prevalence). Given this design, prevalence is an external parameter and hence the susceptibility cannot be estimated based on only the observed data. Interval estimation of susceptibility that can incorporate uncertainty in prevalence values is explored from both classical and Bayesian perspective. Similarity between classical and Bayesian interval estimates in terms of frequentist coverage probabilities for this problem allows an appealing interpretation of classical intervals as bounds for genetic susceptibility. In addition, it is observed that both the asymptotic classical and Bayesian interval estimates have comparable average length. These interval estimates serve as a very good approximation to the "exact" (finite sample) Bayesian interval estimates. Extension from genotypic to allelic susceptibility intervals shows dependency on phenotype-induced deviations from Hardy-Weinberg equilibrium.

Conclusions

The suggested classical and Bayesian interval estimates appear to perform reasonably well. Generally, the use of exact Bayesian interval estimation method is recommended for genetic susceptibility, however the asymptotic classical and approximate Bayesian methods are adequate for sample sizes of at least 50 cases and controls.

Background

Association mapping of complex phenotypes in case-control samples involves analysis of tables of genotype/allele counts collected at a large number of genetic loci. Relating single-locus genotype and allele frequencies to the outcome is a basic analysis step even if complex interactions among loci are expected. Indeed, biologically realistic models that involve multiple interacting polymorphisms may induce considerable "marginal effects" associated with individual loci. Often the numbers of cases and controls are fixed in advance by the experimental design, and the multiple markers are typed. Then case/control proportions remain the same for all markers, whereas genotype and allele numbers in cases and controls are subject to the random sampling variation. The reverse is however of greater interest: what is the genetic susceptibility, or the probability that a random individual will have a particular outcome given that a particular genotype or an allele is observed at a locus? Unlike the odds ratio, this parameter is not invariant with respect to the prospective vs. retrospective sampling schemes. Therefore, the point estimate of this probability can only be obtained from genotype counts in cases and controls by assuming a particular value of the population prevalence. Another issue is the degree of uncertainty in the estimate, which is being investigated in this article via interval estimation. Classical, or "frequentist" confidence interval (CI) is a well established framework for interval estimation. Such an interval is a random quantity, and the probability statements are made about proportions of times a random CI covers the fixed population parameter. A more relevant question is often about the variability, or uncertainty associated with the estimate of the population value. In other words, we would like to be able to make statements about the lower and upper bounds for genetic susceptibility, thus interpreting the interval as fixed, and the susceptibility as random, where randomness is due to the limited amount of data about the parameter. Bayesian (e.g. "credible") intervals provide such interpretation, but sometimes are criticized for subjectivity associated with the choice of prior distributions. It is not generally the case that classical and Bayesian intervals should correspond to each other, however such correspondence is possible for certain combinations of the likelihood and prior distributions. For example, P. Altham [1] showed that when testing for the difference of two binomial proportions, frequentist Fisher's exact test can be viewed as a Bayesian test when assuming binomial likelihoods and Beta priors for the distributions of proportions. Thus, although no priors are explicitly assumed by Fisher's test, one can recover prior distributions that are indirectly implied by the test. Datta and Mukerjee [2] provide an extensive review of such matching probability problems and show that in most parametric cases suitable priors can be constructed that would match the Bayesian credible intervals of a given size to that of a frequentist confidence interval up to second order. As we show here, under a beta-binomial model there is a close connection between the two types of intervals for genetic susceptibility, which allows flexibility in the interpretation.

Results

Susceptibility intervals

Association studies are often "retrospective" in the sense that samples of cases and controls are determined by recruitment ("fixed"), and the genetic variants, e.g. genotypes AA and (not-AA) at a genetic marker, are random. Estimated genotype frequencies among cases or controls, e.g. are obtained directly from the following table as .

Assume AA is the risk genotype. If the samples were random with respect to both rows and columns, then the estimated susceptibility (penetrance, or the positive predictive value), φ = Pr( | AA), would simply be estimated by x/(x + y). Because the rows are fixed, we need the population prevalence, w. Then,

where w = Pr(), p = Pr(AA | ), and q = Pr(AA | ). If the rows are assumed to arise independently from binomial distributions, the maximum likelihood estimates of the parameters p and q are given by = x/n, and = y/m, respectively. The population prevalence, w, has to be estimated externally, rather than from the data provided by retrospective case-control samples. Let denote such an estimate of w. Hence, a point estimate of φ can be obtained as .

The expression given above involves the ratio of random variables, y/x, and hence obtaining the exact sampling variance is problematic. Approximate methods have been suggested for the ratio of binomial variables [3,4]. An approximate variance using the logit transformation allows us to obtain a simple expression as well as to separate the terms for p, q and w in the variance expression. To see this, define

Classical interval

The first-order Taylor series approximation gives the variance of as

where the first two terms refer to the variance of the relative risk of AA on the log scale. Pepe [5] presented extensive discussion on estimation of similar quantities in the context of biomedical research. The last term of (3) can be further approximated as

and requires the knowledge of which must come from an external source. When the range (r = wu - wl) of w is known, we may define , the range on the logit scale. Then can be approximated by (r*/6)2, since three standard deviations from the mean cover over 99% of the (centered) normal distribution, f (x, 0, σ), i.e. , where f(x, 0, σ) denotes the probability density function of a Normal distribution with mean 0 and standard deviation σ. Henceforth, unless otherwise mentioned we use . Browne [6] discussed this issue and similar approaches of relating the range to the standard deviation.

The CI for susceptibility is obtained by inverting the endpoints of the asymptotic normal interval for η,

e.g. the upper point is inverted as . As we know that , and , the estimated variance becomes

Approximate Bayesian interval

The variance given by (6) is infinite when the observed value of either x or y is zero. Before describing a method of dealing with this, we note that the usual asymptotic variance formula for the log odds ratios, based on the sum of reciprocals of the 2 × 2 table counts, {n11, n12, n21, n22}, has a similar deficiency, i.e. it results in an infinite variance when one of the observations is zero. To avoid this problem, it is common to add 1/2 to each cell. Haldane [7], Gart and Zweifel [8] justified this on the basis of minimizing the bias in the estimation of the logit variance (also see discussion in [9]). The variance becomes

We propose a similar modification of (6) and justify it by showing that it is an approximate Bayesian variance estimator. In passing, we note that (7) can also be obtained by using the same argument as in the derivation of the approximate posterior variance of that follows.

When the sampling distribution of genotypes is binomial, i.e. x|p ~ Bin(p, n) and y|q ~ Bin(q, m) and the prior distributions for p and q are independent Beta(γ 1, β1) and Beta(γ2, β2) respectively, the posterior distributions of p and q are independent Beta's with

p | x ~ Beta(x + γ1, n - x + β1)

q | y ~ Beta(y + γ2, m - y + β2)    (8)

The binomial likelihood arises as a consequence of independence between genotypes in a random sample. Although a Beta prior for allele frequencies is sometimes justified from purely mathematical convenience, it does provide a sufficiently flexible shape. More importantly, certain population-genetic models result in Beta, or more generally Dirichlet distributions for allele frequencies. For example, population frequencies at mutation-drift equilibrium follow this distribution [10]. A reasonable assumption is that a typical value from the susceptibility distribution is around the prevalence value, which implies similarity in the prior distributions of p and q, i.e. γ1 = γ2, β1 = β2.

Different parameters would generally assume prior deviation of the typical susceptibility value away from the population prevalence.

Posterior expectations and variances of p and q are respectively given by,

Using posterior variances for and , and substituting posterior expectations for parameter estimates and in (3), we obtain

Again, the approximation (4) is appropriate when is available.

When γ1, γ2 → 0 and β1, β2 → 0 (Haldane's prior, see [11]) we are essentially back to (6),

because n/(n + 1) and m/(m + 1) approach one as m and n increase. Therefore (6) is also justified from the Bayesian point of view.

When γ1 = γ2 = β1 = β2 = 1/2 (Jeffreys' prior, [11]),

Welch and Peers [12] established that in general Jeffreys' priors provide mathematical correspondence between frequentist and Bayesian intervals. In particular, these authors proved that under some regularity conditions (which are trivially satisfied for this problem) Jeffreys' prior is the unique prior for a parameter for which a Bayesian credible interval and a frequentist confidence interval match up to the second order.

The asymptotic normal credible interval based on can be obtained as:

where , and the endpoints, l and u, are inverted as and , respectively to produce an approximate (1 - α)% interval estimate of φ.

Exact Bayesian interval

It is also possible to obtain an "exact" Bayesian credible interval using Monte Carlo samples generated from the posterior distribution of φ. The sample is obtained by repeatedly drawing p and q from their posterior distributions given by (8) and calculating φ via (1) for each realization of p, q. This generates an empirical posterior distribution for φ and a (1 - α)% interval is given by (α/2), (1 - α/2) quantiles of this distribution. Values w can be drawn uniformly from the range of its possible values. A bell-shaped Beta distribution can also be assumed with parameters based on the reported range, wl - wu, as described in Methods, section 3. An algorithm is as follows:

1. Generate p(i) ~ Beta(x + γ1, n - x + β1), i = 1,.., B.

2. Generate q(i) ~ Beta(y + γ2, m - y + β2), i = 1,.., B.

3. Generate w(i) ~ Uniform (wl, wu), i = 1,.., B.

4. Compute , i = 1,..., B.

Then the α/2 and 1 - α/2 quantiles of the generated empirical posterior distribution of φ provide the (1 - α)% credible interval for the susceptibility. This algorithm is easily implemented in R, a language and environment for statistical computing and graphics available as Free Software at http://www.r-project.org/ webcite. For example, setting γ1 = γ2 = β1 = β2 = 0.5, the entire code using 100,000 samples, 95% credible interval, and w distributed uniformly between 0.04 and 0.06 is

p <- rbeta(100000, x + 0.5, n - x + 0.5)

q <- rbeta(100000, y + 0.5, m - y + 0.5)

w <- runif(100000, 0.04, 0.06)

phi <- w*p / (w*p + (1 - w)*q)

phiL <- quantile(phi, prob = 0.025)

phiU <- quantile(phi, prob = 0.975)

cat(c("Interval estimate", phiL, phiU), fill = T)

Common beta priors on p and q (γ1 = γ2, β1 = β2) may be interpreted as vague with respect to the prior distribution for φ. On the logit scale, this distribution is bell-shaped symmetric, and centered around ln [w/(1 - w)]. Its variance decreases as the common (γ, β) parameters for p and q increase. For large values of γ, β this variance can be obtained by the Taylor series approximation, 2β/[γ (γ + β + 1)]. A mixture of beta distributions can take an arbitrary shape with the great advantage that the resulting distribution is bounded between the lower and upper values – which is the case for p, q, and φ parameters. In Methods (section 1) we outline the sampling scheme for the mixture.

Interval estimation of allele or haplotype susceptibilities

Methods described so far could be applied without modifications to the estimation of allele or haplotype susceptibilities when the probability of obtaining a sample of alleles follows the multinomial likelihood. However, such usage would require the assumption of independence between alleles. This can be justified under the assumptions of Hardy-Weinberg equilibrium (HWE) in the population, as well as the multiplicative effects of haplotype susceptibilities [13]. When these assumptions are reasonable, the intervals are obtained in the same way as described above using counts of alleles/haplotypes instead of genotypes. Otherwise the variance of allele frequencies, is no longer binomial (x, y are now the counts of alleles A in cases and controls respectively, and n, m are the total numbers of alleles). It is instead given by

where Dp, Dq are the Hardy-Weinberg disequilibrium (HWD) coefficients [14], defined as the deviation of the observed frequencies of AA from the expected under random union of alleles. For example, , and is estimated by using the observed frequencies, . These variances are most easily derived by re-coding the genotypes (AA, , ) as G = (-1, 0, 1), which keeps track of the number of alleles A minus one in a corresponding genotype, and taking expectations in V(G) = E(G2) - (E(G))2 (see Appendix 1 in [15], or two alternative derivations in [14]).

These coefficients (quantifying HWD among cases or controls) may be non-zero even if the population is in HWE [13]. Then the variance (6) becomes

with two new additional terms to account for HWD. Consider the effect of . If is negative, the resulting variance becomes smaller. If it is positive, the maximum possible value of [14] is

The maximum values for the terms with HWD in (15) are

Therefore, when or reach their maximum possible values, the inflation of the corresponding part of the variance is twice the value assuming the equilibrium. To set up an exact Bayesian interval for allelic/haplotypic susceptibility we may start with distributions for genotypes/diplotypes instead of that for alleles/haplotypes. Random samples of genotypes for cases, {n1j}, and for controls, {n2j}, follow the multinomial distribution. A conjugate prior distribution is Dirichlet(γij), so the posterior genotype frequencies Pij are sampled from Dirichlet(γij + nij) distribution. Posterior allele/haplotype frequencies are given as and the method proceeds as above. Jeffrey's prior is obtained when all γij are 1/2.

Coverage probabilities

The intervals using (6, 10) and the exact Bayesian interval have good frequentist coverage probabilities. To illustrate this, we simulated population values of p and q as p ~ Beta(1, 1) and q ~ Beta(1, 1) for each of 10,000 simulation runs. We used γ = β = 1/2 for the interval calculations and assumed that w = 0.5 or w = 0.04 is known and fixed. To compare intervals based on (6) and (12) we only considered samples with x, y > 0, obtained as x ~ Bin(p, n), y ~ Bin(q, m). In addition, we considered binomial samples with at least one of x, y equal to zero to check coverage properties of (12). We used m = n = 10,50,100,500 and 90% coverage intervals. Frequentist intervals cannot be used with zero counts of n11 or n21, which may be problematic only for sample sizes of 10 (17% of the samples). In this small percentage of cases, we resort to adding 1/2 to the counts as in (10). Agresti [9] considered a similar approach for calculation of CIs for odds ratios, that is to use Gart's formula (7) only when zero counts are encountered. Results for coverage probabilities are shown in Table 1. Note that the coverage of all three methods is around the nominal 90% for all sample sizes and both values of population prevalence. Results in Table 1 are averages across the distribution of p and q, however similar coverage results are obtained when the population values of p, q are set to specific fixed values. We considered the following fixed sets of parameters:

Table 1. Coverage probabilities for nominal 90% intervals based on 10,000 simulations.

1. p = 0.96, q = 0.06, w = 0.04. This set results (equation 1) in the population susceptibility value of φ = 0.4 (ten-fold increase from the population prevalence value).

2. p = 0.9, q = 0.15, w = 0.04. This set results in the population susceptibility value of φ = 0.2.

3. p = 0.8, q = 0.2, w = 0.5. This set results in the population susceptibility value of φ = 0.8.

The coverage probabilities for all three methods were found to be sufficiently close to the nominal values of 90% and 95% for sample sizes of 50 or larger (Table 2).

Table 2. Coverage probabilities for nominal (90%/95%) intervals based on 10,000 simulations, based on three population settings of (p, q, w). Setting 1: p = 0.96, q = 0.06, w = 0.04. Setting 2: p = 0.9, q = 0.15, w = 0.04. Setting 3: p = 0.8, q = 0.2, w = 0.5.

Table 3 shows the mean length and the standard deviation of the interval length across 10,000 simulations used to produce Table 1. This standard deviation reflects estimated variability of the interval length and decreases with sample size, but not with the number of simulations. The corresponding standard error could be obtained by dividing the standard deviation by the square root of the number of simulations. Table 4 shows the same results among the intervals that include the population value of w. Numbers for each parameter combination that contrast three intervals in a given table are obtained using the same data sets. For example, once a sample of size n is obtained for a specific value of w, all three intervals are calculated using this sample. Therefore, comparisons between methods are quite precise given the number of simulations. Again, the mean and standard deviation are similar among three methods. Nevertheless, the classical CI is never the shortest of the three intervals for the parameter values that we have considered and typically it has the highest length variability. Using data summarized in Table 3, the exact interval was found to have the smallest length 5 times out of 8 considered parameter combinations and has the lowest variability 5 times out of 8. These numbers are respectively 5 out of 8 and 4 out of 8 for data used to construct Table 4. There is some dependency between results from Tables 3 and 4 due to the same combination of parameters.

Table 3. Average length (standard deviation) of the intervals based on 10,000 simulations.

Table 4. Average length (standard deviation) of those intervals that contained the true parameter value.

We do not present numerical results based on mis-specification of w or the uncertainty in it, because the behavior of intervals is clear. When w is mis-specified, the intervals actively worsen as the sample size increases (in the sense of the probability of including the population parameter), since they shrink around the wrong value. On the other hand, the uncertainty in w widens the interval length. As the sample size becomes large, the minimum length simply reflects the variability in w. We conducted simulation experiments to illustrate this point, using the population value of w = 0.045 assuming the range 0.03 to 0.06 is reported. For the exact Bayesian interval, the distribution of w was modelled as Beta(ψ1 = 77.31, ψ2 = 1640.69) with parameters estimated by (29). These values are taken to reflect the reported prevalence of the hypersensitivity reaction to an antiretroviral drug abacavir, for which genetic predisposition has been described [16]. Table 5 reports results for the coverage probabilities and the interval length of 10% intervals, assuming p = 0.8 and q = 0.2, which corresponds to the population susceptibility value of 0.159. The results show that the average length is somewhat increased when the calculation are based on the range of w, and the interval coverage is above the nominal value of 90%, because of the increased interval width.

Table 5. Effect of uncertainty in the prevalence, w: the population value is w = 0.045; the assumed range is either 0.03–0.06 ( > 0, first three columns), or zero ( = 0, last three columns). AF, AB, EB refer to the asymptotic frequentist, approximate Bayesian, and exact Bayesian 10% intervals.

Pharmacogenetic application

Hypersensitivity reaction to the antiretroviral drug abacavir affects approximately 4.5% of patients. The hypersensitivity symptoms are varied with fever and rash among the most common. A small number of fatalities have also been reported. The exact mechanism of hypersensitivity is not established, but the pattern of symptoms suggests that it is an immunological reaction, triggered by specific genetic polymorphisms. Hetherington et al. [16] studied the association of HLA-B57 haplotype with hypersensitivity. One or two copies of HLA-B57 haplotype were present in x = 39 of n = 84 individuals with hypersensitivity (cases) and in y = 4 out of m = 113 controls. The prevalence of the hypersensitivity reaction is estimated to be between 0.03 and 0.06 [16]. The estimate of φ using (1) and the midrange w (that is equal to 0.045) is 0.382. Assuming w = 0.045 is fixed (i.e. V(w) = 0), the intervals are as follows. The frequentist 95% interval is 0.187 - 0.624, the asymptotic Bayesian interval is 0.180 - 0.584, and the exact Bayesian interval is 0.203 - 0.648. Taking the range of w into account, the frequentist and Bayesian asymptotic intervals become wider, 0.183 - 0.631 and 0.175 - 0.591, correspondingly, and the exact Bayesian interval assuming the uniformity of w on the range (0.03 - 0.06) is 0.186 - 0.658. This interval is estimated from repeatedly (i = 1,..., 100000) generating p(i) ~ Beta(39 + 1/2, 84 - 39 + 1/2), q(i) ~ Beta(4 + 1/2, 113 - 4 + 1/2), w(i) ~ Uniform(0.03,0.06), and calculating . Then the credible interval was obtained from this (empirical) distribution of φ by determining its 2.5% and 97.5% quantiles. It is not necessary to assume the uniformity of w. Indeed, it may be reasonable to assume that the midrange value is more plausible than the endpoints and a realistic distribution can be described as bell-shaped. Taking w ~ Beta(57.8572,1227.86) generates a bell-shaped distribution with the mean at 0.045 and 99% of this distribution is contained between 0.03 and 0.06 (Methods, section 3). The resulting interval is somewhat smaller, 0.195 - 0.653. Results for this example are summarized in Table 6.

Table 6. Summary of susceptibility intervals for the pharmacogenetic example. AF, AB, EB refer to the 95% asymptotic frequentist, approximate Bayesian, and exact Bayesian intervals.

Discussion

Asymptotic frequentist and Bayesian intervals for genetic susceptibility as well as the exact Bayesian interval are quite similar in properties. Given that the sampling from the posterior distribution can be done directly, there is no particular reason to resort to approximations and the exact Bayesian interval can be recommended. Nevertheless, approximations are useful in that they reveal connections between the confidence and credible intervals as well as allow for sample size and power calculations as we outline in Methods, section 2. The mean length and the interval variability appear somewhat smaller for the Bayesian intervals. The frequentist coverage of Bayesian intervals is satisfactory. One may even argue that the operational usage of CIs is more often Bayesian than it is not, because the practical issue is ultimately about the confidence on the plausible range of parameter values. Algebraic relations between classical and Bayesian intervals as well as similarities in numerical properties allow Bayesian interpretation for the CIs. On the other hand, Bayesian intervals for this problem can be described in the classical sense as random, covering the fixed population parameter α% of the time in repeated samples.

Caution should be taken when the inference is about allele or haplotype susceptibilities. When assumption of population HWE does not hold, or when there are substantial deviations from multiplicativity, the variance of susceptibility contains additional terms that account for population or model-induced deviations from Hardy-Weinberg equilibrium.

Conclusions

We found that the classical interval for genetic susceptibility can be considered as an approximate Bayesian interval under the beta-binomial model. This algebraic similarity between the proposed classical and approximate Bayesian intervals allows an appealing Bayesian interpretation of the usual confidence limits. Simulation studies also confirm similarities in coverage probabilities and interval lengths among asymptotic classical and approximate and exact Bayesian intervals.

Methods

Posterior sampling using the mixture of beta distributions

We derive explicitly the posterior distribution of θ under a Binomial sampling using a mixture of Beta priors. As Beta is a conjugate prior for the binomial likelihood, we show that mixture of Beta's is also conjugate. To see this, consider the problem

where p(x| θ) denotes the sampling density of X = x given θ. Now consider a prior for θ given by a mixture of k beta distributions Beta(γj, βj) with weights vj, such that and vj ≥ 0 for all j. More explicitly, the prior density π(θ), of θ is given by,

where denotes the density of a Beta (γj, βj) distribution and denotes the Beta function. Using Bayes' rule it follows that the posterior distribution is given by,

where = γj + x, = βj + n - x and

Thus it follows that if X | θ ~ Bin(θ, n) and θ ~ vj Beta(γj, βj) then θ|X = x ~ . It may be noted that any continuous density on [0,1] can be approximated by an appropriate mixture of Beta densities (see [17] for more general results on approximating any prior using natural conjugate priors). Besides such theoretical properties, sampling from the mixture of Beta's is also straightforward. For instance, in order to generate θ ~ , we can apply the following two-stage sampling:

1. Sample J = j ∈ {1,..., k} with probability , i.e., Pr [J = j] = .

2. Conditional on the sampled value of J = j, sample

This generates a sample from the mixture . Thus, our sampling scheme described in "Exact Bayesian Interval" section can be easily modified to sampling from mixture of Beta's instead.

Power and sample size

Under the hypothesis of equality of genotype frequencies in cases and controls, H0 : Pr(AA | ) = Pr(AA | ), the expected susceptibility in (1) is equal to the population prevalence, w. Correspondingly, the logit transformation η has the mean equal to , and

The two-sided alternative hypothesis is HA : Pr(AA | ) ≠ Pr(AA | ). The values of Pr(AA | ), Pr(AA | ) and the ratio under the alternative are denoted as pA, qA, and ΔA, respectively. Once these values are specified, we can calculate the power of detecting the difference of the susceptibility from the population prevalence as well as the sample size required to achieve the required power under an acceptable type I error rate. We will illustrate the sample size calculation assuming the equal number of cases and controls (m = n). Power calculation given the fixed sample size is obtained similarly. We have

where

is the non-centrality parameter corresponding to 1 - β power under a two-sided level-α test. Therefore, the necessary sample size is

Estimating parameters of the Beta distribution for the prevalence from the range of reported values

Suppose the range of the prevalence values, wl - wu, is reported. We would like to model the uncertainty with a bell-shaped Beta(ψ1 > 1, ψ2 > 1) distribution, such that a certain percentage of the distribution (e.g. 99%) is between wl and wu with the expected value given by wl + (wu - wl)/2. The parameters of this distribution can be found as the solution for ψ1 of

where denotes the density of a Beta(ψ1, ψ2) distribution and is the Beta function. The second parameter for b(·) in (28) is given by the constraint that the expectation for w is ψ1/(ψ1 + ψ2).

As a sample calculation, consider an antiretroviral drug abacavir for which the hyper-sensitivity reaction range is reported as wl = 0.03 to wu = 0.06 [16], so that we can take = wl + (wu - wl)/2 = 0.045. If we assume that these bounds cover 99% of the prevalence distribution that results from sampling errors in estimation, possibly confounded with other factors such as genuine population heterogeneity of samples, then the resulting distribution reflecting the uncertainty in w is Beta(57.8572, 1227.86), using (28).

Much more simply calculated parameter estimates can be obtained by the normal approximation to the beta distribution. Let = wl + (wu - wl)/2 and approximate the variance by = [(wu - wl)/6]2. By the method of moments, estimates of the parameters (ψ1, ψ2) are

For the abacavir example, these estimates are found as = 77.31 and = 1640.69. Using these estimates, 99.7% of the resulting Beta(77.31, 1640.69) distribution is contained within the range 0.03 - 0.06. Thus, the normal approximation yields a somewhat more condensed distribution.

Authors' contributions

All authors made substantial contributions to this paper, including conceiving of the ideas, discussion and writing. All authors read and approved the final manuscript.

Acknowledgements

Clive Bowman and three anonymous reviewers provided valuable comments that improved the manuscript.

References

  1. Altham P: Exact Bayesian analysis of a 2 × 2 contingency table, and Fisher's "exact" significance test.

    J of the Royal Statistical Society, Series B 1969, 31:261-269. OpenURL

  2. Datta GS, Mukerjee R: Probability matching priors: higher order asymptotics. Lecture Notes in Statistics. Springer Verlag; 2004. OpenURL

  3. Katz D, Baptista J, Azen SP, Pike MC: Obtaining confidence intervals for the risk ratio in cohort studies.

    Biometrics 1978, 34:469-474. OpenURL

  4. Koopman PAR: Confidence intervals for the ratio of two binomial proportions.

    Biometrics 1984, 40:513-517. OpenURL

  5. Pepe MS: The statistical evaluation of medical tests for classification and prediction. Oxford University Press; 2003. OpenURL

  6. Browne RH: Using the sample range as a basis for calculating sample size in power calculations.

    Am Statist 2001, 55:293-298. Publisher Full Text OpenURL

  7. Haldane JBS: The estimation and significance of the logarithm of a ratio of frequencies.

    Ann Human Genet 1955, 20:309-311. OpenURL

  8. Gart JJ, Zweiful JR: On the bias of various estimators of the logit and its variance with applications to quantal bioassay.

    Biometrika 1967, 54:181-187. PubMed Abstract OpenURL

  9. Agresti A: On logit confidence intervals for the odds ratio with small samples.

    Biometrics 1999, 55:597-602. PubMed Abstract | Publisher Full Text OpenURL

  10. Wright S: The genetical structure of populations.

    Ann Eugen 1951, 15:32-354. OpenURL

  11. Geisser S: On prior distributions for binary trials.

    The American Statistician 1984, 38:244-247. OpenURL

  12. Welch BL, Peers HW: On formulae for confidence points based on integrals of weighted likelihoods.

    J of the Royal Stat Soc (B) 1963, 25:318-329. OpenURL

  13. Nielsen DM, Ehm MG, Weir BS: Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus.

    Am J Human Genet 1999, 63:1531-1540. Publisher Full Text OpenURL

  14. Weir BS: Genetic data analysis II. Sinauer, Sunderland MA; 1996. OpenURL

  15. Meng Z, Zaykin DV, Xu C-F, Wagner M, Ehm MG: Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes.

    Am J Human Genet 2003, 73:115-130. Publisher Full Text OpenURL

  16. Hetherington S, Hughes AR, Mosteller M, Shortino D, Baker KL, Spreen W, Lai E, Davies K, Handley A, Dow DJ, Fling ME, Stocum M, Bowman C, Thurmond LM, Roses AD: Genetic variations in HLA-B region and hypersensitivity reactions to aba-cavir.

    Lancet 2002, 359:1121-1122. PubMed Abstract | Publisher Full Text OpenURL

  17. Dalai SR, Hall WJ: Approximating priors by mixtures of natural conjugate priors.

    J of the Royal Stat Soc (B) 1983, 45:278-286. OpenURL