Open Access Methodology article

Incorporation of genetic model parameters for cost-effective designs of genetic association studies using DNA pooling

Fei Ji1, Stephen J Finch2, Chad Haynes1, Nancy R Mendell2 and Derek Gordon3*

Author Affiliations

1 Lab of Statistical Genetics, Rockefeller University, New York, NY, USA

2 Department of Applied Math and Statistics, Stony Brook University, Stony Brook, NY, USA

3 Department of Genetics, Rutgers University, Piscataway, NJ, USA

BMC Genomics 2007, 8:238  doi:10.1186/1471-2164-8-238


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/8/238


Received: 19 December 2006
Accepted: 16 July 2007
Published: 16 July 2007

© 2007 Ji et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Studies of association methods using DNA pooling of single nucleotide polymorphisms (SNPs) have focused primarily on the effects of "machine-error", number of replicates, and the size of the pool. We use the non-centrality parameter (NCP) for the analysis of variance test to compute the approximate power for genetic association tests with DNA pooling data on cases and controls. We incorporate genetic model parameters into the computation of the NCP. Parameters involved in the power calculation are disease allele frequency, frequency of the marker SNP allele in coupling with the disease locus, disease prevalence, genotype relative risk, sample size, genetic model, number of pools, number of replicates of each pool, and the proportion of variance of the pooled frequency estimate due to machine variability. We compute power for different settings of number of replicates and total number of genotypings when the genetic model parameters are fixed. Several significance levels are considered, including stringent significance levels (due to the increasing popularity of 100 K and 500 K SNP "chip" data). We use a factorial design with two to four settings of each parameter and multiple regression analysis to assess which parameters most significantly affect power.

Results

The power can increase substantially as the genotyping number increases. For a fixed number of genotypings, the power is a function of the number of replicates of each pool such that there is a setting with maximum power. The four most significant parameters affecting power for association are: (1) genotype relative risk, (2) genetic model, (3) sample size, and (4) the interaction term between disease and SNP marker allele probabilities.

Conclusion

For a fixed number of genotypings, there is an optimal number of replicates of each pool that increases as the number of genotypings increases. Power is not substantially reduced when the number of replicates is close to but not equal to the optimal setting.

Background

Case/control genetic association studies are used as a means of localizing susceptibility genes for a complex disease. With the recent development of technologies that can determine the genotypes for hundreds of thousands of single nucleotide polymorphisms (SNPs) across the human genome, such studies are now being reported in the literature [1-3]. Design issues such as power to detect association using these technologies are also being published [4,5]. Since a critical requirement for such studies to be sufficiently powered is that the disequilibrium between the disease allele and neighboring marker alleles be large, marker density needs to be high. If the effect size for a complex disease is small (e.g., genotype relative risks [6] on the order of 1.5 to 2), the sample size required to detect association may be thousands of cases and controls [4,5,7-9]. Therefore, researchers often consider genotyping technologies such as DNA pooling [10-13] as an initial strategy to identify genomic regions that may harbor susceptibility loci in an effort to reduce cost (time and money) (e.g., [14,15]). Advantages of DNA pooling technologies include a (sometimes substantial) reduction in genotyping cost when performing multi-stage association studies to identify disease susceptibility genes. Potential disadvantages include reliance on a number of assumptions related to statistical design and analysis. For example, a key assumption is that the intensity measure has an expected value equal to the allele frequency. Another potential disadvantage is that DNA pooling techniques may not detect disease modes of inheritance that deviate from dominant or recessive modes. For example, DNA pooling techniques will be underpowered to detect disease genes that operate in an over-dominant form.

Sham et al. reviewed currently available technologies for DNA pooling [10]. The statistical analysis of data from pooled DNA studies uses analysis of variance (ANOVA) procedures that have algorithms for calculating power to detect unequal allele probabilities. A major design issue when using DNA pooling technologies is the measurement error as compared with the gold standard method of individual genotyping.

Research has been done regarding specification of study parameter settings to maximize power [10,16,17]. The research question addressed in this work is: assuming a certain level of measurement error, what settings of study design parameters maximize the power to detect association? More specifically, we study the sensitivity of power to changes in design parameters (e.g., total sample size, differing numbers of genotypings, number of pools, and genetic model parameters). We present a closed form approximation to the power in terms of the genetic model, pooling measurement error model, and the study parameters (e.g., number of pools, number of replicates per pool, sample size) and we perform a systematic study of the design parameters to identify which have the greatest effect on power to detect association for DNA pooling studies.

Results

The pooled DNA association studies considered here have an equal number N of cases and controls. For a fixed number of total subjects (cases and controls), an equal number of cases and controls yields maximal power for association [7,8]. The N subjects in each group are randomly assigned to one of J pools, each of size T (so that N = J × T). Each of the J pools has K replicate measures, so that the number of case genotypings is equal to the number of control genotypings (G = J × K). The data analyzed in the study are the estimated allele frequencies Yijk of the more common allele (called "2"), where the index i is 0 for cases and 1 for controls, the index j ranges from 1 to J, and the index k ranges from 1 to K. The variance of Yijk has two components, one due to the sampling variability of the frequency of allele 2 in each pool (denoted by σP² here) and the other due to the variability of the measurement process of the pooled material (denoted by σE² = (m - 1)σP² here). We refer to the term m as the machine replicability variance factor. The quality of the estimate of the pooled frequency, as measured by its variance, is parameterized so that σE² is proportional to the sampling variance of the allele 2 frequency and is assumed to be independent of pool size or other pooling parameters. When the number of pools is J ≥ 2, the structure of a pooled DNA study is an example of a two-stage nested design [18].
Its statistical analysis is conventionally organized in an ANOVA table as in Table 1, with the null hypothesis that the case allele 2 frequency is equal to the control allele 2 frequency. This hypothesis is tested using the statistic F = MSA/MSP. Here, SSA is the sum of squares of the case/control averages; SSP is the sum of squares of the pool averages within a group about the group pool mean, and is the basis of an estimate of the variance of a pool average frequency. The term MSA is the mean square of SSA, which by definition is just SSA divided by its degrees of freedom (df), and similarly for MSP. Under the null hypothesis, MSA is also the basis of an estimate of the variance of a pool average frequency. When the null hypothesis is false, MSA is increased on average, as shown in its expected mean square.

Table 1. The analysis of variance table for a two-stage nested design

The power calculation of the F-test, the standard statistical procedure used when testing allele frequency differences for DNA pooling, requires the non-centrality parameter (NCP) of the test. Its approximate value is given in equation 1 of the Methods and Technical Issues section below. The NCP is a function of the difference between the case and control allele 2 frequencies, the quality of the pooling estimate of these probabilities, the number of cases and controls, the number of replications of DNA measurements of each pool, and the size of each pool.

When the number of replicates K is fixed, the approximate NCP is constant with respect to the number of pools (J). When the number of pools J is larger, the denominator degrees of freedom (df) are larger, so that the power of the F-test is greater. That is, smaller pool sizes T = N/J for larger J, have greater power. The protocol of genotyping each subject has T = 1, which is the most powerful allele frequency testing protocol. That is, if genotype cost is not an issue, it is always most powerful to individually genotype all subjects.
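As a compact restatement of the power calculation described above, the power of the level-α F-test can be written as a tail probability of a non-central F distribution. The degrees of freedom used here (1 for the case/control contrast and 2(J - 1) for pools within groups) are the standard values for a balanced two-stage nested design with two groups of J pools each; they are stated as an assumption because Table 1 is not reproduced in this text. The symbol λ stands for the NCP of equation 1.

```latex
\text{Power} \;=\; \Pr\!\left( F'_{1,\,2(J-1)}(\lambda) \;>\; F_{1-\alpha;\,1,\,2(J-1)} \right)
```

Here F' denotes a non-central F random variable and F_{1-α; 1, 2(J-1)} is the central F critical value. For fixed λ, a larger J lowers the critical value, which is the mechanism behind the statement that more pools give greater power when K is fixed.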

When the total number of genotypings (G = J × K) is fixed, as is the situation for a fixed budget, the optimal choice of J and K is more complex. When one knows the genetic model parameters, one can examine the power using a range of values of J and K (and hence T) to find settings with high power. We seek to find Ko(G), the number of replicates that has greatest power when there are G genotypings. For example, Figure 1 is based on a recessive mode of inheritance (MOI) with N = 10,000, prevalence φ = 0.05, disease allele frequency pd = 0.15, relative risk for the genotype homozygous for the disease allele R2 = 3, linkage disequilibrium pr = 0.9 (expressed as a fraction of Dmax, the maximum disequilibrium value between the disease allele and the coupling SNP allele; also see the PAWE-3D website Helpfile [19]), minor SNP allele frequency q1 = 0.35, and machine replicability variance factor m = 2.25. We set G = J × K to 80, 160, 320, and 640 with significance level 0.0001. The power increases substantially as G increases. For example, the maximum power is 0.73 with 80 genotypings when Ko(80) = 4; that is, 4 replicates of each of 20 pools. It increases to 0.91 with 640 genotypings when Ko(640) = 13; that is, 13 replicates of each of 49 pools. The power of the chi-squared 2 × 2 test of independence when each subject is individually genotyped is 0.97. With 640 genotypings, the power with K = 4 is 0.85. The increase of power from 0.73 to 0.91 is obtained through additional genotyping effort rather than increased sampling of subjects.
Also note that the power when K = 1 is always substantially less than the power using the optimal choice of K; that is, replication of pool measurement is always advantageous.
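The trade-off between J and K under a fixed genotyping budget can be explored by listing the candidate designs. The sketch below is illustrative only and is not the authors' program: the function name `pooling_designs` is hypothetical, and it assumes that the number of pools is the largest whole number affordable, J = floor(G/K), which reproduces the 20 pools (K = 4, G = 80) and 49 pools (K = 13, G = 640) quoted above. Pool size is reported as the average T = N/J and need not be an integer.

```python
def pooling_designs(total_genotypings, subjects_per_group, max_replicates=None):
    """List (K, J, T) designs that fit within G genotypings per group.

    Assumption (not from the article): J = floor(G / K).
    """
    designs = []
    k_max = max_replicates or total_genotypings // 2
    for k in range(1, k_max + 1):
        j = total_genotypings // k       # whole pools affordable with K replicates each
        if j < 2:                        # the nested design requires J >= 2
            break
        t = subjects_per_group / j       # average pool size T = N / J
        designs.append((k, j, t))
    return designs


# Figure 1 setting: N = 10,000 subjects per group, G = 640 genotypings per group.
designs = {k: (j, t) for k, j, t in pooling_designs(640, 10_000)}
print(designs[13])   # number of pools and average pool size for K = 13
```

Each candidate (K, J, T) would then be passed to a power routine to locate Ko(G).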

Figure 1. Power as a function of number of replicates (K) for fixed number of genotypings (G = J × K) with recessive mode of inheritance. Power values presented here are for studies with N = 10000, prevalence φ = 0.05, disease allele frequency pd = 0.15, relative risk for the genotype homozygous for the disease allele R2 = 3, minor SNP marker allele frequency q1 = 0.35, machine replicability variance factor m = 2.25, linkage disequilibrium pr = 0.9, and significance level alpha = 0.0001. *The horizontal line represents the power for specified parameters with individual genotyping using the 2 × 2 test of independence. Power with individual genotyping was computed using the method implemented in the Power for Association With Error (PAWE) website [27].

Figure 2 is based on a dominant MOI with N = 5,000, prevalence φ = 0.05, disease allele frequency pd = 0.15, relative risk of a genotype with at least one copy of the disease allele equal to 1.5, linkage disequilibrium pr = 0.9, minor SNP allele frequency q1 = 0.35, and machine replicability variance factor m = 2.25. We set the number of genotypings J × K to 40, 80, 160, 320, and 640. The pattern is similar to that of Figure 1. A program is available from the corresponding author to produce these numbers for user-specified settings.

Figure 2. Power as a function of number of replicates (K) for fixed number of genotypings (G = J × K) with dominant mode of inheritance. Power values presented here are for studies with N = 5000, prevalence φ = 0.05, disease allele frequency pd = 0.15, relative risk of a genotype with at least one copy of the disease allele = 1.5, minor SNP marker allele frequency q1 = 0.35, machine replicability variance factor m = 2.25, linkage disequilibrium pr = 0.9, and significance level alpha = 0.0001. *The horizontal line represents the power for specified parameters with individual genotyping using the 2 × 2 test of independence. Power with individual genotyping was computed using the method implemented in the Power for Association With Error (PAWE) website [27].

We note that, although results are not presented, we performed analyses similar to those presented in Figures 1 and 2 for a multiplicative MOI. The conclusions were the same, with results being very similar to the dominant MOI results (Figure 2). We omit these results in the interest of brevity.

The program mentioned above was used to create Table 2, which considers the robustness of design choices when studying a disease with prevalence equal to 0.05. We consider both dominant and recessive MOI with genetic relative risk (GRR) values ranging from 1.5 to 2.2 for specified levels of significance, linkage disequilibrium, sample size, minor SNP marker allele frequency and quality of pooling measurement m. We examine the range of numbers of genotypings J × K between 40 and 640. Table 2 gives the maximum power for each number of genotypings, Ko(G), and the range of K settings that produce power within 95% of the maximal power. As in Figures 1 and 2, the most important result is that increasing G always substantially increases power. For example, in scenario 1 with 10,000 subjects in each group, recessive MOI, relative risk 2.2, and level of significance 0.0001, the maximal power is 38% with 40 genotypings compared to 77% with 640 genotypings. Similar patterns hold for the other situations considered. The value of Ko(G) increases at a less than linear rate as G increases. Typically, the decrease in power associated with using a value of K slightly different from Ko(G) is relatively small; that is, the power of the procedure is relatively insensitive to choice of K. While K = 4 is optimal or close to optimal when the number of genotypings is small (i.e., G = 40 or 80), Ko(G) increases with G, and the optimal design can have appreciably greater power than a design with 4 replicates. The value of Ko(G) is not substantially affected by whether the MOI is dominant (see scenarios 8–10) or recessive (see scenarios 1–7).

Table 2. Maximum power as a function of the number of genotypings (G = J × K), number of replicates giving maximum power (Ko(G)), range of the number of replicates (K) giving power within 95% of the maximum power for specific experimental and genetic parameters, and the power at K = 1 when assuming no machine replicability variability (m = 1)

Regression modeling results

We use ordinary least squares (OLS) regression analysis with power at the 0.0001 significance level as the dependent variable for each of the 4⁴ × 2³ × 3 × 5 (= 30,720) model specifications. We consider the 9 factors listed in Table 3 and all possible two-way combinations in our regression model to assess the relative importance of the factors in determination of power to detect association. We also use the square of the number of replicates to model the optimal number of replicates. The analysis finds a significant fit (F = 1348.07 with 55 and 30,664 df, p-value < 0.0001) with coefficient of determination (R-squared) equal to 0.71. Genotype relative risk (R2) has the largest F-statistic (34333.5 with 1 df), with increasing R2 associated with greater power. Sample size has the second largest F-statistic (15002.4 with 1 df). The MOI also has a highly significant F-statistic (5869.7 with 2 df). For a fixed genotype relative risk R2, the median power is greatest for dominant MOI, followed by multiplicative and then recessive MOIs. The prevalence of the disease (φ), the minor marker allele frequency, and the measurement quality of the pooling are the factors that have the smallest F-statistic values. Measurement error explains less of the variance than genetic parameters. In general, increased measurement error reduces the power of the procedure. Further, with genetic parameters fixed, the decrease in power from increased measurement error can be offset either by an increase in K or by a decrease in the number of individuals in each pool.
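The size of the factorial design and the degrees of freedom of the overall F-test can be checked with a few lines of arithmetic. This sketch reads the design as four factors at 4 levels, three at 2 levels, one at 3 levels and one at 5 levels (nine factors in total); which factor takes which number of levels is not stated in this text, so the ordering in `levels_per_factor` is arbitrary.

```python
import math

# Number of settings for each of the 9 factors (ordering is arbitrary).
levels_per_factor = [4, 4, 4, 4, 2, 2, 2, 3, 5]

n_models = math.prod(levels_per_factor)   # one power value per model specification
model_df = 55                             # numerator df reported for the overall F-test
residual_df = n_models - model_df - 1     # subtract the model terms and the intercept

print(n_models, residual_df)
```

The residual degrees of freedom obtained this way agree with the denominator df reported for the overall F-test.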

Table 3. List of parameters considered in the multiple regression analysis

Among the interaction terms not involving K, pd × q1, pd × MOI, R2 × T, pd × R2, and N × T are highly significant (sorted in increasing P-values). The most significant interaction term is pd × q1. This finding is not surprising as there has been extensive documentation in the statistical genetics literature that power for genetic association is maximized when the difference between the disease allele frequency and the SNP marker allele frequency in coupling with the disease allele is 0, with decreasing power occurring as the difference increases [20-23]. The finding of a significant interaction pd × MOI between disease allele frequency and disease MOI has also been documented previously, most recently in the work by Skol et al. [4]. The finding underscores the fact that, when all other factors are fixed, the disease allele frequency that gives optimal power differs depending upon the disease MOI.

Discussion

Our results have produced two types of conclusions. The first is that the genetic parameters of the disease being studied are the most important determinants of the power to detect association. This fact is consistent with the association of ApoE with late onset Alzheimer's Disease [24] and recent association results for age-related macular degeneration [1,3]. In each of these studies, estimated genotype relative risks are approximately 3 for the heterozygote and greater than 9 for the homozygote. In all studies, highly significant associations were observed with fewer than 500 total cases and controls. Furthermore, for age-related macular degeneration [24], associations were observed for SNP alleles in linkage disequilibrium (LD) with the functional variants. The results from the OLS regression analysis are consistent with this history. The genetic relative risk is the most significant parameter, followed by the sample size. For a fixed genotype relative risk R2, the median power is greatest for dominant MOI, followed by multiplicative and then recessive MOIs. The linear and quadratic terms in the number of replicates K and a number of interactions with K are significant. Since there is an optimal setting of K, this result is expected.

The second type of conclusion is guidance about the choice of the number of genotypings G = J × K and the simultaneous setting of the number of replicates K of the J pools. We have shown that the number of genotypings G = J × K should be as large as possible (holding all other factors constant) to have the greatest power. When G is fixed, we have shown that there is a setting Ko(G) that maximizes the power when all genetic model parameters are specified. The optimal setting increases as G increases. These differences are practically important and suggest that those conducting pooled studies use the program available from the corresponding author to determine optimal settings. In all situations studied, for fixed value of G, power is relatively insensitive to choice of K near Ko(G). Further, when the machine replicability variance factor m is larger than 1, the setting K = 1 has power much less than replicated designs. This suggests that such extensions of these designs as staggered nested designs [18] may have little value in genetic pooling studies.

Our work provides the basis for extending recommendations such as those of Sham et al. [10] to include genetic model parameters. For the very large studies possible with pooling, there is strong evidence that increasing the number of genotypings and increasing the number of replicate measurements of each pool can increase power noticeably. This approach is dependent on the assumption that E(Πi) = E(Yijk), where Πi is the fraction of the major allele 2 in a randomly selected subject from the ith group; that is, the pooled estimate of the intensity of an allele is in fact an unbiased estimate of the allele 2 frequency. Further work will incorporate designs that formally include validation of this assumption.

Conclusion

Our work extends that of previous researchers who have considered power and sample size calculations for genetic association studies with pooled DNA samples (e.g., [16]). Our extension involves inclusion of genetic model parameters such as disease MOI, disease allele frequency, disease prevalence, marker allele frequency, and genotype relative risks. It is clear from the results of our regression analysis that incorporation of such parameters is important in the design of more powerful genetic association tests. We recommend that researchers incorporate this information into their power and sample size calculations for genetic association with pooled DNA; choices such as the number of genotypings and the number of replicates can increase power from relatively low levels (40% to 50%) to 75% to 80% using the same cases and controls.

Methods

Definitions

N: number of case (control) subjects; we assume equal numbers of cases and controls (balanced design).

J: number of pools; J ≥ 2.

T = N/J: number of individuals in each pool; we assume that case subjects are assigned randomly to case pools and control subjects are assigned randomly to control pools.

K: number of replicates of each pool; we assume that there is no reassignment of subjects in the replications.

G = J × K: number of case (control) genotypings.

Genetic model parameters

We consider a disease associated with a di-allelic gene with allele d associated with increased risk of disease and allele + associated with no increased risk.

pd: allele frequency of disease locus d allele.

p+ = 1 - pd: allele frequency of disease locus wild-type (+) allele.

φ: prevalence of the disease.

f2: probability of having disease with 2 disease alleles in the genotype = penetrance of dd.

f1: probability of having disease with 1 disease allele in the genotype = penetrance of d+.

f0: probability of having disease with 0 disease alleles in the genotype = penetrance of ++.

Genotype relative risks (GRR)

R1 = f1/f0

R2 = f2/f0

Modes of Inheritance (MOI)

The three MOIs are characterized by the parameter R.

Multiplicative MOI: The penetrances satisfy the equation <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M21">View MathML</a>; that is, <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M22">View MathML</a>.

Dominant MOI: R = R1 = R2.

Recessive MOI: <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M23">View MathML</a>; that is, R1 = 1.
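To make the link between prevalence, genotype relative risk and penetrances concrete, the following sketch solves for f0, f1 and f2. It is not the authors' program. It assumes Hardy-Weinberg proportions at the disease locus and takes the genotype relative risks to be penetrance ratios relative to the ++ genotype (R1 = f1/f0, R2 = f2/f0). The multiplicative case is parameterized here as R1 = R and R2 = R², which is one common convention and may differ from the one used in the article. The function name `penetrances` is hypothetical.

```python
def penetrances(prevalence, p_d, grr, moi):
    """Return (f0, f1, f2) given prevalence, disease allele frequency, GRR and MOI.

    Assumes Hardy-Weinberg genotype proportions at the disease locus.
    """
    if moi == "dominant":
        r1, r2 = grr, grr            # R = R1 = R2
    elif moi == "recessive":
        r1, r2 = 1.0, grr            # R1 = 1, only dd carries increased risk
    elif moi == "multiplicative":
        r1, r2 = grr, grr ** 2       # assumed convention: R1 = R, R2 = R squared
    else:
        raise ValueError(f"unknown mode of inheritance: {moi}")

    p_plus = 1.0 - p_d
    # Genotype frequencies for ++, d+ and dd under Hardy-Weinberg proportions.
    g0, g1, g2 = p_plus ** 2, 2.0 * p_d * p_plus, p_d ** 2
    # Prevalence = f0 * (g0 + g1 * R1 + g2 * R2); solve for the baseline penetrance.
    f0 = prevalence / (g0 + g1 * r1 + g2 * r2)
    return f0, r1 * f0, r2 * f0


# Figure 1 setting: recessive MOI, prevalence 0.05, pd = 0.15, R2 = 3.
f0, f1, f2 = penetrances(0.05, 0.15, 3.0, "recessive")
print(f0, f1, f2)
```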

SNP marker parameters

q1: allele frequency of minor SNP marker allele 1 (that is, 0 < q1 ≤ 0.5).

q2 = 1 - q1: allele frequency of major SNP marker allele 2.

Disequilibrium parameters

Dmax = min(pdq2, p+q1).

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M24">View MathML</a> (see, e.g., [25]).

pr: measure of linkage disequilibrium between disease gene and SNP marker; here it is a fraction of Dmax (0 < pr ≤ 1); the examples use pr = 0.9.

The detailed computation of case and control genotype probabilities, which are functions of the disease allele frequency, minor SNP allele frequency, and linkage disequilibrium parameters, is documented in the PAWE-3D Helpfile [19].
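As a simplified illustration of the disequilibrium parameters (and not a reproduction of the PAWE-3D computation), the sketch below builds the four two-locus haplotype frequencies from pd, q1 and pr. It assumes that the disease allele d is in coupling with the minor marker allele 1, which is the arrangement for which min(pd q2, p+ q1) is the largest admissible disequilibrium, and that D = pr × Dmax. The function name `haplotype_frequencies` is hypothetical.

```python
def haplotype_frequencies(p_d, q1, pr):
    """Return haplotype frequencies keyed by (disease allele, marker allele).

    Assumes d is in coupling with marker allele 1 and D = pr * Dmax.
    """
    p_plus, q2 = 1.0 - p_d, 1.0 - q1
    d_max = min(p_d * q2, p_plus * q1)   # keeps every haplotype frequency non-negative
    d = pr * d_max
    return {
        ("d", "1"): p_d * q1 + d,
        ("d", "2"): p_d * q2 - d,
        ("+", "1"): p_plus * q1 - d,
        ("+", "2"): p_plus * q2 + d,
    }


# Settings used in the examples: pd = 0.15, q1 = 0.35, pr = 0.9.
hap = haplotype_frequencies(0.15, 0.35, 0.9)
for key, value in hap.items():
    print(key, round(value, 5))
```

The marginal sums over either locus recover the allele frequencies, which is a quick consistency check on any choice of pr.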

We use the method of [26], as implemented in the PAWE software [27], to calculate the power of the 2 × 2 test of independence when each subject is individually genotyped, and we report these values in Figures 1 and 2.

Case-control frequency of allele 2

Πi: the fraction of the major allele 2 in a randomly selected subject from the ith group, i = 0 for cases, i = 1 for controls. It follows that the expectation of Πi is given by:

E(Πi) = Pi2 + Pi1/2

where Pi1 is the frequency of the heterozygous genotype with allele 2 in the ith group and Pi2 is the frequency of the homozygous genotype with allele 2 in the ith group. In addition,

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M26">View MathML</a>
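The moments of Πi can be illustrated numerically. The sketch below treats Πi as a variable taking the value 1 for an allele 2 homozygote, 1/2 for a heterozygote and 0 otherwise, so that its mean is the homozygote frequency plus half the heterozygote frequency. The variance shown is the elementary variance of that three-valued variable; it is offered as an assumption and may not match the exact form of the expression in the article. The function name `allele2_fraction_moments` is hypothetical.

```python
def allele2_fraction_moments(p_het, p_hom):
    """Mean and variance of the allele 2 fraction in one randomly selected subject.

    p_het: frequency of the heterozygous genotype carrying allele 2 (Pi1)
    p_hom: frequency of the homozygous genotype for allele 2 (Pi2)
    """
    mean = p_hom + 0.5 * p_het               # values 1 and 1/2 weighted by genotype frequency
    second_moment = p_hom + 0.25 * p_het     # E(X^2) with X in {0, 1/2, 1}
    return mean, second_moment - mean ** 2


# Example: Hardy-Weinberg genotype frequencies with allele 2 frequency 0.65.
q2 = 0.65
mean, variance = allele2_fraction_moments(2 * q2 * (1 - q2), q2 ** 2)
print(mean, variance)
```

Under Hardy-Weinberg proportions the variance reduces to q2(1 - q2)/2, the binomial variance of an allele frequency estimated from two chromosomes.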

Analysis of variance (ANOVA) table for two-stage nested design

Specification of ANOVA model

Aijk: intensity level of allele 2 in the ith group (i = 0 for cases, 1 for controls), jth pool (j = 1,...,J), kth replicate (k = 1,...,K).

Bijk: intensity level of allele 1 in the ith group (i = 0 for cases, 1 for controls), jth pool (j = 1,..., J), kth replicate (k = 1,..., K).

Yijk = Aijk/(Aijk + Bijk): fraction of SNP allele 2 estimated in the ith group (i = 0 for cases, 1 for controls), jth pool (j = 1,..., J), kth replicate (k = 1,..., K).

Model:

Yijk = μ + αi + Pj(i) + σE·Eijk,

where the case or control effect is <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M28">View MathML</a>, i = 0,1, subject to the constraint ∑αi = 0. The random sampling effect of the allele 2 frequency associated with the jth pool in either cases or controls is <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M13">View MathML</a>, with <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M29">View MathML</a>. Finally, {Eijk} are independent N(0,1) random variables incorporating the additional variability due to the measurement process. See below for more details regarding the specification <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M30">View MathML</a>. It follows that

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M31">View MathML</a>

Here, var(Yijk) is modeled as the sum of two components of variance. The first, σP², is due to the sampling variation of the frequency of allele 2 in the subjects assigned to each pool. The second, σE², is due to the measurement error of the processing of the pooled material. Under an ideal measurement process, σE² = 0; we define a parameter m to capture the departure from this ideal. The parameter m (machine replicability variance factor), m ≥ 1, is defined by σE² = (m - 1)σP², so that m = 1 represents the ideal measurement process and m > 1 models additional variability due to a less than perfect measurement process. The fraction of var(Yijk) due to the measurement process is (m - 1)/m.
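The effect of replication on the precision of a group average can be sketched from the two variance components. The code below assumes that the measurement variance equals (m - 1) times the pool sampling variance, which is our reading of the definition of m, and uses the standard result for a balanced nested model that averaging K replicates divides only the measurement component by K. The pool sampling variance is taken as an input, and the function name `group_mean_variance` is hypothetical.

```python
def group_mean_variance(sigma_p2, m, n_pools, n_replicates):
    """Variance of the average of Yijk over J pools and K replicates in one group.

    sigma_p2: sampling variance of the allele 2 frequency of a single pool
    m: machine replicability variance factor (m >= 1)
    """
    sigma_e2 = (m - 1.0) * sigma_p2                     # assumed measurement variance
    pool_mean_var = sigma_p2 + sigma_e2 / n_replicates  # replicates average out only measurement error
    return pool_mean_var / n_pools                      # pools are independent within a group


# With m = 2.25, moving from 1 to 4 replicates shrinks the variance of the group average.
for k in (1, 4, 13):
    print(k, group_mean_variance(1.0, 2.25, 20, k))
```

When m = 1 the result does not depend on K, so replication adds nothing under an ideal measurement process. When m > 1, a larger K reduces the variance, which is why replication can offset measurement error.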

This model is dependent on the assumption that <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M34">View MathML</a>. Also, let <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M35">View MathML</a>. This value is an indication of the adequacy of the approximation of the NCP in equation (1) below [28].

Let

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M36">View MathML</a>

Following Scheffé [29], the means used in the sums of squares can be expressed in terms of the ANOVA model as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M37">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M38">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M39">View MathML</a>

Then,

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M40">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M41">View MathML</a>.

If we let Wi represent n independent and identically distributed N(μ, σ²) random variables, then <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M42">View MathML</a> has the distribution <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M43">View MathML</a> [18]. Consequently,

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M44">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M45">View MathML</a>. The sum of squares SSP therefore has the distribution <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M46">View MathML</a> when the null hypothesis is true, with <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M47">View MathML</a>. Further <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M48">View MathML</a> under both the null and alternative hypotheses with <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M49">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M50">View MathML</a>. The distribution of SSP under the alternative is a weighted sum of independent central chi-square distributions.

To obtain the distribution of SSA, consider

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M51">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M52">View MathML</a>. Then,

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M53">View MathML</a>

The null distribution of SSA is a scaled central chi-squared random variable with scaling factor <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M54">View MathML</a> so that

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M55">View MathML</a>

has a central F-distribution with 1 numerator degree of freedom (df) and 2(J - 1) denominator df when H0: αi ≡ 0 holds. Under the alternative hypothesis, the distribution of SSA is a weighted sum of non-central chi-squared random variables. The approximation to the alternative distribution of the F-test proposed here is a non-central F with 1 numerator df, 2(J - 1) denominator df, and non-centrality parameter (NCP) δ², where

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M56">View MathML</a>

(1)

As shown by Gronow [28], the inequality in variance does not affect the power approximation when p ≤ 1.5. Since <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M57">View MathML</a>, where 1 ≤ m,

<a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M58">View MathML</a>

which is not dependent upon J, assuming this model. This result is due to the fact that we assumed <a onClick="popup('http://www.biomedcentral.com/1471-2164/8/238/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/8/238/mathml/M30">View MathML</a>, which is an assumption that each individual's variance contributes equally to the variance of the pool. The factor (m - 1) includes the cumulative effect of such sources of additional variability as experimental error, differential variability in processing of individuals, and other sources.
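As a rough illustration of the non-central F approximation described above, the sketch below estimates power by Monte Carlo simulation for a given NCP, using 1 numerator df and 2(J - 1) denominator df. It is not the authors' software. The function names, the simulation size, and the example NCP are assumptions made for illustration, and the value of the NCP itself must be supplied from equation (1).

```python
import random


def simulate_f(ncp, denom_df, rng):
    """Draw one F(1, denom_df) variate with non-centrality parameter ncp."""
    # Numerator: square of a N(sqrt(ncp), 1) draw, i.e. a non-central
    # chi-square with 1 df and non-centrality ncp.
    z = rng.gauss(ncp ** 0.5, 1.0)
    # Denominator: central chi-square with denom_df df, drawn as a
    # gamma variate with shape denom_df / 2 and scale 2.
    chi2 = rng.gammavariate(denom_df / 2.0, 2.0)
    return (z * z) / (chi2 / denom_df)


def approx_power(ncp, n_pools, alpha, n_sim=50_000, seed=1):
    """Monte Carlo estimate of power under the non-central F approximation.

    n_pools plays the role of J; denominator df = 2 * (J - 1).
    """
    if n_pools < 2:
        raise ValueError("need at least 2 pools so that 2(J - 1) > 0")
    denom_df = 2 * (n_pools - 1)
    rng = random.Random(seed)
    # Empirical critical value from the simulated null (central) F.
    null_draws = sorted(simulate_f(0.0, denom_df, rng) for _ in range(n_sim))
    crit = null_draws[min(n_sim - 1, int((1.0 - alpha) * n_sim))]
    # Proportion of simulated alternative statistics exceeding it.
    hits = sum(simulate_f(ncp, denom_df, rng) > crit for _ in range(n_sim))
    return hits / n_sim


# Hypothetical example: NCP = 10 with J = 5 pools at the 0.01 level.
print(approx_power(10.0, n_pools=5, alpha=0.01))
```

Simulation is used here only to keep the sketch free of external dependencies. At stringent levels such as 0.0001 the empirical critical value becomes noisy, so an exact non-central F distribution function from a statistics package would be the more reliable choice.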

Multiple regression analysis of approximate power

We calculated the approximate power of the experimental design under various parameter values (Table 3). We then used ordinary least squares (OLS) multiple regression, implemented in SAS software [30], to identify the parameters with the greatest impact on power. As independent variables, we used all variables listed in Table 3, all two-way interactions of these variables, and K², the square of the number of replicates, which allows the model to capture the existence of an optimal number of replicates. We considered type I error rates of 0.01, 0.001, and 0.0001. It might be argued that researchers should use 0.0001 or less as a stringent significance level if the design is applied in a genome-wide association study. However, since DNA pooling is normally used for first-stage screening and first-stage design, researchers may be more concerned with false negatives than with false positives [9,31].
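The structure of the regression predictors (main effects, all two-way interactions, and the square of the number of replicates) can be sketched as follows. The parameter names and levels below are hypothetical stand-ins, since Table 3 is not reproduced here, and the sketch only assembles the predictor columns. It does not fit the model or reproduce the SAS analysis.

```python
from itertools import combinations, product


def factorial_design(levels):
    """Return every combination of the given factor levels as a dict."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[name] for name in names))]


def regression_terms(row, squared="K"):
    """Main effects, all two-way interactions, and the square of one factor."""
    terms = dict(row)                          # main effects
    for a, b in combinations(row, 2):          # two-way interactions
        terms[f"{a}*{b}"] = row[a] * row[b]
    terms[f"{squared}^2"] = row[squared] ** 2  # quadratic term in replicates
    return terms


# Hypothetical factor levels (NOT the settings of Table 3).
levels = {
    "K": [1, 2, 4],          # number of replicates
    "m": [1.0, 1.5],         # machine replicability variance factor
    "alpha": [0.01, 0.001],  # significance level
}
design = [regression_terms(row) for row in factorial_design(levels)]
print(len(design), sorted(design[0]))
```

Each dictionary in `design` corresponds to one row of predictors; pairing the rows with the computed power values would give the input for an OLS routine.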

Authors' contributions

FJ, SJF, NRM, and DG conceived of the study design. FJ performed all statistical analyses presented in the manuscript. CH wrote software to assist FJ in her analyses. FJ, SJF, and DG wrote the original manuscript and all revisions. All authors have read and approved the final manuscript.

Acknowledgements

This research was supported by the US National Institutes of Health grants NIMH R01 MH071523 (NRM), NIMH 2R01 MH04480114A1 (SJF), and MH44292 (FJ). The authors thank Dr. Yaning Yang for providing comments on our revised version of the manuscript.

References

  1. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J: Complement factor H polymorphism in age-related macular degeneration.

    Science 2005, 308(5720):385-389.

  2. Ozaki K, Tanaka T: Genome-wide association study to identify SNPs conferring risk of myocardial infarction and their functional analyses.

    Cell Mol Life Sci 2005, 62(16):1804-1813.

  3. Dewan A, Liu M, Hartman S, Zhang SS, Liu DT, Zhao C, Tam PO, Chan WM, Lam DS, Snyder M, Barnstable C, Pang CP, Hoh J: HTRA1 promoter polymorphism in wet age-related macular degeneration.

    Science 2006, 314(5801):989-992.

  4. Skol AD, Scott LJ, Abecasis GR, Boehnke M: Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies.

    Nat Genet 2006, 38(2):209-213.

  5. Wang H, Thomas DC, Pe'er I, Stram DO: Optimal two-stage genotyping designs for genome-wide association scans.

    Genet Epidemiol 2006, 30(4):356-368.

  6. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies.

    Am J Hum Genet 1993, 53(5):1114-1126.

  7. Purcell S, Cherny SS, Sham PC: Genetic power calculator: design of linkage and association genetic mapping studies of complex traits.

    Bioinformatics 2003, 19(1):149-150.

  8. Gordon D, Haynes C, Blumenfeld J, Finch SJ: PAWE-3D: visualizing power for association with error in case-control genetic studies of complex traits.

    Bioinformatics 2005, 21(20):3935-3937.

  9. Satagopan JM, Elston RC: Optimal two-stage genotyping in population-based association studies.

    Genet Epidemiol 2003, 25(2):149-157.

  10. Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA Pooling: a tool for large-scale association studies.

    Nat Rev Genet 2002, 3(11):862-871.

  11. Kirov G, Nikolov I, Georgieva L, Moskvina V, Owen MJ, O'Donovan MC: Pooled DNA genotyping on Affymetrix SNP genotyping arrays.

    BMC Genomics 2006, 7:27.

  12. Meaburn E, Butcher LM, Liu L, Fernandes C, Hansen V, Al-Chalabi A, Plomin R, Craig I, Schalkwyk LC: Genotyping DNA pools on microarrays: tackling the QTL problem of large samples and large numbers of SNPs.

    BMC Genomics 2005, 6(1):52.

  13. Allison DB, Schork NJ: Selected methodological issues in meiotic mapping of obesity genes in humans: issues of power and efficiency.

    Behav Genet 1997, 27(4):401-421.

  14. Simonic I, Gericke GS, Ott J, Weber JL: Identification of genetic markers associated with Gilles de la Tourette syndrome in an Afrikaner population.

    Am J Hum Genet 1998, 63(3):839-846.

  15. Simonic I, Nyholt DR, Gericke GS, Gordon D, Matsumoto N, Ledbetter DH, Ott J, Weber JL: Further evidence for linkage of Gilles de la Tourette syndrome (GTS) susceptibility loci on chromosomes 2p11, 8q22 and 11q23-24 in South African Afrikaners.

    Am J Med Genet 2001, 105(2):163-167.

  16. Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG: Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design.

    Ann Hum Genet 2002, 66(Pt 5-6):393-405.

  17. Le Hellard S, Ballereau SJ, Visscher PM, Torrance HS, Pinson J, Morris SW, Thomson ML, Semple CA, Muir WJ, Blackwood DH, Porteous DJ, Evans KL: SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis.

    Nucleic Acids Res 2002, 30(15):e74.

  18. Montgomery DC: Design and Analysis of Experiments. Sixth edition. Hoboken, J. Wiley and Sons; 2005.

  19. PAWE-3D [http://linkage.rockefeller.edu/pawe3d/]

  20. Zondervan KT, Cardon LR: The complex interplay among factors that influence allelic association.

    Nat Rev Genet 2004, 5(2):89-100.

  21. Pfeiffer RM, Gail MH: Sample size calculations for population- and family-based case-control association studies on marker genotypes.

    Genet Epidemiol 2003, 25(2):136-148.

  22. Ji F, Yang Y, Haynes C, Finch SJ, Gordon D: Computing asymptotic power and sample size for case-control genetic association studies in the presence of phenotype and/or genotype misclassification errors.

    Stat Appl Genet Mol Biol 2005, 4(1):Article 37.

  23. Gordon D, Finch SJ: Factors affecting statistical power in the detection of genetic association.

    J Clin Invest 2005, 115:1408-1418.

  24. Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, Small GW, Roses AD, Haines JL, Pericak-Vance MA: Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families.

    Science 1993, 261(5123):921-923.

  25. Pritchard JK, Przeworski M: Linkage disequilibrium in humans: models and data.

    Am J Hum Genet 2001, 69(1):1-14.

  26. Gordon D, Finch SJ, Nothnagel M, Ott J: Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms.

    Hum Hered 2002, 54(1):22-33.

  27. PAWE [http://linkage.rockefeller.edu/pawe/]

  28. Gronow DG: Test for the significance of the difference between means in two normal populations having unequal variances.

    Biometrika 1951, 38(1-2):252-256.

  29. Scheffé H: The Analysis of Variance. In Wiley Classics Library. New York, Wiley-Interscience; 1999:477.

  30. SAS, version 9.1 [http://www.sas.com]

  31. Elston RC, Guo X, Williams LV: Two-stage global search designs for linkage analysis using pairs of affected relatives.

    Genet Epidemiol 1996, 13(6):535-558.