Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong, China

Abstract

Background

The Cochran-Armitage trend test (CATT) is powerful in detecting association between a susceptibility marker and a disease. This test, however, may suffer a substantial loss of power when the underlying genetic model is unknown and incorrectly specified. Thus, it is useful to derive tests that achieve reasonable power against all common genetic models. For this purpose, the genetic model selection (GMS) and genetic model exclusion (GME) methods were proposed recently. Simulation results showed that GMS and GME achieve reasonable power against three common genetic models while the overall type I error is well controlled.

Results

Although GMS and GME are statistically powerful, they can be seriously affected by known confounding factors such as gender, age and race. Therefore, in this paper, by comparing the difference of Hardy-Weinberg disequilibrium coefficients between the cases and the controls within each sub-population, we propose the stratified genetic model selection (SGMS) and exclusion (SGME) methods, which eliminate the effect of confounding factors by adopting a matching framework. Our goal is to investigate the robustness of the proposed statistics and to compare them with other commonly used efficiency robust tests, such as MAX3 and the χ^{2} test with 2 degrees of freedom (df), in matched case-control association designs through simulation studies.

Conclusion

Simulation results showed that if the mean genetic effect of the heterozygous genotype is between those of the two homozygous genotypes, then the proposed tests and MAX3 are preferred. Otherwise, the χ^{2} test with 2 df may be used. To illustrate the robust procedures, the proposed tests are applied to a real matched pair case-control etiologic study of sarcoidosis.

Background

The population-based case-control association study is a powerful approach for detecting the association between a candidate marker and a disease. Compared with the family-based association study, which recruits samples from family members, the case-control study is more cost effective because cases and controls are unrelated and hence easy to recruit from the population. To test for genetic association using the case-control design, the genotypic data for a bi-allelic marker are usually described by a 2 × 3 table where rows represent the disease status and columns represent the genotypic counts. Hence, testing for genetic association is equivalent to testing for association between the rows and the columns. Generally, Pearson's χ^{2} test with 2 df can be used to detect such an association. Besides, if a linear trend among the rows can be assumed, a more powerful test, based on the score test for a logistic regression, can be obtained. This score test is known as the Cochran-Armitage trend test (CATT).
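As a small illustration of the 2 × 3 table analysis described above, the following sketch computes Pearson's chi-square statistic and its 2 df p-value in plain Python. The function name and the genotype counts are hypothetical and chosen only for illustration; the p-value uses the closed-form survival function of the chi-square distribution with 2 df, exp(-x/2).

```python
import math

def pearson_chi2_2x3(cases, controls):
    """Pearson chi-square statistic and 2-df p-value for a 2 x 3 genotype table."""
    r, s = sum(cases), sum(controls)
    n = r + s
    stat = 0.0
    for r_i, s_i in zip(cases, controls):
        n_i = r_i + s_i               # column total for this genotype
        exp_case = r * n_i / n        # expected case count under independence
        exp_ctrl = s * n_i / n        # expected control count under independence
        stat += (r_i - exp_case) ** 2 / exp_case
        stat += (s_i - exp_ctrl) ** 2 / exp_ctrl
    p_value = math.exp(-stat / 2.0)   # chi-square survival function with 2 df
    return stat, p_value

# Hypothetical genotype counts (G0, G1, G2) for cases and controls.
chi2, p = pearson_chi2_2x3([30, 50, 20], [40, 45, 15])
print(round(chi2, 4), round(p, 4))
```

The sketch assumes that every genotype column has a positive total; a column with no observations would need to be dropped before the statistic is computed.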

To apply the CATT, increasing scores are specified a priori for the underlying genetic model. A genetic model refers to the model of inheritance, which defines a relationship among the risks of having the disease given the different genotypes. The common genetic models include, but are not limited to, the recessive (REC), additive (ADD) and dominant (DOM) models. If the underlying genetic model is known, the asymptotically optimal CATT can be used. Otherwise, the CATT is not robust when the scores are misspecified.
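To make the role of the scores concrete, here is a minimal sketch of the trend statistic for an unmatched 2 × 3 table with scores (0, x, 1), where x = 0, 0.5 and 1 correspond to the REC, ADD and DOM models. It uses the commonly quoted closed form of the CATT; the function name and the example counts are assumptions made for illustration.

```python
import math

def catt(cases, controls, x):
    """Cochran-Armitage trend statistic with genotype scores (0, x, 1)."""
    scores = (0.0, x, 1.0)
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [a + b for a, b in zip(cases, controls)]
    # Numerator: sqrt(n) * sum_i x_i * (s * r_i - r * s_i)
    num = math.sqrt(n) * sum(w * (s * a - r * b)
                             for w, a, b in zip(scores, cases, controls))
    # Null variance term: r * s * [n * sum x_i^2 n_i - (sum x_i n_i)^2]
    var = r * s * (n * sum(w * w * t for w, t in zip(scores, totals))
                   - sum(w * t for w, t in zip(scores, totals)) ** 2)
    return num / math.sqrt(var)

# Hypothetical counts for genotypes (G0, G1, G2).
cases, controls = [20, 50, 30], [40, 45, 15]
z_rec, z_add, z_dom = (catt(cases, controls, x) for x in (0.0, 0.5, 1.0))
print(z_rec, z_add, z_dom)
```

The three statistics generally differ on the same data, which is why a misspecified score can lose power. With only two non-empty genotype columns the squared statistic coincides with the Pearson chi-square of the corresponding 2 × 2 table.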

Methods that are robust against a variety of underlying models of inheritance have recently become an important area of research. Pearson's χ^{2} test with 2 df does not assume any structure for the genetic model, so it is robust against the choice of genetic model. Moreover, the maximin efficiency robust test (MERT) and the MAX method, which uses the maximum of the CATTs optimal for REC, ADD and DOM respectively, were extensively studied.

Although the population based case-control study is powerful and feasible to implement, spurious association may arise due to known confounding factors such as gender, age and race. Intuitively, the GMS and GME do not work in the presence of confounding factors. One of the reasons is that when the samples are divided into several sub-populations via the confounding factors, the Hardy-Weinberg equilibrium (HWE) assumption needed in the first phase of the GMS and GME does not hold any more. Besides, the CATTs used in the second phase of the GMS and GME do not control the size well due to the confounding factors.
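The spurious association described above can be seen without any random simulation. The sketch below, an illustration rather than the authors' simulation code, builds expected genotype counts under HWE for two hypothetical sub-populations in which cases and controls share the same allele frequency within each stratum (so there is no association within strata), and then pools them. The stratum sizes and allele frequencies are borrowed from one of the confounded settings in the simulation section (300/200 cases, 200/300 controls, allele frequencies 0.2 and 0.4).

```python
import math

def trend_z(cases, controls, x=0.5):
    """Trend statistic with scores (0, x, 1) for an unmatched 2 x 3 table."""
    scores = (0.0, x, 1.0)
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [a + b for a, b in zip(cases, controls)]
    num = math.sqrt(n) * sum(w * (s * a - r * b)
                             for w, a, b in zip(scores, cases, controls))
    var = r * s * (n * sum(w * w * t for w, t in zip(scores, totals))
                   - sum(w * t for w, t in zip(scores, totals)) ** 2)
    return num / math.sqrt(var)

def hwe_counts(size, p):
    """Expected genotype counts (G0, G1, G2) under HWE for allele frequency p."""
    q = 1.0 - p
    return [size * q * q, size * 2 * p * q, size * p * p]

# (number of cases, number of controls, allele frequency) per stratum
strata = [(300, 200, 0.2), (200, 300, 0.4)]

# Within each stratum the statistic is zero up to rounding error.
z_within = [trend_z(hwe_counts(r, p), hwe_counts(s, p)) for r, s, p in strata]

# Pooling the strata ignores the confounder and produces a non-zero statistic.
pooled_cases = [sum(col) for col in zip(*(hwe_counts(r, p) for r, _, p in strata))]
pooled_controls = [sum(col) for col in zip(*(hwe_counts(s, p) for _, s, p in strata))]
z_pooled = trend_z(pooled_cases, pooled_controls)
print(z_within, z_pooled)
```

Because the case-to-control ratio differs between strata with different allele frequencies, the pooled table shows a trend even though none exists inside either stratum.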

Typically, when the confounding factors can be observed, they can be treated as covariates and incorporated in the logistic regression. However, the further calculation needed to adjust for the covariates may complicate the trend test. Alternatively, matching is frequently used as a much simpler way to control potential confounding factors in epidemiological studies. Specifically, a single case is matched with a certain number of controls based on the confounding factors, and each case together with its controls forms a matched set. Then, a conditional logistic regression analysis is normally used to fit the matched data. Recently, an increasing number of matching studies are conducted by either adopting the matched design

Similar to the unmatched case-control association study, when the underlying genetic model is unknown, the robustness of the statistics for the matched case-control design is also worth studying. Zheng and Tian

Methods

Genetic model selection and exclusion

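When the genetic model is unknown, the Hardy-Weinberg disequilibrium trend test Z_{HWDTT} proposed by Song and Elston can be used to detect the underlying genetic model, because Z_{HWDTT} > 0 is expected under the REC model and Z_{HWDTT} < 0 under the DOM model. Denote Z_{0}, Z_{0.5} and Z_{1} as the CATTs optimal for REC, ADD and DOM respectively. Zheng and Ng proposed the GMS, which uses Z_{0} if Z_{HWDTT} > c, Z_{1} if Z_{HWDTT} < -c and Z_{0.5} otherwise to test for genetic association, where c is a pre-specified threshold.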
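The following sketch illustrates the idea behind the HWD trend test for an unmatched table. It assumes the commonly quoted form of the statistic, in which the difference of the sample HWD coefficients in cases and controls is standardised by the pooled allele frequency; the function names and the example counts are hypothetical, and the exact expression used in the cited papers should be checked against them.

```python
import math

def hwd_coefficient(counts):
    """Sample HWD coefficient: freq(G2) - (freq(G2) + freq(G1) / 2) ** 2."""
    n0, n1, n2 = counts
    n = n0 + n1 + n2
    return n2 / n - (n2 / n + n1 / (2 * n)) ** 2

def hwdtt(cases, controls):
    """HWD trend statistic (assumed form), positive when cases carry excess homozygotes."""
    r, s = sum(cases), sum(controls)
    n = r + s
    n1 = cases[1] + controls[1]
    n2 = cases[2] + controls[2]
    p_hat = n2 / n + n1 / (2 * n)        # pooled frequency of the counted allele
    diff = hwd_coefficient(cases) - hwd_coefficient(controls)
    return math.sqrt(r * s / n) * diff / (p_hat * (1.0 - p_hat))

# Hypothetical counts: cases show an excess of homozygotes, controls are in HWE.
z_hwdtt = hwdtt([40, 20, 40], [25, 50, 25])
print(round(z_hwdtt, 4))
```

A large positive value points towards a recessive-like pattern and a large negative value towards a dominant-like pattern, which is the signal that the selection step exploits.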

Note that for the original GMS mentioned above, the risk allele is assumed to be known. However, if the risk allele cannot be correctly specified, such a GMS may run into problems. Specifically, consider a bi-allelic marker with alleles D and d. If D is the risk allele, Z_{0}, Z_{0.5} and Z_{1} are optimal for the REC, ADD and DOM models respectively. On the other hand, if d is the risk allele, -Z_{1}, -Z_{0.5} and -Z_{0} are optimal for the REC, ADD and DOM models respectively. Joo et al. used the sign of Z_{0.5} to decide which one is the risk allele, followed by the corresponding GMS, which depends on the determined risk allele. Joo et al. expressed the resulting GMS statistic, referred to as (1) below, in terms of indicator functions I(.).

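When the GRRs are small, Joo et al. noted that the probability of selecting the correct genetic model using Z_{HWDTT} becomes small. On the other hand, the probability of correctly excluding the most unlikely genetic model remains high against the GRRs. Furthermore, when the most unlikely genetic model is excluded, the simple MERT of the CATTs optimal for the two remaining models can be used, which leads to the GME obtained by replacing Z_{0}, Z_{0.5} and Z_{1} in (1) by the corresponding MERT-based statistics.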

Although GMS and GME are efficiency robust tests, they could be seriously affected by confounding factors. In the presence of sub-populations, GMS and GME may not keep the correct size. Therefore, to overcome this limitation, we propose the stratified genetic model selection (SGMS) and exclusion (SGME) approaches in the following.

Notation

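Consider a bi-allelic marker with alleles D and d, and denote the three genotypes by G_{0} = dd, G_{1} = Dd and G_{2} = DD. Suppose the samples are divided into strata C_{l} according to the confounding factors. In the l-th stratum, r_{l} cases are drawn from the population and s_{l} controls are matched to them. The genotype counts for (G_{0}, G_{1}, G_{2}) in cases and controls in the l-th stratum are (r_{0l}, r_{1l}, r_{2l}) and (s_{0l}, s_{1l}, s_{2l}), respectively. Hence, r_{l} = r_{0l} + r_{1l} + r_{2l} and s_{l} = s_{0l} + s_{1l} + s_{2l}.

In the l-th stratum, let f_{il} = Pr(case | G_{i}, C_{l}) denote the penetrance of genotype G_{i} (i = 0, 1, 2) and k_{l} = Pr(case | C_{l}) the disease prevalence. The genotype frequencies in cases and controls are then p_{il} = f_{il}Pr(G_{i}|C_{l})/k_{l} and q_{il} = (1 - f_{il})Pr(G_{i}|C_{l})/(1 - k_{l}), respectively. The genotype relative risks (GRRs) are λ_{1l} = f_{1l}/f_{0l} and λ_{2l} = f_{2l}/f_{0l} (f_{0l} > 0). A genetic model is REC, ADD and DOM if λ_{1l} = 1, λ_{1l} = (λ_{2l} + 1)/2 and λ_{1l} = λ_{2l}, respectively. We assume that HWE holds in each stratum. Thus, Pr(G_{1}|C_{l}) = 2p_{l}q_{l}, where p_{l} is the frequency of allele D in the l-th stratum and q_{l} = 1 - p_{l}.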

Stratified genetic model selection and exclusion

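Let X_{1lj} and X_{2ljk} denote the genotypic scores for the j-th case in the l-th stratum and for the k-th control matched to it, taking the values 0, x or 1 for genotypes G_{0}, G_{1} or G_{2} respectively, where 0 ≤ x ≤ 1.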

The null hypothesis of no association _{0 }:

Obviously, _{MTT}(

Suppose a family of scientifically plausible models is defined. Similar to the CATTs, corresponding to each model, an asymptotically optimal normally distributed MTT can be obtained. For example, Z_{MTT}(0), Z_{MTT}(0.5) and Z_{MTT}(1) are optimal for the REC, ADD and DOM models respectively. When the genetic model is uncertain, a pre-specified test from this family is not fully efficient; hence, MTTs are not recommended for direct use when the underlying genetic model is unknown. This underlying genetic model, however, can be ascertained using the Hardy-Weinberg disequilibrium (HWD) coefficient, which is denoted as Δ = Pr(DD) - [Pr(DD) + Pr(Dd)/2]^{2}. In the unmatched study, denote the HWD coefficients in the case group and the control group as Δ_{p} and Δ_{q}, defined in the same way from the genotype frequencies in cases and in controls. Zheng and Ng showed that Δ_{p} - Δ_{q} > 0 under REC and Δ_{p} - Δ_{q} < 0 under DOM. Using the matched design described above, we denote by Δ_{pl} and Δ_{ql} the HWD coefficients of cases and controls in the l-th stratum. Then Δ_{pl} - Δ_{ql} > 0 under REC and Δ_{pl} - Δ_{ql} < 0 under DOM for each l.

Denote

Notice that the denominator of Z_{SMRT} is estimated under the null hypothesis, thus Z_{SMRT} is a score test. Under the null hypothesis, Z_{SMRT} asymptotically follows a standard normal distribution. Z_{SMRT} tends to be large if the true genetic model is REC and tends to be small if the true genetic model is DOM. Hence, with a pre-specified threshold c (for example, c = Φ^{-1}(0.95)), we can classify the underlying genetic model as REC if Z_{SMRT} > c, as DOM if Z_{SMRT} < -c and as ADD otherwise, and then apply the corresponding Z_{MTT}(x). If D is the risk allele, then Z_{MTT}(0) and Z_{MTT}(1) are optimal for the REC and DOM models respectively. On the other hand, if d is the risk allele, then Z_{MTT}(0) and Z_{MTT}(1) are optimal for the DOM and REC models respectively. Besides, the expected values of Z_{MTT}(0) and Z_{MTT}(1) are negative in this case. Similar to Joo et al., we use the sign of Z_{MTT}(0.5) to determine the risk allele. That is, if Z_{MTT}(0.5) > 0, Z_{MTT}(0), Z_{MTT}(0.5), Z_{MTT}(1) are optimal for the REC, ADD and DOM models; if Z_{MTT}(0.5) ≤ 0, -Z_{MTT}(1), -Z_{MTT}(0.5), -Z_{MTT}(0) are optimal for the REC, ADD and DOM models. Hence, the stratified genetic model selection (SGMS) test is proposed as
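The two-step rule in this paragraph (classify the model from the selection statistic, then pick the matching trend test with the sign determined by the additive test) can be sketched as follows. The function name, argument names and the illustrative input values are assumptions; the sketch only returns the selected model and statistic and does not compute the p-value, which requires the joint null distribution discussed next.

```python
from statistics import NormalDist

def sgms_statistic(z_mtt0, z_mtt05, z_mtt1, z_smrt, c=None):
    """Select a genetic model from z_smrt and return (model, selected statistic)."""
    if c is None:
        c = NormalDist().inv_cdf(0.95)    # threshold, about 1.645
    if z_smrt > c:
        model = "REC"
    elif z_smrt < -c:
        model = "DOM"
    else:
        model = "ADD"
    if z_mtt05 > 0:
        # D is taken as the risk allele.
        chosen = {"REC": z_mtt0, "ADD": z_mtt05, "DOM": z_mtt1}[model]
    else:
        # d is taken as the risk allele: roles of the REC and DOM tests swap and signs flip.
        chosen = {"REC": -z_mtt1, "ADD": -z_mtt05, "DOM": -z_mtt0}[model]
    return model, chosen

# Illustrative (hypothetical) trend statistics with a small selection statistic.
print(sgms_statistic(-1.90, -2.24, -1.68, z_smrt=0.124))
```

Because the selected statistic depends on data-driven choices, it should not be compared with standard normal quantiles directly.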

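Under the null hypothesis of no association, we show that (Z_{MTT}(0.5), Z_{SMRT}, Z_{MTT}(x)) (x = 0, 1) asymptotically follows a trivariate normal distribution with zero mean and covariance matrix **∑**_{x}. In particular, Z_{MTT}(0.5) and Z_{SMRT} are asymptotically independent. Detailed proof and the forms of ρ_{x} and ρ_{x, 0.5} as well as their consistent estimates are derived in the **Appendix**. Define **ø**_{x}(z_{1}, z_{2}, z_{3}) as the density function of the trivariate normal distribution with zero mean and covariance matrix **∑**_{x}. Then, for an observed value of Z_{SGMS}, the corresponding p-value is obtained as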

With a pre-specified significance level α, we declare a significant association if the p-value p_{s} is smaller than α.

Although Z_{SMRT} can be used to determine the underlying genetic model, the probability of selecting the correct genetic model is low when the GRRs are small or moderate. On the other hand, the probability of correctly excluding the most unlikely genetic model remains high even when the GRRs are very small. That is, when Z_{SMRT} > c, the DOM model is excluded, and when Z_{SMRT} < -c, the REC model is excluded.

Similar to Joo et al., Z_{MAT}(0) is optimal for either REC or ADD and Z_{MAT}(1) is optimal for either DOM or ADD. Besides, Z_{MTT}(0.5) is still optimal for just ADD. Utilizing the stratified genetic model exclusion (SGME) strategy, we use Z_{MAT}(0) to test for association if Z_{SMRT} > c, Z_{MAT}(1) if Z_{SMRT} < -c and Z_{MTT}(0.5) otherwise. In addition, similar to SGMS, Z_{MTT}(0.5) is used at the beginning to determine the risk allele. Hence, the statistic for the SGME approach can be written as
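A minimal sketch of the exclusion rule is given below. It assumes that the statistic covering two models is the usual two-test MERT, that is, the standardised sum of two correlated standard-normal statistics, and that the sign convention for the risk allele mirrors the one used for SGMS. Both points are assumptions made for illustration, as are the function and argument names.

```python
import math
from statistics import NormalDist

def mert_pair(z_a, z_b, rho):
    """Standardised sum of two standard-normal statistics with correlation rho."""
    return (z_a + z_b) / math.sqrt(2.0 * (1.0 + rho))

def sgme_statistic(z_mtt0, z_mtt05, z_mtt1, z_smrt, rho_0, rho_1, c=None):
    """Exclude the least likely model using z_smrt and return the test statistic.

    rho_0 and rho_1 are the correlations of the additive test with the
    score-0 and score-1 trend tests, respectively.
    """
    if c is None:
        c = NormalDist().inv_cdf(0.95)
    sign = 1.0
    if z_mtt05 <= 0:
        # d is taken as the risk allele: swap the roles of the two tests, flip signs.
        z_mtt0, z_mtt1 = z_mtt1, z_mtt0
        rho_0, rho_1 = rho_1, rho_0
        sign = -1.0
    if z_smrt > c:          # DOM excluded: combine the REC and ADD tests
        stat = mert_pair(z_mtt0, z_mtt05, rho_0)
    elif z_smrt < -c:       # REC excluded: combine the DOM and ADD tests
        stat = mert_pair(z_mtt1, z_mtt05, rho_1)
    else:                   # neither tail: use the additive test alone
        stat = z_mtt05
    return sign * stat

# Hypothetical inputs for illustration only.
print(sgme_statistic(2.0, 1.5, 0.5, z_smrt=2.0, rho_0=0.8, rho_1=0.8))
```

The standardisation by sqrt(2(1 + rho)) keeps the combined statistic at unit variance under the null, since the variance of the sum of two unit-variance statistics with correlation rho is 2 + 2 rho.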

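Under the null hypothesis of no association, we obtain that (Z_{MTT}(0.5), Z_{SMRT}, Z_{MAT}(x)) (x = 0, 1) asymptotically follows a trivariate normal distribution with zero mean.

Similar to Z_{SGMS}, the p-value of Z_{SGME} can be derived as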

We declare a significant association if the p-value p_{e} is smaller than α.

Other robust procedures

In equation (2), we use one indicator to code three genotypes. On the other hand, if we define two dummy variables ((X_{1lj1}, X_{1lj2}) for the cases and (X_{2ljk1}, X_{2ljk2}) for the controls) taking values (0,0), (0,1) and (1,1) to code the three genotypes G_{0}, G_{1} and G_{2}, the conditional likelihood function becomes

The score test derived from equation (9) asymptotically follows a χ^{2} distribution with 2 df under H_{0} : β_{1} = β_{2} = 0. Note that

Another robust test is MAX3, which was also proposed as an efficiency robust test for unmatched genetic association studies. For the matched design it is taken as the maximum of the absolute values of the three MTTs, Z_{MAX3} = max(|Z_{MTT}(0)|, |Z_{MTT}(0.5)|, |Z_{MTT}(1)|).

Compared with the optimal MTTs and the χ^{2} test with 2 df, MAX3 has the largest minimum power across the three genetic models. The null distribution of Z_{MAX3} can be approximated by Monte-Carlo simulation. In addition, the p-value of Z_{MAX3} can also be obtained according to the asymptotic formula given by Zang et al.
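The Monte-Carlo approximation mentioned above can be sketched with the standard library alone. The sketch assumes that, under the null hypothesis, the three trend statistics are jointly normal with zero mean and a known (or consistently estimated) correlation matrix; the correlation values and function names below are hypothetical and serve only to make the example runnable.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def max3_pvalue(z_values, corr, n_sim=20000, seed=1):
    """Monte-Carlo p-value of max |Z| under a zero-mean normal null with correlation corr."""
    observed = max(abs(z) for z in z_values)
    low = cholesky(corr)
    rng = random.Random(seed)
    dim = len(corr)
    hits = 0
    for _ in range(n_sim):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        draw = [sum(low[i][k] * noise[k] for k in range(i + 1)) for i in range(dim)]
        if max(abs(v) for v in draw) >= observed:
            hits += 1
    return hits / n_sim

# Hypothetical null correlation matrix of the three trend tests (scores 0, 0.5, 1).
corr = [[1.0, 0.8, 0.3],
        [0.8, 1.0, 0.8],
        [0.3, 0.8, 1.0]]
print(max3_pvalue([1.2, 2.0, 1.7], corr))
```

The resulting p-value lies between the p-value of a single test and the Bonferroni bound for three tests, reflecting the positive correlation among the statistics.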

Results

Simulation

To check whether GMS and GME can keep the correct size in the presence of confounding factors, we carried out simulation studies to examine their performance in the presence of sub-populations. The nominal level was set at 0.05. We assumed that, due to confounding factors, each of the case and control populations was divided into two sub-populations with equal probability. The simulation results are summarized in Table 1.

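Table 1. Type I error rates of GMS and GME based on 10,000 replicates without confounding (Scenario 1) and in the presence of confounding factors (Scenarios 2-8), with the significance level 0.05, using r_{l}, s_{l}, p_{l} and k_{l} to denote the number of cases, the number of controls, the allele frequency and the disease prevalence in the l-th sub-population.

| **Scenario** | **r_{1}** | **r_{2}** | **s_{1}** | **s_{2}** | **p_{1}** | **p_{2}** | **k_{1}** | **k_{2}** | **GMS** | **GME** |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 250 | 250 | 250 | 250 | 0.3 | 0.3 | 0.05 | 0.05 | 0.0510 | 0.0502 |
| 2 | 250 | 250 | 250 | 250 | 0.05 | 0.5 | 0.01 | 0.1 | 0.0199 | 0.0141 |
| 3 | 250 | 250 | 250 | 250 | 0.1 | 0.5 | 0.01 | 0.1 | 0.0190 | 0.0167 |
| 4 | 250 | 250 | 250 | 250 | 0.2 | 0.4 | 0.03 | 0.07 | 0.0391 | 0.0384 |
| 5 | 300 | 200 | 200 | 300 | 0.2 | 0.4 | 0.03 | 0.07 | 0.3923 | 0.4403 |
| 6 | 325 | 175 | 175 | 325 | 0.2 | 0.4 | 0.03 | 0.07 | 0.7337 | 0.7880 |
| 7 | 350 | 150 | 150 | 350 | 0.2 | 0.4 | 0.03 | 0.07 | 0.9077 | 0.9567 |
| 8 | 375 | 125 | 125 | 375 | 0.2 | 0.4 | 0.03 | 0.07 | 0.9625 | 0.9954 |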

To check whether the ability of Z_{SMRT} to select the correct genetic model is low when the GRRs are small, we conducted a simulation to compare the selection procedure with the exclusion procedure. We considered 300 cases with 600 matched controls, and the samples were divided into 3 sub-populations with proportions 0.3, 0.3 and 0.4 respectively. We set the MAFs and the penetrances in the three strata as (p_{1}, p_{2}, p_{3}) = (0.1, 0.3, 0.5) and (f_{01}, f_{02}, f_{03}) = (0.01, 0.05, 0.02). The threshold was c = Φ^{-1}(0.95), and we let GRR2 = λ_{2l} increase from 1.1 to 2.0 with increments of 0.1.
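A simulation of this kind needs the genotype distributions of cases and controls implied by the allele frequency, the baseline penetrance and the GRRs. The sketch below derives them by Bayes' rule under HWE within a stratum; it is an illustration under those assumptions, not the authors' code, and the function names and the example parameter values are hypothetical.

```python
def grr1_for_model(model, grr2):
    """Heterozygote GRR implied by the homozygote GRR under REC, ADD or DOM."""
    return {"REC": 1.0, "ADD": (grr2 + 1.0) / 2.0, "DOM": grr2}[model]

def genotype_frequencies(maf, f0, grr1, grr2):
    """Genotype frequencies (G0, G1, G2) in cases and controls, plus the prevalence."""
    q = 1.0 - maf
    population = [q * q, 2.0 * maf * q, maf * maf]       # HWE within the stratum
    penetrance = [f0, f0 * grr1, f0 * grr2]
    prevalence = sum(g * f for g, f in zip(population, penetrance))
    cases = [g * f / prevalence for g, f in zip(population, penetrance)]
    controls = [g * (1.0 - f) / (1.0 - prevalence)
                for g, f in zip(population, penetrance)]
    return cases, controls, prevalence

# Example: allele frequency 0.3, baseline penetrance 0.05, additive model with GRR2 = 2.
grr2 = 2.0
cases, controls, k = genotype_frequencies(0.3, 0.05, grr1_for_model("ADD", grr2), grr2)
print(cases, controls, k)
```

Genotype counts for a stratum can then be drawn from multinomial distributions with these probabilities. Setting both GRRs to 1 gives identical distributions for cases and controls, which corresponds to the null hypothesis.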

The results are summarized in Figure 1.

**The probabilities of correctly selecting the genetic models and of correctly excluding the most unlikely genetic models based on 10,000 replicates**.

Next, we performed simulations with no disease association and under various genetic models to evaluate the performance of the proposed robust methods. Moreover, we also considered the MTTs optimal for the REC, ADD and DOM models, i.e. Z_{MTT}(0), Z_{MTT}(0.5) and Z_{MTT}(1) respectively.

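Table 2. Type I error rates of Z_{MTT}(0), Z_{MTT}(0.5), Z_{MTT}(1), Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df

| **α** | **Scenario** | **Z_{MTT}(0)** | **Z_{MTT}(0.5)** | **Z_{MTT}(1)** | **Z_{SGMS}** | **Z_{SGME}** | **Z_{MAX3}** | **χ^{2} (2 df)** |
|---|---|---|---|---|---|---|---|---|
| 0.05 | A | 0.0527 | 0.0498 | 0.0518 | 0.0501 | 0.0490 | 0.0527 | 0.0531 |
| 0.05 | B | 0.0487 | 0.0503 | 0.0493 | 0.0494 | 0.0502 | 0.0515 | 0.0481 |
| 0.05 | C | 0.0510 | 0.0512 | 0.0509 | 0.0506 | 0.0510 | 0.0513 | 0.0490 |
| 0.05 | D | 0.0526 | 0.0524 | 0.0537 | 0.0529 | 0.0528 | 0.0516 | 0.0484 |
| 0.05 | E | 0.0519 | 0.0512 | 0.0534 | 0.0536 | 0.0526 | 0.0501 | 0.0507 |
| 0.05 | F | 0.0485 | 0.0486 | 0.0479 | 0.0488 | 0.0467 | 0.0501 | 0.0481 |
| 0.05 | G | 0.0493 | 0.0497 | 0.0490 | 0.0492 | 0.0498 | 0.0457 | 0.0491 |
| 0.05 | H | 0.0522 | 0.0493 | 0.0480 | 0.0522 | 0.0521 | 0.0525 | 0.0522 |
| 0.01 | A | 0.0092 | 0.0081 | 0.0100 | 0.0083 | 0.0081 | 0.0075 | 0.0084 |
| 0.01 | B | 0.0096 | 0.0091 | 0.0106 | 0.0106 | 0.0103 | 0.0096 | 0.0101 |
| 0.01 | C | 0.0093 | 0.0101 | 0.0109 | 0.0108 | 0.0104 | 0.0101 | 0.0109 |
| 0.01 | D | 0.0121 | 0.0101 | 0.0098 | 0.0093 | 0.0093 | 0.0095 | 0.0105 |
| 0.01 | E | 0.0098 | 0.0094 | 0.0101 | 0.0093 | 0.0092 | 0.0083 | 0.0114 |
| 0.01 | F | 0.0109 | 0.0095 | 0.0106 | 0.0109 | 0.0102 | 0.0102 | 0.0109 |
| 0.01 | G | 0.0100 | 0.0094 | 0.0105 | 0.0114 | 0.0103 | 0.0111 | 0.0093 |
| 0.01 | H | 0.0087 | 0.0111 | 0.0104 | 0.0099 | 0.0103 | 0.0117 | 0.0107 |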

We considered eight separate scenarios (A to H) with different numbers of cases, controls, risk allele frequencies and disease prevalences. For example, in scenario A, 150, 150 and 200 cases from 3 different sub-populations comprised the whole case group and each case was matched with 2 controls within the same sub-population. The risk allele frequencies of the 3 sub-populations were 0.1, 0.3 and 0.5 respectively and the disease prevalences were 0.01, 0.05 and 0.02. Table 2 reports the resulting Type I error rates.

We also conducted simulations to investigate the performance of the proposed tests for small samples, where the number of cases is at most 100. The results are summarized in Table 3.

Table 3. Type I error rates of Z_{MTT}(0), Z_{MTT}(0.5), Z_{MTT}(1), Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df

| **α** | **Scenario** | **Z_{MTT}(0)** | **Z_{MTT}(0.5)** | **Z_{MTT}(1)** | **Z_{SGMS}** | **Z_{SGME}** | **Z_{MAX3}** | **χ^{2} (2 df)** |
|---|---|---|---|---|---|---|---|---|
| 0.05 |  | 0.0474 | 0.0501 | 0.0487 | 0.0493 | 0.0495 | 0.0452 | 0.0502 |
| 0.05 |  | 0.0531 | 0.0479 | 0.0497 | 0.0480 | 0.0485 | 0.0462 | 0.0503 |
| 0.05 |  | 0.0569 | 0.0526 | 0.0489 | 0.0507 | 0.0526 | 0.0516 | 0.0488 |
| 0.05 |  | 0.0470 | 0.0480 | 0.0516 | 0.0482 | 0.0484 | 0.0488 | 0.0503 |
| 0.05 |  | 0.0492 | 0.0498 | 0.0489 | 0.0518 | 0.0511 | 0.0535 | 0.0498 |
| 0.05 |  | 0.0519 | 0.0503 | 0.0514 | 0.0535 | 0.0537 | 0.0486 | 0.0489 |
| 0.05 |  | 0.0484 | 0.0505 | 0.0526 | 0.0483 | 0.0466 | 0.0551 | 0.0502 |
| 0.05 |  | 0.0504 | 0.0453 | 0.0451 | 0.0451 | 0.0456 | 0.0484 | 0.0504 |
| 0.01 |  | 0.0076 | 0.0083 | 0.0092 | 0.0102 | 0.0089 | 0.0108 | 0.0126 |
| 0.01 |  | 0.0075 | 0.0091 | 0.0097 | 0.0078 | 0.0088 | 0.0086 | 0.0081 |
| 0.01 |  | 0.0078 | 0.0089 | 0.0096 | 0.0082 | 0.0084 | 0.0126 | 0.0092 |
| 0.01 |  | 0.0080 | 0.0095 | 0.0116 | 0.0093 | 0.0092 | 0.0111 | 0.0099 |
| 0.01 |  | 0.0072 | 0.0093 | 0.0098 | 0.0099 | 0.0092 | 0.0091 | 0.0120 |
| 0.01 |  | 0.0087 | 0.0081 | 0.0089 | 0.0085 | 0.0081 | 0.0091 | 0.0116 |
| 0.01 |  | 0.0077 | 0.0120 | 0.0120 | 0.0113 | 0.0125 | 0.0081 | 0.0102 |
| 0.01 |  | 0.0079 | 0.0087 | 0.0088 | 0.0073 | 0.0081 | 0.0098 | 0.0095 |

The results are simulated based on 10,000 replicates in the presence of confounding factors with the significance level α.

The powers of the MTTs and robust tests were compared under three genetic models (REC, ADD and DOM). The settings were the same as those in Table 2.

Table 4. Empirical powers of Z_{MTT}(0), Z_{MTT}(0.5), Z_{MTT}(1), Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df

| **Scenario** | **Model** | **Z_{MTT}(0)** | **Z_{MTT}(0.5)** | **Z_{MTT}(1)** | **Z_{SGMS}** | **Z_{SGME}** | **Z_{MAX3}** | **χ^{2} (2 df)** | **ρ*** |
|---|---|---|---|---|---|---|---|---|---|
| A | REC | 0.8059 | 0.5590 | 0.1369 | 0.6981 | 0.6760 | **0.7340** | 0.7154 | 0.3267 |
| A | ADD | 0.4890 | 0.7998 | 0.7126 | 0.7594 | **0.7896** | 0.7629 | 0.7188 | 0.7623 |
| A | DOM | 0.1237 | 0.6818 | 0.8040 | 0.7142 | 0.7155 | **0.7300** | 0.7158 | 0.2639 |
| B | REC | 0.8073 | 0.5367 | 0.1356 | 0.6725 | 0.6497 | **0.7229** | 0.7147 | 0.3423 |
| B | ADD | 0.4637 | 0.7977 | 0.7258 | 0.7646 | **0.7908** | 0.7502 | 0.7140 | 0.7445 |
| B | DOM | 0.1287 | 0.7011 | 0.8054 | 0.7168 | 0.7259 | **0.7383** | 0.7199 | 0.2756 |
| C | REC | 0.8057 | 0.5503 | 0.1330 | 0.6970 | 0.6691 | **0.7244** | 0.7153 | 0.3038 |
| C | ADD | 0.4896 | 0.8052 | 0.7139 | 0.7654 | **0.7952** | 0.7648 | 0.7094 | 0.7295 |
| C | DOM | 0.1193 | 0.6877 | 0.8062 | 0.7153 | 0.7210 | **0.7445** | 0.7112 | 0.2710 |
| D | REC | 0.7978 | 0.5235 | 0.1400 | 0.6655 | 0.6433 | **0.7177** | 0.7144 | 0.3124 |
| D | ADD | 0.4639 | 0.8045 | 0.7308 | 0.7654 | **0.7934** | 0.7499 | 0.7090 | 0.7037 |
| D | DOM | 0.1204 | 0.7024 | 0.8071 | 0.7144 | 0.7225 | **0.7250** | 0.7033 | 0.2792 |
| E | REC | 0.7974 | 0.5276 | 0.1453 | 0.7166 | 0.6782 | **0.7288** | 0.7218 | 0.3492 |
| E | ADD | 0.4667 | 0.8014 | 0.7294 | 0.7554 | **0.7884** | 0.7517 | 0.7131 | 0.7396 |
| E | DOM | 0.1261 | 0.6989 | 0.8027 | 0.7135 | **0.7234** | 0.7161 | 0.7077 | 0.2795 |
| F | REC | 0.8056 | 0.5547 | 0.1535 | 0.6991 | 0.6742 | **0.7317** | 0.7173 | 0.3561 |
| F | ADD | 0.5014 | 0.8068 | 0.7195 | 0.7650 | **0.7955** | 0.7524 | 0.7112 | 0.7661 |
| F | DOM | 0.1389 | 0.6933 | 0.8003 | 0.7167 | 0.7231 | **0.7291** | 0.7141 | 0.2887 |
| G | REC | 0.8045 | 0.5241 | 0.1499 | 0.7233 | 0.6786 | **0.7316** | 0.7082 | 0.3172 |
| G | ADD | 0.4562 | 0.8020 | 0.7313 | 0.7522 | **0.7883** | 0.7560 | 0.7068 | 0.6970 |
| G | DOM | 0.1247 | 0.7091 | 0.8004 | 0.7167 | 0.7297 | **0.7468** | 0.7099 | 0.2825 |
| H | REC | 0.7967 | 0.5481 | 0.1467 | 0.6950 | 0.6651 | **0.7203** | 0.7136 | 0.3279 |
| H | ADD | 0.4885 | 0.8017 | 0.7154 | 0.7610 | **0.7911** | 0.7529 | 0.7009 | 0.7272 |
| H | DOM | 0.1254 | 0.6944 | 0.8055 | 0.7183 | 0.7248 | **0.7366** | 0.7091 | 0.2945 |

The settings are the same as those in Table 2 except that the GRRs are determined so that the optimal MTT has the maximum power of about 80%. The significance level is 0.05.

From Table 4, the minimum powers of Z_{MTT}(0) and Z_{MTT}(1) are below 20% and the minimum powers of Z_{MTT}(0.5) are between 50% and 60%. On the other hand, the minimum powers of the robust tests are about 65% across all genetic models. Table 4 also shows that Z_{SGME} and Z_{MAX3} perform better than the other two robust tests, and that neither Z_{SGME} nor Z_{MAX3} always dominates the other. Let ρ* denote the minimum correlation between the MTT optimal for the true genetic model and the other two MTTs; for example, under the REC model, ρ* = min(corr(Z_{MTT}(0), Z_{MTT}(0.5)), corr(Z_{MTT}(0), Z_{MTT}(1))). This quantity helps to choose between Z_{SGME} and Z_{MAX3}. From Table 4, when ρ* is small, Z_{MAX3} is at least as powerful as Z_{SGME}. However, when ρ* is large, Z_{SGME} is a better choice. Notice that this finding is similar to the property of the efficiency robust procedures in survival data analysis studied by Freidlin et al.

We further compared Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df, where λ_{2} increased from 1.1 to 2.0 with increments of 0.1 and λ_{1l} = 1 + x(λ_{2l} - 1). The results are summarized in Figures 2 to 5.

**Figure 2: Empirical powers of Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df**. The significance level is 0.05.

**Figure 3: Empirical powers of Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df**. The significance level is 0.05.

**Figure 4: Empirical powers of Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df**. The significance level is 0.05.

**Figure 5: Empirical powers of Z_{SGMS}, Z_{SGME}, Z_{MAX3} and the χ^{2} test with 2 df**. The significance level is 0.05.

Notice that Z_{SGMS}, Z_{SGME} and Z_{MAX3} have comparable powers, although Z_{MAX3} may be slightly more powerful than the other two tests under the REC and DOM models, and Z_{SGME} may dominate Z_{MAX3} and Z_{SGMS} under the ADD model. Z_{SGME} is slightly more powerful than Z_{SGMS} and Z_{MAX3}, and

Under the under-recessive model where f_{1l} < f_{0l}, the χ^{2} test with 2 df is more powerful than Z_{SGMS} and Z_{MAX3}. Z_{SGME} performs the worst in such a situation. Under the over-dominant model where f_{1l} > f_{2l}, all the robust tests perform very similarly.

To summarize, if the mean genetic effect of the heterozygous genotype is between those of the two homozygous genotypes, then we suggest Z_{MAX3}, Z_{SGMS} and Z_{SGME}. On the other hand, if the genetic effects are not ranked in accordance with the genotypes, then the χ^{2} test with 2 df is suggested.

Notice that in our simulation we consider the common disease common variant (CDCV) hypothesis, which is currently the most popular theory underlying complex disease etiology. However, if the common disease rare variant (CDRV) assumption holds, which implies that the disease etiology is caused collectively by multiple rare variants with moderate to high penetrances, the proposed tests are conservative and underpowered for detecting association.

An application

We applied the MTTs and the robust tests to a matched pair case-control etiologic study of sarcoidosis (ACCESS).

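Table 5. The pair-matched case-control study of ACCESS.

| Stratum | Cases | Controls '11' | Controls '13' | Controls '33' | Total |
|---|---|---|---|---|---|
| Caucasian | '11' | 0 | 0 | 1 | 1 |
| Caucasian | '13' | 0 | 9 | 36 | 45 |
| Caucasian | '33' | 2 | 29 | 201 | 232 |
| Caucasian | Total | 2 | 38 | 238 | 278 |
| Female/African-American | '11' | 1 | 11 | 8 | 20 |
| Female/African-American | '13' | 8 | 26 | 40 | 74 |
| Female/African-American | '33' | 4 | 34 | 24 | 62 |
| Female/African-American | Total | 13 | 71 | 72 | 156 |
| Male/African-American | '11' | 1 | 2 | 5 | 8 |
| Male/African-American | '13' | 1 | 14 | 17 | 32 |
| Male/African-American | '33' | 1 | 11 | 11 | 23 |
| Male/African-American | Total | 3 | 27 | 33 | 63 |
| Combined | '11' | 2 | 13 | 14 | 29 |
| Combined | '13' | 9 | 49 | 93 | 151 |
| Combined | '33' | 7 | 74 | 236 | 317 |
| Combined | Total | 18 | 136 | 343 | 497 |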

First we applied the MTTs optimal for the REC, ADD and DOM models to the data set and obtained p-values of 0.058, 0.025 and 0.093 for Z_{MTT}(0), Z_{MTT}(0.5) and Z_{MTT}(1) respectively. Thus, whether or not there is a significant association is unclear at the nominal level 0.05 because different genetic models give different answers.
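For readers who want to check these numbers, the following sketch computes a trend statistic for 1:1 matched pairs from the combined pair counts of the ACCESS table. It assumes that, for pair matching, the MTT reduces to the usual conditional logistic score statistic, the sum of within-pair score differences divided by the square root of the sum of their squares, and that genotypes '11', '13' and '33' are scored 0, x and 1. Both are assumptions made for this illustration; under them the two-sided p-values come out close to the three values quoted above.

```python
import math
from statistics import NormalDist

# Combined pair counts: rows are case genotypes, columns are control genotypes,
# both in the order '11', '13', '33'.
pairs = [[2, 13, 14],
         [9, 49, 93],
         [7, 74, 236]]

def paired_trend(pair_counts, x):
    """Score statistic and two-sided p-value for 1:1 matched pairs with scores (0, x, 1)."""
    scores = (0.0, x, 1.0)          # assumed coding: '11' -> 0, '13' -> x, '33' -> 1
    total = 0.0
    total_sq = 0.0
    for i, row in enumerate(pair_counts):
        for j, count in enumerate(row):
            diff = scores[i] - scores[j]      # case score minus control score
            total += count * diff
            total_sq += count * diff * diff
    z = total / math.sqrt(total_sq)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

results = {x: paired_trend(pairs, x) for x in (0.0, 0.5, 1.0)}
for x, (z, p) in results.items():
    print(x, round(z, 3), round(p, 3))
```

Concordant pairs (the diagonal of the table) contribute nothing to the statistic, so only the discordant pairs drive the result. The negative signs of the statistics under this coding indicate that the '33' genotype is less frequent among cases than among their matched controls.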

Then we applied the χ^{2} test with 2 df and Z_{MAX3} to the data set and obtained p-values of 0.076 and 0.056, which also fail to provide a conclusive finding at the significance level of 0.05. Note that the p-value of Z_{MAX3} was calculated according to the asymptotic formula obtained by Zang et al. Finally, we applied Z_{SGMS} and Z_{SGME} to the same data. We obtained Z_{SMRT} = 0.124, which falls in the interval [-1.645, 1.645] and strongly suggests an ADD model. Thus, for SGMS we select ADD and for SGME we exclude REC and DOM. Using formulas (6) and (8) we obtained p-values of 0.0398 for Z_{SGMS} and 0.0310 for Z_{SGME}, both suggesting a marginally significant association. According to our simulation, Z_{SGME} is the most powerful robust test under the ADD model. We also obtained the minimum correlation of the optimal tests, which indicates that Z_{SGME} is a better choice than Z_{MAX3} according to our previous discussion of Table 4.

Discussion

In this paper, we extended the GMS and GME methods to the matched case-control design and compared the resulting SGMS and SGME tests with MAX3 and the χ^{2} test with 2 df. Simulations were carried out to examine the robustness of all these tests. The tests were also used to analyze a real pair-matched data set of sarcoidosis. Simulation results indicate that when the genetic model is unknown, a mis-specification of the genetic model may result in a substantial loss of power for the MTTs. In this situation, robust tests are preferred. Further comparisons among the robust tests were also conducted. According to our simulation, when the genetic effects are ordered in accordance with their genotypes, MAX3, SGMS and SGME are preferred. On the other hand, if the less plausible genetic models such as the over-dominant and under-recessive models cannot be excluded, then the χ^{2} test with 2 df is a good choice.

We adopted the matching framework at the stage of recruiting samples, so our study is a pre-matched case-control association study. In practice, even in the unmatched case-control design, matching is still an important tool to eliminate the effect of latent confounding factors such as population stratification and cryptic relatedness. For example, Guan et al.

Conclusion

Simulation results and real data analysis show that SGMS and SGME keep the correct Type I error rate for stratified data while having good efficiency robustness against genetic model uncertainty. Besides, the formulas proposed in this paper can easily be used to calculate the corresponding p-values. Thus, SGMS and SGME are useful for the genetic data analysis of matched case-control designs.

Authors' contributions

ZY carried out the project and wrote the draft of the manuscript. FWK proposed the idea and revised the manuscript. Both authors read and approved the manuscript.

Appendix

First we derive the correlation ρ_{x} between Z_{SMRT} and Z_{MTT}(x). For Z_{MTT}(x), under the null hypothesis,

Following Zheng and Ng

When

Substitute _{l}
_{x }

Next we report the correlation ρ_{x, 0.5} between Z_{MTT}(x) and Z_{MTT}(0.5) (x = 0, 1). Under the null hypothesis,

Since

Substituting the estimates of p_{il} (i = 0,1,2), we obtain the estimate of ρ_{x, 0.5} (x = 0,1).

Acknowledgements

The research of Y. Zang was partially supported by the China Natural Science Foundation grant 10701067 and the research of W. K. Fung was partially supported by the HKU Research Output Prize Funding.