Biostatistics Center and Graduate Institute of Biostatistics, China Medical University, Taichung, Taiwan

Graduate Institute of Statistics, National Central University, Chungli, Taiwan

Abstract

Background

It is well known that the presence of population stratification (PS) may cause the usual test in case-control studies to produce spurious gene-disease associations. However, the joint impact of PS and sample selection (SS) is less well understood. In this paper, we provide a systematic study of the joint effect of PS and SS under a more general risk model containing genetic and environmental factors. We provide simulation results to show the magnitude of the bias and its impact on the type I error rate of the usual chi-square test under a wide range of PS levels and selection biases.

Results

The biases in the estimation of the main and interaction effects are quantified and their bounds are derived. The estimated bounds can be used to compute conservative p-values for the association test. If the conservative p-value is smaller than the significance level, we can safely claim that the association test is significant regardless of whether PS or selection bias is present. We also identify conditions under which the bias is null. The bias depends on the allele frequencies, exposure rates, gene-environment odds ratios and disease risks across subpopulations, and on the sampling of the cases and controls.

Conclusion

Our results show that the bias cannot be ignored even when the case and control data are matched in ethnicity. A real example is given to illustrate the application of the conservative p-value. These results are useful for genetic association studies of main and interaction effects.

Background

In the search for causative agents of human disease, both environmental and genetic risk factors have been identified. There is strong evidence that relatively common polymorphisms in a wide spectrum of genes may modify the effect of environmental agents

Many association designs have been proposed for studying gene-environment or gene-gene interactions. Recently, Wang and Zhao

In this paper, we investigate the joint effect of population stratification and sample selection in testing null main or interaction effects. Under general sampling, we quantify the magnitude of the PS-SS bias in terms of the baseline disease risks, genotype frequencies, exposure rates, their odds ratios (linkage disequilibrium coefficients), and the effect sizes of the risk factors. Based on this result, we find that matching in ethnicity cannot eliminate bias in association studies. Using the bias, we are also able to derive important conditions under which it is null.

The PS-SS bias cannot be estimated, because we do not know how many subpopulations are involved in the study population or which subpopulation a person belongs to. Although adjusting for covariates such as principal components can be used to account for PS in genome-wide association studies

Results

The Magnitude of the Bias

We begin this section with the notation that will be used throughout this work. Disease status is denoted as

To quantify the PS effect, we assume that the risk model is given by

where the genetic and environmental data are obtained from subpopulation _{s }

as the baseline _{s }
_{s }

In the discussion of PS effect, one often assumes that case and control data are sampled according to the SRS design. Let ^{#}(^{#}(_{s }
_{s }

Since in the population level we only observe factors

where

and

exp(_{s}DS_{s }

therefore, if the disease prevalence _{s}DS_{s }

Maximal bias and conditions for the null bias

Here, we give conditions for the null bias and bounds for bias. The bias exp(

Note that the bias ^{† }= 1), and the disease risk is constant, then the bias is also null. (However, if the sampling is not SRS, the bias may be non-null; see Tables

Biases and the true type I errors of the chi-square tests when ^{† }= 5 and LD = (0,0)

| H^{†} | D^{†} | DS^{†} | Bias \|β\*\| (γ = 0) | Bias \|δ\*\| (γ = 0) | Type I error α_{β} (γ = 0) | Type I error α_{δ} (γ = 0) | Bias \|β\*\| (γ = 1) | Bias \|δ\*\| (γ = 1) | Type I error α_{β} (γ = 1) | Type I error α_{δ} (γ = 1) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 1 | 1 | 3 | 0.2365 | 0.0000 | 0.3815 | 0.0500 | 0.2365 | 0.0000 | 0.3412 | 0.0500 |
| 1 | 1 | 5 | 0.2975 | 0.0000 | 0.5513 | 0.0500 | 0.2975 | 0.0000 | 0.4970 | 0.0500 |
| 1 | 1 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 1 | 3 | 1 | 0.3725 | 0.0000 | 0.7134 | 0.0500 | 0.3725 | 0.0000 | 0.6530 | 0.0500 |
| 1 | 3 | 3 | 0.5953 | 0.0000 | 0.9823 | 0.0500 | 0.5953 | 0.0000 | 0.9661 | 0.0500 |
| 1 | 3 | 5 | 0.6518 | 0.0000 | 0.9937 | 0.0500 | 0.6518 | 0.0000 | 0.9857 | 0.0500 |
| 1 | 3 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 1 | 5 | 1 | 0.5573 | 0.0000 | 0.9602 | 0.0500 | 0.5573 | 0.0000 | 0.9326 | 0.0500 |
| 1 | 5 | 3 | 0.7679 | 0.0000 | 0.9993 | 0.0500 | 0.7679 | 0.0000 | 0.9977 | 0.0500 |
| 1 | 5 | 5 | 0.8205 | 0.0000 | 0.9998 | 0.0500 | 0.8205 | 0.0000 | 0.9992 | 0.0500 |
| 1 | 5 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 3 | 1 | 1 | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 3 | 1 | 3 | 0.1916 | 0.1548 | 0.2583 | 0.0796 | 0.1916 | 0.1548 | 0.2232 | 0.0830 |
| 3 | 1 | 5 | 0.2383 | 0.2157 | 0.3729 | 0.1074 | 0.2383 | 0.2157 | 0.3201 | 0.1139 |
| 3 | 1 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0660 | 0.0285 | 0.0688 | 0.0511 |
| 3 | 3 | 1 | 0.3342 | 0.0762 | 0.5794 | 0.0572 | 0.3310 | 0.0796 | 0.4827 | 0.0584 |
| 3 | 3 | 3 | 0.5134 | 0.2312 | 0.9209 | 0.1163 | 0.5071 | 0.2345 | 0.8439 | 0.1232 |
| 3 | 3 | 5 | 0.5564 | 0.2892 | 0.9559 | 0.1538 | 0.5493 | 0.2918 | 0.8971 | 0.1632 |
| 3 | 3 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0930 | 0.0073 | 0.0812 | 0.0501 |
| 3 | 5 | 1 | 0.5129 | 0.0683 | 0.8997 | 0.0557 | 0.5058 | 0.0776 | 0.8083 | 0.0577 |
| 3 | 5 | 3 | 0.6812 | 0.2225 | 0.9918 | 0.1104 | 0.6687 | 0.2311 | 0.9657 | 0.1187 |
| 3 | 5 | 5 | 0.7210 | 0.2779 | 0.9962 | 0.1442 | 0.7071 | 0.2852 | 0.9796 | 0.1546 |
| 3 | 5 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0957 | 0.0222 | 0.0799 | 0.0506 |
| 5 | 1 | 1 | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 5 | 1 | 3 | 0.1608 | 0.2158 | 0.1912 | 0.1113 | 0.1608 | 0.2158 | 0.1639 | 0.1164 |
| 5 | 1 | 5 | 0.1986 | 0.3042 | 0.2693 | 0.1720 | 0.1986 | 0.3042 | 0.2270 | 0.1816 |
| 5 | 1 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0884 | 0.0532 | 0.0815 | 0.0541 |
| 5 | 3 | 1 | 0.3005 | 0.1007 | 0.4697 | 0.0635 | 0.2951 | 0.1081 | 0.3676 | 0.0659 |
| 5 | 3 | 3 | 0.4501 | 0.3178 | 0.8213 | 0.1855 | 0.4405 | 0.3252 | 0.6897 | 0.1942 |
| 5 | 3 | 5 | 0.4848 | 0.4026 | 0.8762 | 0.2656 | 0.4741 | 0.4085 | 0.7551 | 0.2750 |
| 5 | 3 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.1325 | 0.0192 | 0.1063 | 0.0505 |
| 5 | 5 | 1 | 0.4702 | 0.0892 | 0.8176 | 0.0605 | 0.4574 | 0.1089 | 0.6735 | 0.0655 |
| 5 | 5 | 3 | 0.6101 | 0.3062 | 0.9661 | 0.1738 | 0.5901 | 0.3249 | 0.8820 | 0.1880 |
| 5 | 5 | 5 | 0.6423 | 0.3875 | 0.9794 | 0.2470 | 0.6203 | 0.4034 | 0.9122 | 0.2609 |
| 5 | 5 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.1409 | 0.0474 | 0.1064 | 0.0529 |

PM means that perfect matching ^{#}(^{#}(

Biases and true type I errors of the chi-square tests when ^{† }= 5 and LD = (0,0.05)

| H^{†} | D^{†} | DS^{†} | Bias \|β\*\| (γ = 0) | Bias \|δ\*\| (γ = 0) | Type I error α_{β} (γ = 0) | Type I error α_{δ} (γ = 0) | Bias \|β\*\| (γ = 1) | Bias \|δ\*\| (γ = 1) | Type I error α_{β} (γ = 1) | Type I error α_{δ} (γ = 1) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 1 | 1 | 3 | 0.1862 | 0.3173 | 0.2456 | 0.1731 | 0.1862 | 0.3173 | 0.2116 | 0.1886 |
| 1 | 1 | 5 | 0.2313 | 0.4242 | 0.3535 | 0.2709 | 0.2313 | 0.4242 | 0.3021 | 0.2976 |
| 1 | 1 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0710 | 0.0871 | 0.0715 | 0.0598 |
| 1 | 3 | 1 | 0.3288 | 0.3309 | 0.5611 | 0.1735 | 0.3281 | 0.3208 | 0.4722 | 0.1791 |
| 1 | 3 | 3 | 0.5028 | 0.6401 | 0.9076 | 0.5019 | 0.5014 | 0.6166 | 0.8324 | 0.5127 |
| 1 | 3 | 5 | 0.5443 | 0.7413 | 0.9463 | 0.6209 | 0.5427 | 0.7122 | 0.8873 | 0.6299 |
| 1 | 3 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0972 | 0.0634 | 0.0837 | 0.0543 |
| 1 | 5 | 1 | 0.5062 | 0.4591 | 0.8883 | 0.2776 | 0.5046 | 0.4356 | 0.8052 | 0.2784 |
| 1 | 5 | 3 | 0.6695 | 0.7603 | 0.9894 | 0.6206 | 0.6667 | 0.7132 | 0.9643 | 0.6110 |
| 1 | 5 | 5 | 0.7080 | 0.8563 | 0.9948 | 0.7207 | 0.7048 | 0.8001 | 0.9787 | 0.7072 |
| 1 | 5 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0971 | 0.0486 | 0.0806 | 0.0523 |
| 3 | 1 | 1 | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 3 | 1 | 3 | 0.1365 | 0.4049 | 0.1484 | 0.2659 | 0.1365 | 0.4049 | 0.1278 | 0.2821 |
| 3 | 1 | 5 | 0.1677 | 0.5542 | 0.2022 | 0.4417 | 0.1677 | 0.5542 | 0.1700 | 0.4669 |
| 3 | 1 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0961 | 0.1592 | 0.0851 | 0.0842 |
| 3 | 3 | 1 | 0.2693 | 0.3457 | 0.3779 | 0.1993 | 0.2634 | 0.3503 | 0.2862 | 0.2072 |
| 3 | 3 | 3 | 0.3958 | 0.7451 | 0.6991 | 0.6654 | 0.3859 | 0.7440 | 0.5461 | 0.6719 |
| 3 | 3 | 5 | 0.4244 | 0.8876 | 0.7629 | 0.8067 | 0.4135 | 0.8823 | 0.6072 | 0.8083 |
| 3 | 3 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.1517 | 0.0739 | 0.1175 | 0.0561 |
| 3 | 5 | 1 | 0.4286 | 0.4464 | 0.7192 | 0.2912 | 0.4138 | 0.4620 | 0.5509 | 0.3090 |
| 3 | 5 | 3 | 0.5465 | 0.8394 | 0.9110 | 0.7501 | 0.5248 | 0.8442 | 0.7650 | 0.7536 |
| 3 | 5 | 5 | 0.5730 | 0.9756 | 0.9361 | 0.8607 | 0.5495 | 0.9731 | 0.8041 | 0.8575 |
| 3 | 5 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.1656 | 0.0311 | 0.1203 | 0.0510 |
| 5 | 1 | 1 | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0000 | 0.0000 | 0.0500 | 0.0500 |
| 5 | 1 | 3 | 0.1034 | 0.4594 | 0.1039 | 0.3341 | 0.1034 | 0.4594 | 0.0917 | 0.3479 |
| 5 | 1 | 5 | 0.1262 | 0.6322 | 0.1325 | 0.5520 | 0.1262 | 0.6322 | 0.1135 | 0.5712 |
| 5 | 1 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.0942 | 0.2098 | 0.0812 | 0.1101 |
| 5 | 3 | 1 | 0.2198 | 0.3865 | 0.2562 | 0.2424 | 0.2106 | 0.4008 | 0.1850 | 0.2529 |
| 5 | 3 | 3 | 0.3151 | 0.8406 | 0.4848 | 0.7777 | 0.3007 | 0.8531 | 0.3371 | 0.7791 |
| 5 | 3 | 5 | 0.3360 | 1.0059 | 0.5407 | 0.8992 | 0.3203 | 1.0147 | 0.3769 | 0.8962 |
| 5 | 3 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.1623 | 0.0942 | 0.1176 | 0.0597 |
| 5 | 5 | 1 | 0.3590 | 0.4966 | 0.5345 | 0.3548 | 0.3343 | 0.5395 | 0.3535 | 0.3893 |
| 5 | 5 | 3 | 0.4474 | 0.9442 | 0.7431 | 0.8503 | 0.4139 | 0.9825 | 0.5114 | 0.8572 |
| 5 | 5 | 5 | 0.4667 | 1.1027 | 0.7822 | 0.9352 | 0.4310 | 1.1341 | 0.5467 | 0.9344 |
| 5 | 5 | PM | 0.0000 | 0.0000 | 0.0500 | 0.0500 | 0.1859 | 0.0365 | 0.1256 | 0.0513 |

PM means that perfect matching ^{#}(^{#}(

When the interaction effect is null, some conditions for the null bias

Next, we present a bound that measures the largest bias in the estimation of the main effect. In the Methods section, we show that the bias exp(

where _{s }
_{s }
^{† }= 1) is similar to that given by Lee and Wang

In the Methods section, we also showed that under SRS, the bias exp(^{† })^{2 }and bounded below^{† })^{-2}. These are the same bounds derived by Wang et al.

and bounded below by

True type I errors

In case-control studies, one often expects that the type I errors of the association tests can be approximately controlled at some predetermined level. However, in the presence of PS or selection bias, the usual test statistic does not have a chi-square distribution under the null hypothesis. Instead, it has a non-central chi-square distribution, with non-centrality parameter depending on the level of the bias. Thus, the usual chi-square test tends to have inflated type I errors.
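As a rough numerical illustration of this inflation, the following Python sketch evaluates the rejection probability of a nominal level-α chi-square test when the statistic is in fact non-central. It assumes a test with one degree of freedom and takes the non-centrality parameter as a given input; the function name `true_type_one_error` and the example values are hypothetical and are not taken from this paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def true_type_one_error(noncentrality, alpha=0.05):
    """Rejection probability of a nominal level-alpha chi-square test (1 df)
    when the statistic follows a non-central chi-square distribution.

    Assumption: one degree of freedom, so the statistic is distributed as
    (Z + sqrt(noncentrality))**2 with Z standard normal.
    """
    if noncentrality < 0:
        raise ValueError("noncentrality must be non-negative")
    # Square root of the central chi-square critical value at level alpha.
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = noncentrality ** 0.5
    # P(|Z + shift| > z_crit) = P(Z < -z_crit - shift) + P(Z > z_crit - shift)
    return _STD_NORMAL.cdf(-z_crit - shift) + 1.0 - _STD_NORMAL.cdf(z_crit - shift)


# Hypothetical non-centrality values, chosen only for illustration.
for lam in (0.0, 1.0, 4.0, 9.0):
    print(f"non-centrality {lam:4.1f}: true type I error {true_type_one_error(lam):.4f}")
```

With a zero non-centrality parameter the function returns the nominal level, and the returned value grows as the non-centrality parameter increases.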

Suppose that the intended type I error rate of the chi-square test is

where

The corresponding true type I error of the chi-square test is given by

Conservative p-values

In most practical applications, one often does not know the true value of the non-centrality parameter and therefore it is difficult to calculate the true p-value of the chi-square test when the PS is present and/or there is selection bias. However, we are able to develop a bound for the non-centrality parameter, and the latter may be estimable in many cases. Define _{δ}
_{β}
_{β}
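One way to read the idea above is sketched below in Python: if the non-centrality parameter is known only to lie below some bound, evaluating the tail probability of the observed statistic at that bound gives a p-value that is at least as large as the one under the true parameter. The sketch assumes a one-degree-of-freedom chi-square statistic; `conservative_p_value` and its arguments are illustrative names, and the bound itself has to come from the results of this paper or from external information.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_1df_tail(stat, noncentrality=0.0):
    """P(X >= stat) for X non-central chi-square with 1 df.

    Uses the representation X = (Z + sqrt(noncentrality))**2, Z standard normal.
    """
    if stat < 0 or noncentrality < 0:
        raise ValueError("stat and noncentrality must be non-negative")
    root = stat ** 0.5
    shift = noncentrality ** 0.5
    return _STD_NORMAL.cdf(-root - shift) + 1.0 - _STD_NORMAL.cdf(root - shift)


def conservative_p_value(stat, noncentrality_bound):
    """Tail probability evaluated at an upper bound of the non-centrality.

    Assumption: the tail probability increases with the non-centrality
    parameter, so plugging in an upper bound cannot understate the p-value.
    """
    return chi2_1df_tail(stat, noncentrality_bound)


# Hypothetical observed statistic and bound, for illustration only.
observed = 15.0
print("usual p-value:       ", chi2_1df_tail(observed))
print("conservative p-value:", conservative_p_value(observed, 4.0))
```

If the conservative value is still below the chosen significance level, the test remains significant for every non-centrality parameter up to the bound.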

Examples of true biases and type I error rates

Tables _{β }
_{δ }
_{s }= (_{1}, _{2}) where _{s }was the linkage disequilibrium coefficient between loci _{s }= 0 or 0.05. We also assumed that the sampling proportions of the cases followed SRS but those of the controls might not. The rest of the parameter values were determined from the values for the variations ^{† },^{† },^{† }and ^{† }given in the tables with the assumption that subpopulation 2 has the maximal baseline ^{#}(^{† }= 5 were given in Tables ^{† }= 3 can be found from Tables S1 and S2 in Additional file

**Biases and the true type I errors of the chi-square tests**. The file contains two tables showing the biases and true type I errors of the chi-square tests when ^{† }= 3 and LD = (0,0) or LD = (0,0.05).


According to the results in Table _{β }
_{β }
^{† }= 3 or 5) and _{β }
^{† }= ^{† }
^{† }= 5. This is contrary to our usual belief that matching between cases and controls in ethnicity can eliminate the PS bias. However, except in some special cases, the biases under the perfect matching design are smaller than those under other sampling designs.
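The effect of matching can be illustrated with a small deterministic calculation. The Python sketch below is not the simulation used for the tables; it assumes two hypothetical subpopulations, binary genetic and environmental factors that are independent within each subpopulation, and a logistic risk model with no genetic main effect and no interaction. It then computes the population-level log odds ratios β* and δ* under SRS and under perfect matching. All parameter values and the name `pooled_bias` are assumptions made for illustration.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def pooled_bias(pop_share, geno_freq, expo_rate, base_risk, gamma, matched):
    """Return (beta_star, delta_star) when the true beta and delta are both 0.

    The values are the log odds ratios of the saturated model in binary G and E
    applied to the pooled case and control distributions.
    """
    cells = [(g, e) for g in (0, 1) for e in (0, 1)]
    case_by_pop, ctrl_by_pop, prevalence = [], [], []
    for s in range(len(pop_share)):
        joint_case, joint_ctrl = {}, {}
        for g, e in cells:
            p_g = geno_freq[s] if g else 1.0 - geno_freq[s]
            p_e = expo_rate[s] if e else 1.0 - expo_rate[s]
            # Risk depends on E only: no genetic main effect, no interaction.
            risk = expit(logit(base_risk[s]) + gamma * e)
            joint_case[(g, e)] = p_g * p_e * risk
            joint_ctrl[(g, e)] = p_g * p_e * (1.0 - risk)
        prev = sum(joint_case.values())
        prevalence.append(prev)
        case_by_pop.append({c: v / prev for c, v in joint_case.items()})
        ctrl_by_pop.append({c: v / (1.0 - prev) for c, v in joint_ctrl.items()})

    def normalise(weights):
        total = sum(weights)
        return [w / total for w in weights]

    # Cases are sampled by SRS; controls either by SRS or matched to the cases.
    w_case = normalise([a * k for a, k in zip(pop_share, prevalence)])
    if matched:
        w_ctrl = w_case
    else:
        w_ctrl = normalise([a * (1.0 - k) for a, k in zip(pop_share, prevalence)])

    case = {c: sum(w * d[c] for w, d in zip(w_case, case_by_pop)) for c in cells}
    ctrl = {c: sum(w * d[c] for w, d in zip(w_ctrl, ctrl_by_pop)) for c in cells}

    def log_or(cell):
        return math.log(case[cell] * ctrl[(0, 0)] / (case[(0, 0)] * ctrl[cell]))

    beta_star = log_or((1, 0))
    delta_star = log_or((1, 1)) - log_or((1, 0)) - log_or((0, 1))
    return beta_star, delta_star


# Hypothetical subpopulation parameters, for illustration only.
settings = dict(pop_share=[0.5, 0.5], geno_freq=[0.1, 0.5],
                expo_rate=[0.2, 0.6], base_risk=[0.01, 0.05])
for gamma in (0.0, 1.0):
    for matched in (False, True):
        b, d = pooled_bias(gamma=gamma, matched=matched, **settings)
        print(f"gamma={gamma:.0f} matched={matched}: beta*={b:+.4f} delta*={d:+.4f}")
```

With these hypothetical inputs, perfect matching removes the bias when there is no environmental effect (γ = 0), but a non-zero β* remains when γ = 1, which is the qualitative pattern described above.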

Wang et al. _{δ }
^{† }= 3 or 5), the true type I error rate _{δ }
_{δ }

Linkage disequilibrium between two genes, or correlation between genetic and environmental factors, plays an important role in determining the bias level in studies of interaction. According to results presented in Table _{β }
_{δ }
_{δ }

An application

Shi et al. ^{-4}, indicating a strong interaction effect. Also, from ^{† }= 4.8762. The maternal smoking rate ranged from 0.101 to 0.244 (see ^{† }= 1.968. Since maternal smoking and GSTT1 were independent in the unrelated control population (p-values of the independence test for the Denmark data and the Iowa data were 0.0942 and 0.0976, respectively), our upper bound for the bias exp(^{-2}. This suggests that the maternal smoking effect on the cleft risk can be modified by the GSTT1 genotype even when population stratification and selection bias are both present in the study.

Discussion

The impact of population stratification is considered by many to be important in case-control studies of gene-disease association. Many authors have suggested quantitative methods to control type I errors of the usual association test. The most popular treatments include the "genomic control" method

In practical applications, selection bias is not unusual. For example, the SRS condition may fail when hospital-based cases (controls) are used and are not representative of the population-based cases (controls), when many of the cases and/or controls do not respond, or when subjects select themselves into the study. In this paper, we show that under slight selection bias (^{† }= 3), the bias in the estimation of the main or interaction effect may become unacceptable. Our suggestion is that the bias should be treated seriously, even when the genetic factors are in linkage equilibrium or the genetic and environmental factors are uncorrelated. Large correlation or strong linkage disequilibrium could make the bias even larger. Also, small variation in disease risk cannot guarantee small bias unless the selection bias is also small. In applications, it is important to be able to measure the impact of the bias. In this paper, we derive some bounds for the bias. If these bounds are estimable, they can be used to make conservative inferences. We present one real example in which a conservative p-value for testing a null interaction can be computed and a significant conclusion can be reached even when there is bias. Genotype frequencies of SNPs and their LDs are readily available from the International HapMap Project. Further, disease prevalence is available from many nations or from the World Health Organization, for example. This information allows us to easily compute bounds and then conservative p-values.

We note that matching in ethnicity between cases and controls has been suggested by epidemiologists as an effective method to control the PS bias in case-control gene-disease association studies. However, in a more complicated risk model such as the one discussed here, bias (

Since the presence of PS and selection bias may cause unacceptable bias in the usual interaction analysis, it is important to have an efficient method to control the bias. Unfortunately, no effective method exists so far. The major difficulty is that the level of the bias depends on the effect sizes of other related factors, which are in general unknown or not estimable under PS. However, in some special cases, for example when the genetic main effects are null (or weak) and testing gene-gene interaction is the main focus, one may follow the idea of genomic control: type extra pairs of null markers and apply the computed interaction levels to control the bias. In principle, if the candidate markers are in linkage equilibrium, the selected pairs of null markers also need to be in linkage equilibrium so that the important characteristics of the bias can be captured. On the other hand, if the candidate markers are in linkage disequilibrium, the paired null markers also need to be correlated. We are currently working to solve this important problem. Another approach for reducing bias is to match the cases and controls in ethnicity. According to our simulations, under perfect matching and weak linkage disequilibrium, the bias in the estimation of the interaction effect is small. However, more study is needed to understand the impact of the residual bias when the matching is not perfect.

Conclusions

In this paper, the biases in the estimation of genetic main and interaction effects are quantified and their bounds are derived. We find that if there is an environmental effect or interaction, the bias in the genetic main effect cannot be ignored even when cases and controls are matched in ethnicity. The same holds for the bias in the estimation of the interaction effect. The estimated bound can be used to compute a conservative p-value for the association test. Computing the conservative p-value does not require knowledge of the number of subpopulations involved in the study or of the membership of each study subject. In real applications, it is usually not clear whether there is PS, selection bias, or both. However, if appropriate information such as the variation of genotype frequencies is known, we can always compute the conservative p-value. If the conservative p-value is smaller than the designated significance level, we can safely claim that the test is significant regardless of the presence of PS or non-SRS sampling.

Methods

Following the usual Bayesian argument, the disease-risk model implies that

where

On the other hand, the joint frequency distribution of

Thus their ratio is given by

Here, we define ^{Δ}(0,0)},

Also note that we can express

where

Define

and

Simple algebra shows that there exists some constant

Here _{M}
_{m}
_{s}
_{M}
_{m}
_{M}
_{m }
_{s }
^{† })^{2 }and bounded below by (^{† })^{-2}. However, under general sampling design, the bias is expressed as

where

Authors' contributions

KFC designed the study, performed the analysis and wrote the paper. JYL performed the computation and helped with the discussion. All authors read and approved the final manuscript.

Acknowledgements

This research was supported in part by a grant from the National Science Council and a joint research grant from China Medical University and Asia University. The authors are grateful to Jin-Hua Chen for helpful discussions and would like to thank two reviewers for their comments, which greatly improved the presentation of this paper.