Joint Program in Survey Methodology, University of Maryland at College Park, 7965 Baltimore Ave, College Park, MD, 20742, USA

Department of Mathematics, University of Texas at Arlington, 701 S Nedderman Dr, Arlington, TX, 76019, USA

Abstract

Background

This study is motivated by National Household Surveys that collect genetic data, in which complex samples (e.g. stratified multistage cluster samples), partially from the same family, are selected. In addition to the differential probabilities of selecting households and persons within the sampled households, there are two levels of correlation in the genetic data collected in National Genetic Household Surveys (NGHS). The first level of correlation is induced by the hierarchical geographic clustered sampling of households, and the second level is induced by biological inheritance among individuals sampled in the same household.

Results

To test for Hardy-Weinberg Equilibrium (HWE) in NGHS, two test statistics, the CCS method [1] and the QS method [2], appear to be the only existing methods that take account of both correlations. In this paper, I evaluate both methods in terms of the test size and power under a variety of complex designs with different weighting schemes and varying magnitudes of the two correlation effects. Both methods are applied to a real data example from the Hispanic Health and Nutrition Examination Survey with simulated genotype data.

Conclusions

The QS method maintains the nominal size well and consistently achieves higher power than the CCS method in testing HWE under a variety of sample designs, and therefore is recommended for testing HWE of genetic survey data with complex designs.

Background

This study is motivated by population-based family data collected from National Genetic Household Surveys (NGHS), in which complex samples (i.e. samples collected with stratified multistage cluster sampling), partially from the same family, are selected. There are two levels of correlation in the genetic data collected in NGHS. The first level of correlation is induced by the hierarchical geographic clustered sampling of households, and the second level is induced by biological inheritance among individuals sampled in the same household. Moreover, national household surveys often apply differential probabilities of selecting households and persons within the sampled households.

NGHS from various countries, such as the Canadian Health Measures Survey

There are at least three complications in analyzing data collected in NGHS: 1) differential population weights, 2) hierarchical geographical correlation among families, and 3) genetic correlation within families. As is known, genetic variability can differ by race

Testing Hardy-Weinberg Equilibrium (HWE) of marker genotype frequencies has been widely recommended as a crucial step in genetic association studies

Methods developed by She et al.

In Li

In this paper, we examine and compare the performance of the two methods, in terms of size and power, via Monte Carlo simulation studies under a variety of complex sample designs with differential weighting schemes and varying magnitudes of the correlation effects. We observe that the QS method maintains the nominal size relatively well and consistently achieves higher power than the CCS method in testing HWE using data collected from 2 parents and 1 offspring in NGHS. In Section “Methods”, we outline the methodology of the CCS and the QS tests. In Section “Results”, both methods are compared via simulation studies of their finite sample performance and applied to a real data example from the Hispanic Health and Nutrition Examination Survey with generated genotype data. Finally, Section “Conclusions” summarizes the findings.

Methods

Consider household surveys with stratified multistage cluster sample designs, such as those used in the National Health and Nutrition Examination Survey (NHANES). These types of sample designs are described briefly as follows. The population of individuals is subdivided into disjoint primary sampling units (PSUs), usually based on the geographic location of residence. For example, PSUs can be small cities, counties, or contiguous cities/counties. The PSUs are grouped into strata so that they are approximately homogeneous with respect to certain demographic and geographic characteristics. At the first stage of sampling, a random sample of PSUs is selected from each stratum. At the second stage, smaller geographical units, so-called secondary sampling units (SSUs), are randomly sampled from the sampled PSUs. Households/families are further randomly selected from the sampled SSUs, and at the ultimate stage individuals are randomly selected from the sampled households/families. For each sampled individual, the inclusion probability is the product of the inclusion probabilities at each stage of sampling, and the corresponding sample weight is defined as the inverse of the inclusion probability. In most surveys the sample weights also involve adjustments for nonresponse and poststratification, and can be interpreted as the number of people in the population represented by the sampled individual.
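The weight construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `base_weight` and the four stage probabilities are hypothetical, and the sketch leaves out the nonresponse and poststratification adjustments that real surveys apply.

```python
from math import prod


def base_weight(stage_probs):
    """Inverse of the overall inclusion probability (product over sampling stages)."""
    if not stage_probs or any(not (0.0 < p <= 1.0) for p in stage_probs):
        raise ValueError("each stage probability must lie in (0, 1]")
    return 1.0 / prod(stage_probs)


# Hypothetical inclusion probabilities for one person at the
# PSU, SSU, household, and person stages (illustrative values only).
stage_probs = [0.1, 0.2, 0.05, 0.5]
weight = base_weight(stage_probs)
print(f"overall inclusion probability: {prod(stage_probs):.4f}")
print(f"base sample weight: {weight:.1f}")
```

Under these assumed probabilities the sampled person stands in for roughly two thousand people in the population, which is the interpretation of the weight given in the text.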

Let there be $n_{h}$ PSUs sampled in stratum $h$. Within the sampled PSUs, data is collected on $m_{hi}$ families in PSU $i$ of stratum $h$, with $m_{hij}$ individuals selected in family $j$ of that PSU.

Corrected chi-square tests

Following She et al. [1], let $n_{l}$ denote the number of families in the $l$-th family type, with $p_{A}$ defined as the frequency of allele $A$ and $p_{a}$ the frequency of allele $a$.

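| **Familial type** | **Genotype: Parents** | **Genotype: Child** | **Joint probability** | **Count** |
|---|---|---|---|---|
| 1 | AA-AA | AA | $p_{A}^{4}$ | $n_{1}$ |
| 2 | AA-Aa | AA | $2p_{A}^{3}p_{a}$ | $n_{2}$ |
| 3 | AA-Aa | Aa | $2p_{A}^{3}p_{a}$ | $n_{3}$ |
| 4 | AA-aa | Aa | $2p_{A}^{2}p_{a}^{2}$ | $n_{4}$ |
| 5 | Aa-Aa | AA | $p_{A}^{2}p_{a}^{2}$ | $n_{5}$ |
| 6 | Aa-Aa | Aa | $2p_{A}^{2}p_{a}^{2}$ | $n_{6}$ |
| 7 | Aa-Aa | aa | $p_{A}^{2}p_{a}^{2}$ | $n_{7}$ |
| 8 | Aa-aa | Aa | $2p_{A}p_{a}^{3}$ | $n_{8}$ |
| 9 | Aa-aa | aa | $2p_{A}p_{a}^{3}$ | $n_{9}$ |
| 10 | aa-aa | aa | $p_{a}^{4}$ | $n_{10}$ |

The joint probabilities are those implied by HWE and random mating of the parents, with Mendelian transmission to the child; they sum to $(p_{A} + p_{a})^{4} = 1$.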

The departure from HWE can be tested by Pearson chi-square test statistics

where

$\mathbf{n}_{w} = (n_{1w}, n_{2w}, n_{3w}, n_{4w}, n_{5w}, n_{6w}, n_{7w}, n_{8w}, n_{9w}, n_{10w})^{T}$ with $n_{lw}$ representing the weighted number of families belonging to familial type $l$, and with $p_{A}$ and $p_{a}$ in $\boldsymbol{\pi}$ replaced by

Due to both levels of correlations and the differential sample weights under the setting of complex survey design, the test statistics

where _{0.} We also considered the following tests for comparison purposes.

where

an F version of
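As a rough illustration of the starting point for the corrected tests, the sketch below computes an uncorrected Pearson chi-square statistic from the counts of the ten familial types. It is not the corrected statistic of She et al.: it assumes independent families with no design effect, estimates the allele frequency by simple allele counting among the parents (an assumption made here for illustration), and uses a reference distribution with 10 − 1 − 1 = 8 degrees of freedom. All function names and the example counts are hypothetical.

```python
from math import exp, factorial

# Number of A alleles carried by the two parents in familial types 1..10
# (AA-AA, AA-Aa, AA-Aa, AA-aa, Aa-Aa, Aa-Aa, Aa-Aa, Aa-aa, Aa-aa, aa-aa).
PARENT_A_ALLELES = [4, 3, 3, 2, 2, 2, 2, 1, 1, 0]


def hwe_type_probs(p_A):
    """Joint probabilities of the ten 2P1O types under HWE and random mating."""
    p, q = p_A, 1.0 - p_A
    return [p**4, 2 * p**3 * q, 2 * p**3 * q, 2 * p**2 * q**2,
            p**2 * q**2, 2 * p**2 * q**2, p**2 * q**2,
            2 * p * q**3, 2 * p * q**3, q**4]


def naive_pearson(counts):
    """Uncorrected Pearson statistic and allele-frequency estimate.

    `counts` holds the (possibly weighted) number of families per type.
    """
    if len(counts) != 10:
        raise ValueError("expected counts for exactly ten familial types")
    n = sum(counts)
    p_hat = sum(c * a for c, a in zip(counts, PARENT_A_ALLELES)) / (4.0 * n)
    if not 0.0 < p_hat < 1.0:
        raise ValueError("both alleles must be observed among the parents")
    expected = [n * pi for pi in hwe_type_probs(p_hat)]
    x2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    return x2, p_hat


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    return exp(-half) * sum(half**k / factorial(k) for k in range(df // 2))


# Hypothetical counts for 1,000 families, chosen close to HWE with p_A = 0.3.
counts = [8, 38, 38, 88, 44, 88, 44, 206, 206, 240]
x2, p_hat = naive_pearson(counts)
p_value = chi2_sf_even_df(x2, df=8)
print(f"p_A estimate: {p_hat:.3f}, X2: {x2:.4f}, naive p-value: {p_value:.4f}")
```

With clustered sampling and unequal weights this naive reference distribution is no longer appropriate, which is the motivation for the corrections discussed in this section.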

Quasi score test

Li et al. _{
hij
} (=1, 2, 3…). Define **y**
_{
hijk
} = ( _{
hijk.1
},…, _{
hijk.g
},…,_{
hijk.M-1
}) for each sampled individual, where _{
hijk.g
} equals to 1 if individual-^{
T
}, where **
p
** = (

Define E(**y**
_{
hijk
}) = **μ**
_{
hijk
} with **μ**
_{
hijk
} = ( _{
hijk.1
},…, _{
hijk.g
},…, _{
hijk.M-1
}). If the genotype _{
hijk. g
} = 2(1 − _{
l
}
_{
l'
}. The estimating equations for the estimation of parameters

where **
y
**

To simplify the notation, the subscript of the **
Var
**(

where **
Var
**(

As is known, family sizes and family relationships differ across families.

To test the null hypothesis of **
S
**

is asymptotically a $\chi^{2}$ distributed variable with (

Results

Monte Carlo simulations

Let the finite population be of size _{
A
}
_{
a
}, and

In the simulations, two different sample weight distributions are employed: (1) the sample weight value of one is assigned to all the sample individuals, i.e. _{A} = 0.1, 0.3, or 0.5. We calculate the rejection rates, defined by the proportion of 1,000 simulation runs for which the p-value is less than the significance level α (=0.05), to evaluate the performance of the five tests
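The rejection rate defined above is straightforward to compute. The sketch below also gives the approximate range within which an empirical size should fall when the true size equals α, using a normal approximation to the binomial; that tolerance rule is an assumption made here for illustration, although with 1,000 runs its lower limit appears consistent with the 0.036 quoted in the table footnotes. The function names and the example p-values are hypothetical.

```python
from math import sqrt


def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulation runs whose p-value falls below alpha."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    return sum(p < alpha for p in p_values) / len(p_values)


def size_tolerance(alpha=0.05, n_runs=1000, z=1.96):
    """Approximate 95% range for an empirical size when the true size is alpha."""
    half_width = z * sqrt(alpha * (1.0 - alpha) / n_runs)
    return alpha - half_width, alpha + half_width


# Hypothetical p-values from five simulation runs (illustrative only).
p_values = [0.01, 0.20, 0.049, 0.73, 0.05]
print("rejection rate:", rejection_rate(p_values))

lower, upper = size_tolerance()
print(f"size tolerance for 1,000 runs: {lower:.3f} to {upper:.3f}")
```

Note that the comparison is strict, so a p-value exactly equal to α does not count as a rejection.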

In the first simulation study, the genetic data are correlated among family members but independent among families within PSUs. Specifically, we select 60 PSUs by simple random sampling from 2,500 PSUs. As described above, the genetic information for each pair of parents of the 40 families in each PSU is independently generated from multinomial distributions. Thus, the genetic information among the families within PSUs is independent. Table _{A} = 0.3. The results when _{A} = 0.1 or 0.5 showed a similar pattern and are therefore not shown. It can be observed that the sizes of

| Test | ^{#}**r** = | | **r** = | | **r** = | |
|---|---|---|---|---|---|---|
| | w ≡ 1 | w ≡ {1,3,5} | w ≡ 1 | w ≡ {1,3,5} | w ≡ 1 | w ≡ {1,3,5} |
| | **0.056** | 0.856 | 0.178 | 0.932 | 0.667 | 0.994 |
| | **0.049** | **0.044** | 0.163 | 0.126 | 0.646 | 0.512 |
| | **0.038** | 0.032 | 0.140 | 0.106 | 0.616 | 0.478 |
| | 0.030 | 0.021 | 0.103 | 0.085 | 0.551 | 0.408 |
| ^{**} | **0.043** | **0.047** | 0.244 | 0.209 | 0.800 | 0.697 |

^{*}Tests proposed by She et al. (2009).

^{**}Tests proposed by Li et al. (2011).

^{#}Sizes ranging from 0.036

In the second simulation study, the genetic data are correlated among family members as well as among families within PSUs. In order to introduce the correlation among families within PSUs, we generate a clustered finite population. In detail, we sort all the 100,000 families by the number of genotype

| Test | ^{#}**r** = | | **r** = | | **r** = | |
|---|---|---|---|---|---|---|
| | **w** ≡ 1 | **w** ≡ {1,3,5} | **w** ≡ 1 | **w** ≡ {1,3,5} | **w** ≡ 1 | **w** ≡ {1,3,5} |
| | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.116 | 0.111 | 0.215 | 0.210 | 0.539 | 0.535 |
| | 0.035 | 0.031 | 0.078 | 0.072 | 0.242 | 0.232 |
| | 0.027 | 0.024 | 0.065 | 0.061 | 0.218 | 0.204 |
| ^{**} | 0.076 | 0.071 | 0.215 | 0.216 | 0.668 | 0.667 |

^{*}Tests proposed by She et al. (2009).

^{**}Tests proposed by Li et al. (2011).

^{#}Sizes ranging from 0.036

By comparing results from two simulation studies (see Tables

In conclusion, the

Example from the Hispanic Health and Nutrition Examination Survey with generated genotype data

We use data from the Mexican-American part of the HHANES

The HHANES had a stratified multistage cluster sample design; see

Since the HHANES did not genotype their sampled individuals, we generated genotype data using a two-step procedure by following _{A} was generated from the Beta distribution _{
A
} ~ Beta((1 − _{
A
}/ _{
A
})/ _{
A
} takes _{
A
}, where _{
A
}). Given parental genotypes, the genotype of a child was randomly generated by the Mendelian law.
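The last step above, generating a child's genotype from the parental genotypes by Mendelian transmission, can be sketched as follows. The parental genotype probabilities use the standard parameterization with a fixation coefficient `F`; treating that as the generating model is an assumption made here for illustration, as are the function names, the seed, and the parameter values.

```python
import random


def genotype_probs(p_A, F=0.0):
    """Genotype probabilities at a biallelic locus with fixation coefficient F.

    F = 0 gives the Hardy-Weinberg proportions.
    """
    q = 1.0 - p_A
    return {
        "AA": p_A**2 + F * p_A * q,
        "Aa": 2.0 * (1.0 - F) * p_A * q,
        "aa": q**2 + F * p_A * q,
    }


def draw_parent(p_A, F, rng):
    """Draw one parental genotype from the distribution above."""
    probs = genotype_probs(p_A, F)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


def draw_child(mother, father, rng):
    """Mendelian transmission: each parent passes one allele chosen at random."""
    alleles = rng.choice(mother) + rng.choice(father)
    # Sort so that the heterozygote is always written "Aa" ('A' sorts before 'a').
    return "".join(sorted(alleles))


rng = random.Random(2012)  # fixed seed so the illustration is reproducible
trios = []
for _ in range(5):
    mother = draw_parent(0.3, 0.05, rng)
    father = draw_parent(0.3, 0.05, rng)
    trios.append((mother, father, draw_child(mother, father, rng)))
print(trios)
```

Drawing the two parents independently corresponds to random mating; correlation among families within a PSU would have to be introduced separately.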

According to the within-family sampling design of HHANES, we set the within-family weights to be 2 if the individual is 2–19 years old, 1.33 if 20–44 years old, and 1 if 45–74 years old. The final sample weight for each sampled individual was provided with the data; it is calculated as the inverse of the product of the inclusion probabilities at each stage of sampling, with nonresponse and poststratification adjustments. For construction of family-level weights, we follow

We varied the values of the fixation coefficient _{
A
} = 0.3. All the four tests take account of the correlation induced from the selection of the families, and the biological correlation within the family. Consistent with results from the simulation studies, _{
A
} are conservative, producing larger p-values than

| Test | **r** = | **r** = | **r** = | **r** = |
|---|---|---|---|---|
| | 0.347 | 0.207 | **0.000** | **0.000** |
| | 0.347 | 0.241 | **0.006** | **0.000** |
| | 0.414 | 0.329 | 0.051 | **0.013** |
| ^{**} | 0.535 | **0.014** | **0.000** | **0.000** |

^{*}Tests proposed by She et al. (2009); ^{**}Tests proposed by Li et al. (2011).

Conclusions

In this paper, we compared test statistics recently proposed by She et al.

Discussion

We originally planned to test HWE using the National Health and Nutrition Examination Survey III (NHANES III) genetic data. However, in order to access NHANES III genetic data, researchers are required to be onsite at NCHS in Hyattsville, Maryland. Being located outside the state of Maryland, we were not able to access the data. Considering that the survey components of the Hispanic Health and Nutrition Examination Survey (HHANES) are similar to those of NHANES III, we decided to apply the developed tests to HHANES with simulated genotype data. Although the genetic data are simulated, all the sampling components, such as the stratification, hierarchical clustering, family size, and family relationships, are real, and thus the analysis can still serve as a useful illustration of testing HWE in NGHS. However, we acknowledge that the use of simulated genotypes in the real data analysis is a limitation of this study.

In the simulation studies, the family members are selected by family relationship (i.e. 2P1O). In real surveys, however, individuals could be selected by their phenotypic characteristics, e.g., diseased or disease-free, which are often correlated with certain susceptible genetic variations. The magnitude of this correlation will differ depending on the susceptible genetic variations of interest. In our simulation studies (results not shown), both methods produced biased estimates of allele frequencies, and the type I error rate was inflated, when within-family selection was highly related to the genotypes. In future research, an extension of the QS estimator will be studied to account for within-family weights that are correlated with genetic variations.

Competing interests

The author declares that she has no competing interests.

Authors’ contributions

YL designed the overall study, including the sampling schemes, implemented the simulation studies, performed the analysis of HHANES with generated genotypes, and wrote the manuscript.

Acknowledgements

The author thanks Mr. Tony Tsai for the R scripts attempted for the methods of She et al. (2009) in the simulations of this study.