The Genetic Analysis Workshop (GAW) 16 Problem 3 comprises simulated phenotypes emulating the lipid domain and its contribution to cardiovascular disease risk. For each replication there were 6,476 subjects in families from the Framingham Heart Study (FHS), with their actual genotypes for Affymetrix 550 k single-nucleotide polymorphisms (SNPs) and simulated phenotypes. Phenotypes were simulated at three visits, 10 years apart. There are up to six "major" genes influencing variation in high- and low-density lipoprotein cholesterol (HDL, LDL) and triglycerides (TG), and 1,000 "polygenes" simulated for each trait. Some polygenes have pleiotropic effects. The locus-specific heritabilities of the major genes range from 0.1 to 1.0%, under additive, dominant, or overdominant modes of inheritance. The locus-specific effects of the polygenes range from 0.002 to 0.15%, with effect sizes selected from negative exponential distributions. All polygenes act independently and have additive effects. Individuals in the upper tail of the LDL distribution were designated as medicated; the proportion of medicated subjects increased across visits (2%, 5%, and 15%). Coronary artery calcification (CAC) was simulated using age, lipid levels, and CAC-specific polymorphisms. The risk of myocardial infarction before each visit was determined by CAC and its interactions with smoking and two genetic loci. Smoking was simulated to be commensurate with rates reported by the Centers for Disease Control. Two hundred replications were simulated.
The Framingham Heart Study (FHS) is a rich platform for the study of cardiovascular disease and the application of novel, imaginative analytic strategies. For Genetic Analysis Workshop (GAW) 16, we used a semi-simulated approach, taking actual genotypes from the 500 k Affymetrix platform and the 50 k candidate gene chip and building phenotypes on the observed genetic variation. Because blood lipid levels are a major risk factor in the development of cardiovascular disease, we modeled disease risk on the lipid pathway, including both genetic and environmental determinants. The FHS has reported that long-term averages of low-density lipoprotein (LDL), high-density lipoprotein (HDL), and triglyceride (TG) levels were highly heritable (0.66, 0.69, and 0.58, respectively). Several familial studies also have reported heritabilities for LDL of 0.50, HDL of 0.54, and TG of 0.39. Dyslipidemia, as a fundamental component of the atherosclerotic process, is a medically correctable risk factor with established efficacious treatments for reducing risk of coronary heart disease. Thus, we included in our simulation the use and effects of dyslipidemic medications, which have an important role in shaping lipid profiles. This simulation builds on the long tradition of previous simulations for Genetic Analysis Workshops [5,6].
The FHS pedigrees, distributed as GAW16 Problem 2, formed the basis of our simulation. In total, there were 6,476 subjects who had genotypes and simulated phenotypes. After the simulations began, additional FHS subjects provided broad consent for data sharing; these additional subjects were not included in the simulations. To ensure that analyses use data comparable to those that were simulated, we provided a file that defined precisely which subjects were included and their relationships within families. The ~550 k measured single-nucleotide polymorphism (SNP) genotypes distributed for GAW16 Problem 2, from both the genome-wide scan and the additional candidate gene platform (GeneChip® Human Mapping 500 k Array Set (Nsp and Sty) and the 50 k Human Gene Focused Panel), comprised the genotypes for GAW16 Problem 3. Novel fictitious phenotypes were simulated for these subjects.
Although family members of the FHS attended various exams at different times, depending on the generation, we modeled our study as if all subjects were recruited at one time: we calculated the family members' relative ages at one particular exam and then assigned a simulated age to everyone at three time points, 10 years apart. The mean age in years (range) for the simulation, by generation and visit, is shown in Table 1.
Table 1. Mean ages of the simulated data (mean, minimum, and maximum age in years)
The simulation model is depicted in Figure 1. There are up to six "major" genes for the lipid phenotypes HDL, LDL, and TG, and 1,000 polygenes for each trait. Several polygenes have pleiotropic effects (i.e., they affect two or three traits simultaneously). The identity and effects of the major genes are documented in Table 2. The locus-specific heritabilities of the major genes range from 0.1 to 1.0% under additive (AA:AB:BB, 0:0.5:1), dominant (AA:AB:BB, 1:1:0), or overdominant (AA:AB:BB, 0:1:0; heterozygotes show a higher effect than the two homozygotes) modes of inheritance, with minor allele frequencies of at least 5%, with one exception (β4), for which the minor allele frequency was 1%. We simulated an overdominant effect (γ1) because there appears to be evidence supporting this possibility and this mode of inheritance is rarely, if ever, modeled. The gene α4 is pleiotropic for HDL and TG and interacts with β5 in determining LDL (Figure 1). The interaction accounts for 0.7% of the trait variance, and β5 has no marginal effect on any phenotype. The locus-specific effects of the polygenes were on average an order of magnitude smaller, ranging from 0.002 to 0.15%, with effect sizes drawn from negative exponential distributions. All polygenes act independently and have additive effects. HDL, TG, and LDL share 40% of their polygenes, and HDL and TG share an additional 20%. The specific identities of the polygenes, their locations, and their generating effect sizes are provided in Additional Files 1, 2, and 3, corresponding to HDL, LDL, and TG, respectively. A group of 39 polygenes influencing HDL was clustered within 0.5 Mb on chromosome 11; otherwise, the polygenes for each trait are randomly distributed throughout the genome. The overall effect of each trait-specific polygenic component was scaled to achieve the target total trait heritabilities of 60%, 55%, and 40% for HDL, LDL, and TG, respectively.
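The three modes of inheritance quoted above can be written as a small lookup table. The following is a minimal sketch in Python, not the simulation code itself; the names `MODE_SCORES` and `genotype_score` are hypothetical, and only the AA:AB:BB weights come from the text.

```python
# Genotype weights (AA:AB:BB) for each mode of inheritance, as quoted in the text.
# The dictionary and function names are illustrative assumptions.
MODE_SCORES = {
    "additive": {"AA": 0.0, "AB": 0.5, "BB": 1.0},
    "dominant": {"AA": 1.0, "AB": 1.0, "BB": 0.0},
    "overdominant": {"AA": 0.0, "AB": 1.0, "BB": 0.0},
}


def genotype_score(genotype, mode):
    """Return the weight of a genotype under the given mode of inheritance."""
    # Allele order carries no meaning here, so "BA" is treated as "AB".
    key = "".join(sorted(genotype.upper()))
    return MODE_SCORES[mode][key]
```

For example, `genotype_score("AB", "overdominant")` gives the heterozygote weight of 1, while both homozygotes score 0 under that mode.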
The remaining variance is uncorrelated among family members, with the exception of a simulated dietary effect (variable: diet) on TG levels that accounts for a correlation of 0.05 among family members, regardless of their coefficient of relationship. The phenotypes generated from this genetic model were scaled to the empirically derived means and variances for the actual HDL, LDL, and TG traits within 13 age strata (in 5-year intervals) and sex as follows:
Figure 1. The Genetic Analysis Workshop 16 Problem 3 diagram. Figure 1 shows simulated phenotypes emulating the lipid domain (HDL, LDL, TG, and CHOL) and its contribution to cardiovascular disease risk (CAC and MI). Simulated major genes are symbolized with Greek letters. There are 1,000 polygenes for each of the traits HDL, LDL, and TG, several of them with pleiotropic effects. Solid lines and arrows show causality/interaction (I); dashed lines show pharmacogenetic effects, which apply only to subjects treated with medication and for whom response depended on genotype. Environmental factors such as diet, smoking, and medication were modeled in the simulation.
Table 2. Summary characteristics of the major genes and polygenes for traits HDL, LDL, and TGa
where, for example, μ(HDL | 5-year age interval, sex) represents the mean of HDL in FHS, given a 5-year age interval and sex; σ(HDL | 5-year age interval, sex) is the standard deviation of HDL in FHS, given a 5-year age interval and sex; h_α1 is the square root of the simulated heritability for the α1 SNP (as described in Table 2); a_α1 is a simulated effect that reflects in part the penetrance of the α1 SNP; sign is a random integer that takes the value -1 or +1, with the purpose of randomly changing the direction of each polygene's contribution; a_poly,k represents the effect of the k-th of the 1,000 SNPs (k = 1 to 1,000) selected as polygenes for HDL; h_poly,k is the square root of the heritability of the k-th of these 1,000 SNPs; a_ε represents the environmental effect that contributes to HDL; and h_ε is the square root of the HDL variance explained by environmental causes.
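The scaling equation that these definitions refer to does not appear in this text. As a hedged sketch only, one form consistent with the definitions above is a stratum-specific mean plus a stratum-specific standard deviation times a weighted sum of standardized effects. The index j over the major genes of the trait, the per-polygene sign subscript, and the purely linear combination are assumptions, not the authors' stated formula.

```latex
\mathrm{HDL} \;=\; \mu(\mathrm{HDL} \mid \text{age}, \text{sex})
  \;+\; \sigma(\mathrm{HDL} \mid \text{age}, \text{sex})
  \left[ \sum_{j} h_{\alpha_j}\, a_{\alpha_j}
  \;+\; \sum_{k=1}^{1000} \mathrm{sign}_k \, h_{\mathrm{poly},k}\, a_{\mathrm{poly},k}
  \;+\; h_{\varepsilon}\, a_{\varepsilon} \right]
```

Under this reading, if each effect a has unit variance and the effects are independent, the squared h terms are the fractions of trait variance explained by each component, which matches their description as square roots of heritabilities.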
As individuals progressed to the next visit 10 years later, their phenotypes were scaled by the appropriate age-sex means and variances, but there are no genes governing longitudinal trends per se. Instead, we simulated the complicating effects of medication. The simulated value for LDL at each visit for each subject was checked, and individuals in the upper tail of the distribution were simulated as medicated. The proportion of medicated subjects increased across visits, comprising 2%, 5%, and 15% of the subjects at Visits 1, 2, and 3, respectively. These proportions were estimated from the FHS data and reflect the secular increase in the proportion of individuals being treated for elevated cholesterol levels. The response to treatment is governed by two loci (δ1 and δ2) acting as pharmacogenetic processes. The δ1 variant has an additive marginal effect on both HDL and TG levels; in addition, individuals who are homozygous for the minor allele are non-responders to treatment. Responders (homozygotes and heterozygotes for the major allele) exhibit a 10% increase in their HDL levels and a 15% decrease in TG levels. Similarly, δ2 is a variant with an additive marginal effect on LDL, and homozygotes for the minor allele are non-responders to treatment. Responders exhibit a 30% decrease in LDL levels. Total cholesterol (CHOL) level is calculated as 0.8*(HDL + LDL + TG/5), and has no independent genetic effects except those influencing the component phenotypes.
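The medication step can be illustrated with a short Python sketch. It assumes that the upper tail is taken over the whole sample at a given visit and that the treatment response is applied multiplicatively to the untreated values; neither detail is stated in the text. All function and parameter names are hypothetical. The percentages and the CHOL formula are taken from the paragraph above.

```python
def medicated_flags(ldl, fraction):
    """Flag subjects in the upper `fraction` of the LDL distribution as medicated."""
    n_treated = round(len(ldl) * fraction)
    # Subject indices ordered from highest to lowest LDL.
    order = sorted(range(len(ldl)), key=lambda i: ldl[i], reverse=True)
    treated = set(order[:n_treated])
    return [i in treated for i in range(len(ldl))]


def apply_treatment(hdl, ldl, tg, delta1_minor_hom, delta2_minor_hom):
    """Apply the treatment response for one medicated subject.

    delta1_minor_hom / delta2_minor_hom are True for minor-allele
    homozygotes, who are non-responders at that locus.
    """
    if not delta1_minor_hom:
        hdl = hdl * 1.10  # responders: 10% increase in HDL
        tg = tg * 0.85    # responders: 15% decrease in TG
    if not delta2_minor_hom:
        ldl = ldl * 0.70  # responders: 30% decrease in LDL
    return hdl, ldl, tg


def total_cholesterol(hdl, ldl, tg):
    """CHOL as given in the text: 0.8 * (HDL + LDL + TG / 5)."""
    return 0.8 * (hdl + ldl + tg / 5)
```

With fractions of 0.02, 0.05, and 0.15, `medicated_flags` would reproduce the stated proportions at Visits 1, 2, and 3 under these assumptions.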
Coronary artery calcification (CAC) was simulated as a quantitative phenotype that takes many years to develop. For this reason, CAC was modeled in two stages. First, age-independent CAC (CACAI) was modeled as a function of total CHOL, HDL, and five other genes (τ1-τ5) having direct effects on its development. CACAI was simulated under the model
where ME is the joint genetic effect of an epistatic interaction between τ1 and τ2, in which the effect of τ1 is purely epistatic (i.e., τ1 displays only a minimal main effect) while τ2 displays an additional measurable additive main effect; PE is the joint effect of τ3 and τ4, a pair of purely epistatic SNPs, each with no main effect; Het is the effect of τ5, a SNP that displays heterosis (over-dominance); and ε is the residual variation not explained by the factors mentioned above. The term ε, 300 times a random draw from a normal distribution with mean 0 and variance 1 (300 × N(0,1)), represents the sum of normal deviations from the mean of each of the modeled genetic effects and "noise" from unmeasured environmental and genetic effects. Because CAC cannot be negative, CACAI was set to 0 if the generated value was negative. The models for the effects on CACAI due to the ME and PE genotypes are illustrated in Tables 3 and 4. The minor allele frequency (MAF) for each of the four SNPs τ1-τ4 is ~0.5. SNP τ5, which determines the Het effect, has a MAF of 0.2. SNP τ5 genotype 1/1 (common homozygote) increases CACAI on average by 25, genotype 1/3 decreases CACAI by 100, and genotype 3/3 increases CACAI by 400. CAC is derived from CACAI by using a piecewise linear age adjustment: subjects under 20 years have not developed measurable levels of CAC, CAC buildup is linear from the ages of 20 to 60, and for subjects older than 60, CAC = CACAI. Table 5 lists estimates of the proportion of the variability of CAC attributable to each of the genetic factors, averaged over the 200 replicate datasets.
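The truncation, the τ5 effect, and the age adjustment described above can be sketched in Python as follows. The text states only that buildup is linear between ages 20 and 60, so the ramp used here (0 at age 20, rising to the full CACAI value at age 60) is an assumption, and the function names are hypothetical.

```python
# Mean shifts in CACAI by tau5 genotype, as quoted in the text.
HET_EFFECT = {"1/1": 25.0, "1/3": -100.0, "3/3": 400.0}


def het_effect(tau5_genotype):
    """Return the mean CACAI shift for a tau5 genotype."""
    return HET_EFFECT[tau5_genotype]


def truncate_cacai(raw_value):
    """CACAI cannot be negative: negative generated values become 0."""
    return max(0.0, raw_value)


def age_adjusted_cac(cacai, age):
    """Piecewise linear age adjustment of CACAI (ramp shape is an assumption)."""
    if age < 20:
        return 0.0  # no measurable CAC before age 20
    if age > 60:
        return cacai  # full age-independent value after age 60
    # Assumed linear ramp: 0 at age 20, cacai at age 60.
    return cacai * (age - 20) / 40.0
```

Under this assumed ramp, a 40-year-old subject with CACAI of 200 would have a CAC of 100.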
Table 3. Mean effects of ME (τ1 and τ2) on CACAI
Table 4. Mean effects of PE (τ3 and τ4) on CACAI
Table 5. Proportion of explained variability for the genetic factors contributing to CACa (by visit)
Whether a subject smoked during the period before a visit influenced the risk of a myocardial infarction (MI). At the first visit, men had a 27% chance of being smokers and women had a 23% chance. Each smoker had an 8% chance of permanently quitting smoking before each subsequent visit. The resulting smoking rates are commensurate with rates reported by the Centers for Disease Control for 1998. The risk of an MI before each visit is determined by CAC and its interactions with smoking and two genetic loci, φ1 and φ2. No MIs were fatal in our data. Smoking and φ1 have an interactive effect on risk of MI. The effect of smoking is to constrict blood vessels, thus increasing the risk that CAC will lead to an MI. The risk of MI for a smoker with the most common φ1 genotype (3/3) is the same as that of an equivalent non-smoker whose CAC is 10% higher. The risk of MI for a smoker with either of the other φ1 genotypes is the same as that of a non-smoker whose CAC is 40% higher. The φ1 genotype has no effect on risk of MI in non-smokers. Carrying the most common φ2 genotype (3/3) has the same effect on risk of MI as reducing CAC by 5%. The effect of any other genotype is the same as increasing CAC by 5%. The final model for MI risk is
where ∂smoke is the joint effect of smoking and φ1 (0 if a non-smoker, 0.1 if a smoker with genotype 3/3 at φ1, 0.4 if a smoker with another genotype); and the value of the event variable is -0.05 if the φ2 genotype is 3/3 and 0.05 otherwise. The MAFs for φ1 and φ2 are ~0.3. MI_risk was calculated for each visit and a draw from a uniform distribution determined whether the risk resulted in an MI. The SNPs for CAC and MI event were chosen from the 50 k SNPs in the Gene Focused Panel based on desired MAF, completeness of genotyping, and lack of linkage disequilibrium between the SNPs. The specific identities of the SNPs τ1-τ5, φ1 and φ2, and their chromosomes are listed in Table 6.
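The smoking process and the CAC-equivalent modifiers can be illustrated with a short Python sketch. The MI risk equation itself is not reproduced in this text, so the sketch stops at an "effective CAC" and assumes that the smoking and φ2 terms combine additively as CAC × (1 + smoking term + event term). It also assumes that non-smokers never start smoking. Both assumptions, and all function names, are hypothetical.

```python
import random


def initial_smoker(sex, rng):
    """Draw smoking status at Visit 1: 27% for men, 23% for women."""
    p = 0.27 if sex == "M" else 0.23
    return rng.random() < p


def next_visit_smoker(smoker, rng):
    """Each smoker quits permanently with probability 0.08 before the next visit."""
    # Assumption: non-smokers never take up smoking.
    return smoker and rng.random() >= 0.08


def expected_smoking_rate(initial_rate, visit):
    """Expected smoking prevalence at a visit (1, 2, or 3) under the quit rule."""
    return initial_rate * (1 - 0.08) ** (visit - 1)


def effective_cac(cac, smoker, phi1, phi2):
    """CAC-equivalent after the smoking/phi1 and phi2 modifiers (assumed additive)."""
    if not smoker:
        smoke_term = 0.0
    elif phi1 == "3/3":
        smoke_term = 0.1
    else:
        smoke_term = 0.4
    event_term = -0.05 if phi2 == "3/3" else 0.05
    return cac * (1 + smoke_term + event_term)
```

For instance, `expected_smoking_rate(0.27, 3)` is 0.27 × 0.92², about 22.9% of men still smoking at Visit 3 under these assumptions.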
Table 6. SNPs contributing to CAC and MI event
Results and discussion
The simulated phenotype files are named simphen#.txt, where # is the replication number, from 1 to 200. The simulated data are archived in dbGaP at the National Center for Biotechnology Information under the name "GAW16 Framingham and Simulated Data". The 200 replications of the data include the indexing variable "shareid", which matches exactly the shareid of the Framingham Heart Study and can be used to merge the simulated phenotypic data with the FHS genotypic data. The phenotypic variables provided are sex, simage (simulation age), diet, rx (antihyperlipidemic medication use), LDL, HDL, TG, CHOL, CAC, SMOKE, and MIevent, each associated with a number (1, 2, or 3) identifying whether the variable was simulated for Visit 1, Visit 2, or Visit 3, respectively.
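A minimal Python sketch of how the file naming and the shareid merge might be handled is given below. The file name pattern and the shareid key come from the text; the exact column naming (the visit number appended directly to the variable name, e.g. LDL1) and the record layout are assumptions, and the function names are hypothetical.

```python
def simphen_filename(replicate):
    """File name for a replicate, following the simphen#.txt pattern."""
    if not 1 <= replicate <= 200:
        raise ValueError("replicate must be between 1 and 200")
    return f"simphen{replicate}.txt"


def visit_variable(name, visit):
    """Column name for a variable at a visit (assumed form: name + visit number)."""
    if visit not in (1, 2, 3):
        raise ValueError("visit must be 1, 2, or 3")
    return f"{name}{visit}"


def merge_on_shareid(phenotypes, genotypes):
    """Inner-join two lists of dict records on the 'shareid' key."""
    geno_by_id = {row["shareid"]: row for row in genotypes}
    return [
        {**geno_by_id[row["shareid"]], **row}
        for row in phenotypes
        if row["shareid"] in geno_by_id
    ]
```

Subjects present in only one of the two inputs are dropped by this join, which mirrors the restriction of the simulation to subjects with both genotypes and simulated phenotypes.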
We tested all the simulated traits and causative SNP heritabilities as well as the respective association models. Analyzing and interpreting data obtained as part of a genome-wide association study presents numerous challenges, as well as the promise of improved understanding of the genetic factors influencing complex traits. For validation and a detailed analysis of the simulated model, see the Online Supplemental Materials for GAW16. Many genome-wide association studies have been published recently, and many more are being carried out on virtually every conceivable phenotype of biomedical or public health importance. While the rate of development of genetic technologies has propelled us to this point, development and evaluation of statistical and analytic techniques are still underway, with many issues not yet satisfactorily resolved. Nonetheless, important discoveries have been reported. We hope that the simulated GAW16 Problem 3 provides data with which investigators can test the strengths and limitations of their statistical analytic approaches and software.
List of abbreviations used
CAC: Coronary artery calcification; CHOL: Cholesterol; FHS: Framingham Heart Study; GAW: Genetic Analysis Workshop; HDL: High-density lipoprotein; LDL: Low-density lipoprotein; MAF: Minor allele frequency; MI: Myocardial infarction; SNP: Single-nucleotide polymorphism; TG: Triglyceride.
The authors declare that they have no competing interests.
All the authors contributed equally.
This work was partially supported by NIH grants HL08770003, HL08768803, 1RR02499203, DK06833603, and HL08821502. The authors are grateful for the continuous interactions with the GAW16 Steering Committee, especially Jean MacCluer and Laura Almasy; with the Framingham Heart Study, especially collaborators L Adrienne Cupples and Larry Atwood; and with the NIH/NCBI, especially Cashell Jaquish and Michael Feolo. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
Ann Intern Med 1961, 55:33-50.
Kathiresan S, Manning AK, Demissie S, D'Agostino RB, Surti A, Guiducci C, Gianniny L, Burtt NP, Melander O, Orho-Melander M, Arnett DK, Peloso GM, Ordovas JM, Cupples LA: A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study.
Kraja AT, Rao DC, Weder AB, Cooper R, Curb JD, Hanis CL, Turner ST, de Andrade M, Hsiung CA, Quertermous T, Zhu X, Province MA: Two major QTLs and several others relate to factors of metabolic syndrome in the family blood pressure program.
Curr Met Chem Cardiovasc Hematol Agents 2005, 3:187-193.
Genet Epidemiol 2001, 21(suppl 1):S332-S338.
dbGaP: Genotypes and Phenotypes. GAW16 Framingham and Simulated Data Study Accession: phs000128.v2.p2 [http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000128.v2.p2]