Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università di Napoli “Federico II” - Università di Salerno, Italy

Dipartimento di Biologia e Patologia Cellulare e Molecolare “L. Califano”, Università di Napoli “Federico II”, Napoli, Italy

Dipartimento di Scienze Fisiche, Università di Napoli “Federico II”, Complesso Universitario di Monte S.Angelo, Napoli, Italy

INFN Sezione di Napoli, Napoli, Italy

Abstract

Background

The analysis of complex diseases is an important problem in human genetics. Because multifactoriality is expected to play a pivotal role, many studies are currently focused on collecting information on the genetic and environmental factors that potentially influence these diseases. However, there is still a lack of efficient and thoroughly tested statistical models that can be used to identify implicated features and their interactions. Simulations using large biologically realistic data sets with known gene-gene and gene-environment interactions that influence the risk of a complex disease are a convenient and useful way to assess the performance of statistical methods.

Results

The Gene-Environment iNteraction Simulator 2 (GENS2) simulates gene-gene and gene-environment interactions among two genetic factors and one environmental factor, and also allows for epistatic interactions. GENS2 is based on data with realistic patterns of linkage disequilibrium, and imposes no limitations either on the number of individuals to be simulated or on the number of non-predisposing genetic/environmental factors to be considered. To make the simulator more intuitive, the input parameters are expressed as standard epidemiological quantities. GENS2 is written in Python and takes advantage of operators and modules provided by the simuPOP simulation environment. It can be used through a graphical or a command-line interface and is freely available from

Conclusions

Data produced by GENS2 can be used as a benchmark for evaluating statistical tools designed for the identification of gene-gene and gene-environment interactions.

Background

Most of the common human diseases with high mortality rates (such as cancer, heart disease, obesity, diabetes, and several common psychiatric and neurological conditions) are classified as complex diseases

Gene-environment interactions (G×E) are expected to influence complex phenotypes, for example, disease risk. Hence individuals with predisposing genetics are more likely to develop a disease when exposed to a damaging environment than individuals, exposed to the same environment, without predisposing genetics

Complex phenotypes are regulated by pathways and biochemical mechanisms that involve many genetic products. Hence, in addition to interactions among genes and environment, interactions among different genetic loci (G×G) can also influence disease risk. In particular, G×G are defined as epistatic when the allelic variations of one gene alter the effect of variations of another gene

Surprisingly, despite the general agreement on the relevance of G×E and G×G for correct disease risk estimations, only a few epidemiological studies have attempted to identify them. Indeed, studying the complex interactions among risk factors is a daunting task that requires large data sets and specific research designs. Furthermore, the best statistical method for the identification of G×G and G×E in case-control data sets ^{2} are generally inappropriate and tend to lead to an over/underestimation of disease risk

A possible strategy to assess the performances of statistical methods is to test them against simulated data sets where the relevant features influencing the disease risk are known (for a review of genetic simulators see

1. a Multi-Logistic Model (MLM) that can model any type of G×G and G×E,

2. a mathematical approach (Knowledge Aided Parameterization System, KAPS) that can translate biological and epidemiological information to MLM parameters, and

3. GENS (Gene Environment iNteraction Simulator), a software that produces simulated data sets.

With that approach, only interactions between one genetic and one environmental factor could be simulated; therefore, it was not possible to account for epistatic G×G. Moreover, all simulated loci were considered to be independent and thus it was not possible to account for linkage disequilibrium (LD)

In the present paper, we describe an extension of the previous model that overcomes such limitations using a new strategy that simulates up to two-genes×one-environment interactions with the possible inclusion of epistasis. Importantly, the present algorithm can be easily extended to manage more than two genetic factors and one environmental factor. However, to simplify the design of biologically meaningful interactions, we limited the number of features (see the Discussion section for details). Furthermore, the inclusion of two genetic factors (with epistatic interaction) that in turn interact with a continuous environmental factor greatly increased the complexity of the model. Indeed, statistical methods that can deal with even two genetic factors are still far from being functionally useful for real, large data sets

Implementation

GENS2 workflow

Figure

**GENS2 work flow.** Chart of the steps that were used to simulate a complex disease in a population using the simuPOP and GENS2 systems.

| **Task** | **Required parameters** | **Description** |
|---|---|---|
| **simuPOP** | | |
| 1) Starting data (HapMap) | Chromosomes, chromosome regions, or markers and marker distance | The genomic regions containing the loci that will be simulated |
| | Population (ethnicity) | The starting frequency and linkage data to be used in the simulation |
| 2) Simulation of sample’s genetic data | DPLs (Disease Predisposing Loci) | Loci that will influence the disease risk |
| | Target allelic frequency | Final allelic frequencies at the end of the simuPOP simulation |
| | Final sample size | Number of individuals that compose the population generated by simuPOP |
| **GENS2** | | |
| Starting sample | simuPOP generated sample | Sample data generated with simuPOP |
| | Disease prevalence | The expected disease prevalence in the whole sample |
| Environment | Environmental factor distribution | Distribution of the environmental exposure in the whole sample |
| | Environmental factor OR | Odds ratio associated with a one-unit increase of the environmental exposure |
| | Noisy environmental variables | As many confounding environmental exposures as desired, not associated with the disease risk (Gaussian, binomial or uniform distributed) |
| Genetics | DPLs | These are the same DPLs as selected in the simuPOP simulation |
| | High risk alleles | The allele, for each DPL, associated with the highest disease risk |
| | DPLs genotypic RR | The relative risk of the high risk homozygote versus the low risk homozygote, for each DPL |
| | Dominance | The relationship of the risk associated with the heterozygote with that associated with the homozygotes (recessive, dominant, codominant) |
| | Epistasis model (G×G) | Percent increase of the risk associated with each combined genotype |
| Gene Environment interaction | G×E model | One of the four predefined interaction models |

Generation of the synthetic data set

The generation of the starting sample is carried out by a series of simuPOP scripts that:

• download phased genomic data from the HapMap public database

• select a subset of SNPs or entire genomic regions, and

• let the population evolve until it reaches the desired size and frequencies for some disease predisposing loci (DPLs).

To obtain a synthetic data set, simuPOP drives a forward-time simulation to obtain a population that is composed of the desired number of individuals and genotypic frequencies for all the markers. The use of this simulator helps to retain genetic realism in the final population, in particular with respect to the patterns of LD (for a detailed description of this process, please see

Definition of the penetrance model

The second branch of the simulation procedure (right side of Figure

• the expected prevalence of the disease in the sample,

• the

• the allelic frequencies of DPLs (calculated automatically from the input population),

• the effect on disease risk of each DPL in terms of the relative risk of the high risk homozygote compared with the other homozygote,

• the dominance relation of each DPL (W), expressed as a number in the interval [0, 1] (W=0 dominant, W=1 recessive, 0<W<1 codominant),

• the distribution parameters and the effect of the environmental factor on disease risk, expressed as odds ratio (OR) of the risk related to one-unit increase in the exposure.
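As a rough illustration of how the epidemiological inputs listed above constrain genotype penetrances, the following sketch handles a single DPL and ignores the environmental factor. It is not the KAPS2 procedure: the Hardy-Weinberg genotype frequencies, the linear interpolation used for W, the example prevalence and allele frequency, and the function name `genotype_penetrances` are all assumptions made for this example.

```python
def genotype_penetrances(prevalence, high_risk_freq, rr, w):
    """Return (genotype frequencies, penetrances) for one DPL.

    Genotypes are ordered: low-risk homozygote, heterozygote,
    high-risk homozygote.

    Assumptions (not taken from the paper): Hardy-Weinberg genotype
    frequencies, and a heterozygote risk interpolated linearly between
    the two homozygotes, with w=0 dominant and w=1 recessive.
    """
    q = high_risk_freq
    p = 1.0 - q
    freqs = (p * p, 2.0 * p * q, q * q)
    # Risks expressed relative to the low-risk homozygote.
    relative = (1.0, rr - w * (rr - 1.0), rr)
    # Scale so that the frequency-weighted mean risk equals the prevalence.
    baseline = prevalence / sum(f * r for f, r in zip(freqs, relative))
    penetrances = tuple(baseline * r for r in relative)
    if max(penetrances) > 1.0:
        raise ValueError("these parameters imply a penetrance above 1")
    return freqs, penetrances


# RR=1.6 and W=0.5 are the values used in the tests reported in the paper;
# the prevalence (0.1) and allele frequency (0.3) are illustrative only.
freqs, pen = genotype_penetrances(prevalence=0.1, high_risk_freq=0.3, rr=1.6, w=0.5)
print([round(value, 4) for value in pen])
```

The scaling step is what ties the three penetrances to the prevalence: once the relative risks are fixed, only one baseline value reproduces the requested average risk.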

KAPS2 also requires G×E and G×G models when two DPLs are provided. In particular:

i) G×E models are chosen from a set of four predefined models, two models of interaction between DPLs and the environment, and two special models in which there is no gene-environment interaction but in which only one genetic or environmental factor contributes to the disease risk (see Table

| **Interaction model** | **Description** |
|---|---|
| Genetic Model (GEN) | Disease risk depends only on the genetics of an individual |
| Environmental Model (ENV) | Disease risk depends only on the environmental exposure of an individual |
| Gene Environment interaction Model (GEM) | The genetics modifies the effect of the environment in modulating the disease risk |
| Additive Model (ADD) | The effects of environment and genetics are independent and sum in modulating the disease risk |

ii) G×G models (epistasis) are accepted in the form of percentage variations on the risk associated with a maximum number of three (out of the possible nine) combined genotypes.

KAPS2 converts population features and G×E and G×G models into the corresponding parameters of the MLM in two steps. First, starting from the provided epidemiological parameters, KAPS2 calculates the penetrance of each combined genotype assuming no interaction between the genotypes of each locus. Epistasis (if defined) is then modeled through a deformation procedure, reflecting G×G variations, of the set of penetrance values that keeps it coherent with the user-defined epidemiological parameters. In this step, when there is more than one way to change the values of the set (i.e., fewer than three epistatic variations are provided), a mathematical optimization system is employed to find the deformation characterized by the smallest variation on the values not constrained by user-defined epistatic variations. An example of the results of the epistasis application is presented in Figure

**Example of application of epistasis.** Disease penetrance for combined genotypes before (left panel) and after (right panel) the application of an epistasis model with an increment of 20% of the risk associated with the (CC-TT) composed genotype. The x- and y- axes plot the reported genotypes of the two DPLs; the z-axis plots the risk associated with each combined genotype.

Subsequently, for each combined genotype, KAPS2 computes the coefficients of a penetrance function of the environmental exposure that is associated with the combined genotype in the MLM. In this step G×E are also modeled: the additive model (ADD) assumes that combined genotypes with higher penetrance have a higher basal disease risk, while the risk associated with the environmental factor is simply added. In the modulative model (GEM), on the other hand, combined genotypes with higher penetrance have the same basal risk but are more 'sensitive' to the effect of the environment (see the Methods section).
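The difference between the two models can be made concrete with a small sketch. It assumes that each combined genotype has a logistic penetrance function with an intercept (basal risk) and a slope (sensitivity to the exposure), and that "added" means additive on the logit scale. The coefficient values, the dictionary layout and the name `logistic_risk` are hypothetical and chosen only for illustration; only the environmental OR of 1.2 comes from the tests reported in the paper.

```python
import math


def logistic_risk(alpha, beta, exposure):
    """Disease risk for one combined genotype at a given exposure level."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * exposure)))


LOG_OR = math.log(1.2)  # slope for an environmental OR of 1.2 per unit

# Hypothetical (intercept, slope) pairs for a lower-risk and a higher-risk
# combined genotype; the numbers are illustrative only.
ADD = {"lower": (-2.5, LOG_OR), "higher": (-2.0, LOG_OR)}        # basal risk differs
GEM = {"lower": (-2.5, 0.5 * LOG_OR), "higher": (-2.5, LOG_OR)}  # sensitivity differs

for name, model in (("ADD", ADD), ("GEM", GEM)):
    for exposure in (0.0, 2.0, 4.0):
        lower = logistic_risk(*model["lower"], exposure)
        higher = logistic_risk(*model["higher"], exposure)
        print(f"{name} x={exposure:.0f} lower={lower:.3f} higher={higher:.3f}")
```

In this sketch the two genotypes differ at every exposure level under ADD, whereas under GEM they coincide at zero exposure and diverge as the exposure grows.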

Disease risk of an individual

In the final step the two branches of the procedure (Figure

Software

To create simulated populations, we employed an existing tool, simuPOP, together with our implementation of the algorithm described above. Using simuPOP it is possible to drive a forward-time simulation that results in a population composed of the desired number of individuals and having specified genotype frequencies for a set of markers. To be usable in GENS2, populations should be created in simuPOP as described previously

GENS2 accepts as input a population produced by SimuPOP and the

On the basis of the selected type of G×G and G×E, GENS2 calculates the coefficients of the MLM as described in the Methods section.

For each individual, GENS2 assigns the disease status (affected or unaffected) on the basis of its disease risk by applying the MLM and using a random process.

The main output of the software can be either a single file or several files for a set of subpopulations of a given size produced by means of a subsampling procedure. Subsampling allows bootstrapping procedures to be executed on data sets produced with the same features. The output of GENS2 is in the form of a table in which each row represents an individual and the columns contain, from the left to the right, disease status, gender, environmental exposures and genotypes for each individual.

Two possible formats for the genetics output are available: phased haplotypes or genotypes. In both output formats the initial columns are identical to those described above; however, they differ in the way the genetic information for each individual is represented. In the phased haplotype format, there are two columns for each SNP that report the allele status (either A, C, T or G) on each chromosome. In the genotype format, each SNP is represented by one number (1, 2 or 3), where 2 represents the heterozygote and 3 represents, for DPLs, the high risk homozygote or, for all the other SNPs, the lower frequency homozygote.
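The relation between the two formats can be sketched as a small conversion routine. The function `genotype_code` and its `reference_allele` argument are hypothetical names introduced here, not part of the GENS2 code: the reference allele is the high risk allele for a DPL and the lower frequency allele for any other SNP.

```python
VALID_ALLELES = {"A", "C", "G", "T"}


def genotype_code(allele_1, allele_2, reference_allele):
    """Code a phased SNP as 1, 2 or 3 by counting copies of the reference allele.

    1 = no copies, 2 = heterozygote, 3 = homozygote for the reference allele.
    """
    for allele in (allele_1, allele_2, reference_allele):
        if allele not in VALID_ALLELES:
            raise ValueError(f"unexpected allele: {allele!r}")
    copies = (allele_1 == reference_allele) + (allele_2 == reference_allele)
    return 1 + copies


# A DPL whose high risk allele is T, observed on the two chromosomes.
print(genotype_code("C", "T", reference_allele="T"))
```

Note that the conversion discards phase: the haplotype pairs C/T and T/C map to the same genotype code.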

In addition to the main output file, GENS2 outputs a log file that contains an extensive report of all the intermediate steps in the procedure and the values used to obtain the populations. Optionally, a file containing the

GENS2 is designed to be used either from the command line as a Python script, or through a graphical user interface, similar to a wizard, that prompts the user in the specification of all required parameters [see Additional file

The GENS2 graphic user interface. Flowchart showing a typical way of using GENS2 through its graphical user interface. Portable Network Graphics (.png) image file.

Overall, the computational time complexity of the simulation procedure depends on both simuPOP and GENS2. For GENS2, the procedure is dominated by the assignment of the disease status to all individuals in the population. Indeed, after the KAPS2 module has performed the translation of user provided parameters into MLM parameters in bounded constant time, the time complexity becomes linear in the number of individuals and the number of represented variables (genotypes and environmental exposures) for each individual in the simulated population. On the other hand, the amount of time required to perform a simulation with simuPOP depends on the size of the simulation and scales roughly linearly with the number of markers and individuals that are used

Results and discussion

Here we describe a method based on the MLM to simulate two genetic and one environmental factors interacting in the determination of a disease risk. The method is implemented in GENS2, a software that is freely available.

To test populations produced by GENS2, we performed a set of analyses on some representative populations. The aim was to emulate a case in which GENS2 was used to assess the performances of a feature selection method. In particular, all the analyses were performed using a logistic regression (

The first test was a single-marker analysis on a population of 1,000 cases and 1,000 controls with two DPLs in two distinct genomic regions, with no epistasis and an additive G×E model. The association of each marker with the disease status was tested using logistic regression analyses with the model: disease risk = genetic factor + environmental factor. As expected, the most significant associations were those of DPLs [see Additional file ^{−6}). Furthermore, non-causative markers in LD with the two DPLs also showed a significant association that was roughly proportional to the value of r^{2} with the DPLs.
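When every marker is tested individually, the significance threshold has to account for the number of tests. A minimal sketch of the Bonferroni correction at a 0.05 level and of the −log_{10} scale used in a Manhattan plot follows; the number of markers is a hypothetical value, since the marker count of the tested regions is not stated here.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def minus_log10(p_value):
    """Transform a p-value to the scale used on a Manhattan plot."""
    return -math.log10(p_value)


n_markers = 1000  # hypothetical number of tested markers
threshold = bonferroni_threshold(0.05, n_markers)
print(threshold, round(minus_log10(threshold), 3))
```

A marker is then declared significant when its −log_{10}(p-value) lies above the transformed threshold.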

Association test in the case of additive G×E. The population comprised 1,000 cases and 1,000 controls. Two DPLs (RR=1.6, W=0.5) in an additive G×E model (OR=1.2) with no epistatic interaction were present. The two DPLs are in two distinct genomic regions (Chr 8: 115,755,575-120,750,913 in yellow; Chr 10: 112,253,020-117,247,095 in cyan). In the upper panel, the Manhattan plot shows the significance of the association (−log_{10}(p-value)) of each marker when tested individually (each dot represents a different marker). The red dashed line represents the significance threshold (0.05 after Bonferroni correction) and the green dashed lines mark the position of the DPLs. In the bottom panel, the r^{2} for each marker with the DPL in the same region is shown. Portable Network Graphics (.png) image file.

The second test was similar to the first, except that 10,000 cases and 10,000 controls and a modulative G×E model for the DPLs were used. For this test, the logistic regression was used by considering both an additive model (disease risk = genetic factor + environmental factor) and a multiplicative model (disease risk = genetic factor * environmental factor). None of the markers, when tested with the additive model, reached a Bonferroni corrected significance level [see Additional file

Association test in the case of modulative G×E. The population comprised 10,000 cases and 10,000 controls. Two DPLs (RR=1.6, W=0.5) in a modulative G×E model (OR=1.2) with no epistatic interaction were present. The two DPLs are in two distinct genomic regions (Chr 8: 115,755,575-120,750,913 in yellow; Chr 10: 112,253,020-117,247,095 in cyan). In the upper panel, the two Manhattan plots show the significance of the association (−log_{10}(p-value)) of each marker when tested individually (each dot represents a different marker), using a multiplicative and an additive model in the logistic regression. The red dashed line represents the significance threshold (0.05 after Bonferroni correction) and the green dashed lines mark the position of the DPLs. In the bottom panel, the r^{2} of each marker with the DPL in the same region is shown. Portable Network Graphics (.png) image file.

Finally, we tested an example of two DPLs with no marginal risk, an epistatic interaction ( + 20

**Association test for the case of epistatic interaction.** The population comprised 5,000 cases and 5,000 controls. Two DPLs with no marginal risk (RR=1), an epistatic interaction ( + 20_{10}(p-value)) of each marker when tested individually (each dot represents a different marker). The red dashed line represents the significance threshold (0.05 after Bonferroni correction) and the green dashed lines mark the position of the DPLs. In the middle panel, the r^{2} of each marker with the DPL in the same region is shown. In the bottom panel, the significance of the association for each 2-loci interaction (grey scale, nonsignificant; red scale, significant at a 0.05 level after Bonferroni correction) is shown.

The model described here can handle, in principle, any number of DPLs and environmental variables. However, we chose to limit the implementation to a relatively small number of factors (two genetic and one environmental) so that setting up the model does not become too complicated for the user. In this way, we reached a balance between the complexity of the represented phenomena and simplicity in the definition of the model. Moreover, the best strategy to identify even simple interactions, such as a single G×G or a G×E with binary environmental variables, is still debated (for an example of the debate, see the report on the 2009 Genetic Analysis Workshop

Several methods for simulating genetic data have been proposed, many of which also handle complex LD patterns and polygenic traits

The simulated populations produced with GENS2 can be thought of as a sampling of an ideal infinite population that has the characteristics specified by the user. From this point of view, it is easy to understand that fluctuations of observed values around the expected ones can occur. The elements that most affect these fluctuations are sample size, allele frequencies, and penetrance values. In particular, small sample sizes increase the effect of sampling error and thus, as expected, these fluctuations tend to vanish as the sample size is increased [see Additional file
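The effect of sample size on these fluctuations can be reproduced with a toy experiment that is independent of GENS2. The sketch below draws the disease status of the carriers of one genotype as independent Bernoulli trials and reports the minimum and maximum observed penetrance over 100 replicates; the expected penetrance of 0.1, the sample sizes and the function names are illustrative assumptions.

```python
import random


def observed_penetrance(expected, n_carriers, rng):
    """Fraction of affected carriers when each is affected with probability `expected`."""
    affected = sum(1 for _ in range(n_carriers) if rng.random() < expected)
    return affected / n_carriers


def observed_range(expected, n_carriers, replicates=100, seed=0):
    """Minimum and maximum observed penetrance over independent replicates."""
    rng = random.Random(seed)
    values = [observed_penetrance(expected, n_carriers, rng) for _ in range(replicates)]
    return min(values), max(values)


for n_carriers in (100, 1000, 10000):
    low, high = observed_range(0.1, n_carriers)
    print(n_carriers, round(low, 4), round(high, 4))
```

The interval between the minimum and maximum narrows as the number of carriers grows, which is why rare genotypes (low allele frequencies) show the largest deviations from the expected penetrance.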

Expected and observed penetrance values plotted for each combined genotype and for different sample sizes. In each of the panels one of the possible combined genotypes is shown. The genotypes (1, 2, and 3) are ordered according to their predicted effect on the overall disease risk. The x-axes show the sample size and the y-axes show the risk. The green lines represent the expected risk, the blue lines show the median observed risk, and the red dashed lines indicate the minimum and maximum observed disease risk in 100 replicates. Portable Network Graphics (.png) image file.

Although the GENS2 part of the simulation process is reasonably fast, the procedure to simulate large populations using simuPOP takes time to complete. It would be difficult to simulate a large number of samples without a cluster system, unless multiple (small) samples are drawn from the same large population.

Conclusions

GENS2 allows the simulation of gene-gene and gene-environment interactions among two genetic factors and one environmental factor in relation to the risk of developing a complex disease. It is based on data with a realistic pattern of LD and it has no limitations either on the number of individuals that can be simulated or on the number of genetic and environmental factors within a simulated data set. Furthermore, a large amount of effort has been channeled into allowing the input of parameters as standard epidemiological quantities so that the software is immediately usable by the biomedical community.

GENS2 provides large biologically realistic data sets with known features that can be used to challenge, and eventually improve, the statistical tools that are designed to identify those interactions.

Methods

Here we present the mathematical background underlying the extension of the earlier model

1. the genetics can influence the disease risk either directly or by modifying the effect of the environment,

2. the genetic loci can have independent effects (no epistasis) or can interact in an epistatic manner, and

3. the DPLs are not in LD.

The Multi-Logistic Model

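To model these situations we applied the MLM, here briefly summarized, which uses a different logistic function for each combination of the two genotypes $(g_a, g_b)$ (with $g_a$ and $g_b$ each ranging over the three genotypes of the corresponding DPL) to relate the environmental exposure $x$ to the disease risk $P(x \mid g_a, g_b)$:

$$P(x \mid g_a, g_b) = \frac{1}{1 + e^{-(\alpha_{g_a g_b} + \beta_{g_a g_b} x)}} \qquad (1)$$

where $\alpha_{g_a g_b}$ and $\beta_{g_a g_b}$ are the intercept and slope coefficients of the logistic function associated with the combined genotype $(g_a, g_b)$.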

To simulate a population, the coefficients

Determination of MLM parameters

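Let $f(x)$ be the probability density of the environmental exposure $x$ in the population. Because the DPLs are not in LD, the frequency of the combined genotype $(g_a, g_b)$ is the product $p(g_a)\,p(g_b)$ of the single-locus genotype frequencies. We denote by $R(g_a, g_b)$ the total risk of disease onset associated with $(g_a, g_b)$. The value of this parameter is obtained with the MLM as

$$R(g_a, g_b) = \int P(x \mid g_a, g_b)\, f(x)\, dx \qquad (2)$$

where $P(x \mid g_a, g_b)$ is the logistic function of Eq. (1).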

Because every logistic function in MLM is characterized by its own parameters, the 3×3 pairs of values (

Modeling G×E

In general, equation (2) admits infinitely many solutions. However, the G×E model imposes some constraints on the coefficients. Thus, by fixing the value of one of the coefficients, the slope $\beta_{AB}$ that multiplies the exposure, the number of degrees of freedom of the system can be reduced, drawing one solution from the equation system. By convention, we chose to associate $\beta_{AB}$ with the genotype with the highest risk; it is easy to show that this value corresponds to the natural logarithm of the odds ratio of the risk related to a one-unit increase of the environmental exposure. The constraints imposed on the system by each of the proposed gene-environment interaction models are summarized below:

• Genetic effect (GEN):

• Environmental effect (ENV):

• Modulative effect (GEM):

• Additive effect (ADD):

When the interaction model, the matrix containing the total risk values for each combination of genotypes, namely $R$, and $\beta_{AB}$ have been defined, a set of six transcendental equations can be written with the coefficients of the logistic functions (except $\beta_{AB}$) as the unknown variables; these equations admit exactly one solution
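To see why fixing the slope leaves a single admissible intercept, consider one genotype in isolation. The sketch below assumes that the total risk is the logistic risk averaged over the exposure distribution, that the exposure is Gaussian, and that the integral is approximated with the trapezoidal rule; the function names, the bisection bounds and the example values are hypothetical. Because the averaged risk increases strictly with the intercept, a bisection search finds the unique value that reproduces a target total risk.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def total_risk(alpha, beta, mean, sd, n_steps=2000):
    """Logistic risk averaged over a Gaussian exposure (trapezoidal rule on mean +/- 8 sd)."""
    lower, upper = mean - 8.0 * sd, mean + 8.0 * sd
    step = (upper - lower) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        x = lower + i * step
        weight = 0.5 if i in (0, n_steps) else 1.0
        density = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += weight * logistic(alpha + beta * x) * density
    return total * step


def solve_alpha(target_risk, beta, mean, sd, tol=1e-10):
    """Bisection on the intercept; total_risk is strictly increasing in alpha."""
    low, high = -30.0, 30.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if total_risk(mid, beta, mean, sd) < target_risk:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


beta = math.log(1.2)  # slope for an environmental OR of 1.2 per unit
alpha = solve_alpha(target_risk=0.1, beta=beta, mean=0.0, sd=1.0)
print(round(alpha, 4), round(total_risk(alpha, beta, 0.0, 1.0), 6))
```

With a zero slope the exposure plays no role and the solution reduces to the logit of the target risk, which gives a quick sanity check of the routine.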

Modeling G×G

To determine

Moreover, the total risk values associated with the genotypes of a single locus are related to those of combined genotypes via marginalization:

In general, once the marginals

Notice that the superscript “I” is a reminder that the independent polygenic model has been assumed.
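One simple way to build a combined-genotype risk matrix that respects marginalization is a multiplicative form, in which each entry is the product of the two single-locus risks divided by the prevalence. Whether GENS2 uses exactly this form is an assumption of this sketch, as are the function names and the numerical values (Hardy-Weinberg frequencies for an allele frequency of 0.3, relative risks 1, 1.3 and 1.6, prevalence 0.1).

```python
def independent_risk_matrix(freq_a, risk_a, freq_b, risk_b):
    """Combined-genotype risks under an assumed multiplicative independence model.

    Entry [i][j] is risk_a[i] * risk_b[j] / prevalence, where the prevalence
    is the frequency-weighted mean risk and must be the same for both loci.
    """
    prevalence_a = sum(f * r for f, r in zip(freq_a, risk_a))
    prevalence_b = sum(f * r for f, r in zip(freq_b, risk_b))
    if abs(prevalence_a - prevalence_b) > 1e-9:
        raise ValueError("both loci must imply the same prevalence")
    return [[ra * rb / prevalence_a for rb in risk_b] for ra in risk_a]


def marginal_over_b(matrix, freq_b):
    """Marginalize the second locus out: one total risk per genotype of the first locus."""
    return [sum(f * r for f, r in zip(freq_b, row)) for row in matrix]


# Illustrative single-locus values, reused for both DPLs.
freq = (0.49, 0.42, 0.09)
relative = (1.0, 1.3, 1.6)
baseline = 0.1 / sum(f * r for f, r in zip(freq, relative))
risk = tuple(baseline * r for r in relative)

matrix = independent_risk_matrix(freq, risk, freq, risk)
print([round(value, 4) for value in marginal_over_b(matrix, freq)])
print([round(value, 4) for value in risk])
```

The two printed rows coincide, showing that the marginals of the combined matrix recover the single-locus risks.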

Using an independent polygenic model and a deformation procedure, epistatic interactions among DPLs can be modeled to obtain a matrix

Let $D$ be the matrix of the ratios between $R$ and $R^{I}$, where each entry is $d(g_a, g_b) = R(g_a, g_b) / R^{I}(g_a, g_b)$. By definition

Using the expressions in Eqs. (4) and (5) we get

Because by construction, the matrix

Once the quantities **R**
^{3×3} in which the elements are assignments for the entries of matrix

More precisely (in the two variables case), given the constraints of Eqs. (9) and (10) from one up to three entries for

An objective function can be used to minimize the variance of the set of ratios $R(g_a, g_b) / R^{I}(g_a, g_b)$ corresponding to non-user-assigned

Establishing the disease status

Once the coefficients of the MLM are fixed, the disease risk for each individual in a population can be established by substituting the coefficients associated with the carried genotype into Eq. (1) and then by evaluating the resulting logistic function for the exposure level of the environmental disease factor. Finally, to assign the disease status to each individual, the disease risk is compared with a random number drawn from a uniform distribution.
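The assignment step can be sketched in a few lines. The data layout (a list of genotype/exposure pairs and a dictionary of coefficients keyed by combined genotype), the coefficient values and the function names are hypothetical; the sketch only assumes a logistic function with an intercept and a slope per combined genotype and a comparison with a uniform random number.

```python
import math
import random


def disease_risk(alpha, beta, exposure):
    """Logistic disease risk for the given coefficients and exposure level."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * exposure)))


def assign_status(individuals, coefficients, rng):
    """Return 1 (affected) or 0 (unaffected) for each (genotype, exposure) pair."""
    statuses = []
    for genotype, exposure in individuals:
        alpha, beta = coefficients[genotype]
        risk = disease_risk(alpha, beta, exposure)
        # An individual is affected when the uniform draw falls below the risk.
        statuses.append(1 if rng.random() < risk else 0)
    return statuses


# Hypothetical coefficients for two combined genotypes, coded as in the
# genotype output format (1, 2 or 3 per DPL).
coefficients = {(1, 1): (-2.5, math.log(1.2)), (3, 3): (-1.5, math.log(1.2))}
individuals = [((1, 1), 0.3), ((3, 3), 1.7), ((1, 1), -0.8)]
print(assign_status(individuals, coefficients, random.Random(42)))
```

Passing an explicit `random.Random` instance keeps a simulation reproducible, which is convenient when several subsamples have to be compared.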

Availability and requirements

**Project name:** Gene-Environment iNteraction Simulator 2

**Project home page:**

**Operating system(s):** Platform independent

**Programming language:** Python

**Other requirements:** simuPOP, OpenOpt, wxPython (optional)

**License:** GNU GPLv3

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MP conceived the model and the extensions, and drafted the manuscript; GS conceived and developed the extensions, implemented the software and drafted the manuscript; RA conceived the model and the extensions and drafted the manuscript; SC and GM conceived the study, and participated in its design and coordination and helped to draft the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

RA is the recipient of a fellowship from the Doctorate of Computational Biology and Bioinformatics, University “Federico II”, Naples, Italy. The funders had no role in the study design, data collection and analysis, decision to publish, or in preparation of the manuscript.