BIPS - Institute for Epidemiology and Prevention Research GmbH, Bremen 28359, Achterstraße 30, Germany

University of Bremen, Institute of Public Health and Nursing Science (IPP), Bremen 28359, Grazer Straße 4, Germany

Abstract

Background

Gene-environment interactions play an important role in the etiological pathway of complex diseases. An appropriate statistical method for handling a wide variety of complex situations involving interactions between variables is still lacking, especially when continuous variables are involved. The aim of this paper is to explore the ability of neural networks to model different structures of gene-environment interactions. A simulation study is set up to compare neural networks with standard logistic regression models. Eight different structures of gene-environment interactions are investigated. These structures are characterized by penetrance functions that are based on sigmoid functions or on combinations of linear and non-linear effects of a continuous environmental factor and a genetic factor with a main effect or with a masking effect only.

Results

In our simulation study, neural networks are more successful in modeling gene-environment interactions than logistic regression models. This outperformance is especially pronounced when modeling sigmoid penetrance functions, when distinguishing between linear and non-linear components, and when modeling masking effects of the genetic factor.

Conclusion

Our study shows that neural networks are a promising approach for analyzing gene-environment interactions. In particular, if no prior knowledge of the true nature of the relationship between co-variables and response variable is available, neural networks provide a valuable alternative to regression methods that are limited to the analysis of linearly separable data.

Background

The etiological pathway of any complex disease can be described as an interplay of genetic and non-genetic underlying causes (e.g.

For modeling complex relationships, especially with little prior knowledge of their exact nature, a more flexible statistical tool should be used. One promising alternative is the use of artificial neural networks. Here, variables do not have to be transformed a priori, and interactions are modeled implicitly, that is, they do not have to be explicitly specified in the model

Since studies using neural networks for modeling continuous co-variables have previously shown promising results (see e.g.

Methods

Simulation study

Case-control data sets are generated using a two-step design. First, underlying populations are simulated with a controlled prevalence of 10% and an overall sample size of five million observations. These populations carry the information of two marginally independent and randomly drawn factors – one biallelic locus and one continuous environmental factor – and a case-control status. The minor allele frequency is set to 30% to ensure sufficient cell frequencies in the final case-control data sets, and Hardy-Weinberg equilibrium is assumed to hold. The environmental factor follows a continuous uniform distribution on the interval [0,100]. Depending on the genotype
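As a sketch, the two-step data generation described above might look as follows; note that the penetrance function and its constants are illustrative placeholders (the actual simulations use the risk models defined below, five million observations, and R):

```python
import math
import random

MAF = 0.3
# Hardy-Weinberg genotype frequencies for a minor allele frequency of 30%
HW = [(1 - MAF) ** 2, 2 * MAF * (1 - MAF), MAF ** 2]  # 0.49, 0.42, 0.09

def draw_genotype(rng):
    # number of minor alleles: 0, 1, or 2
    return rng.choices([0, 1, 2], weights=HW)[0]

def penetrance(g, x):
    # illustrative sigmoid penetrance yielding roughly 10% prevalence;
    # alpha and the slope are placeholder values, not the paper's constants
    alpha = [13.5, 13.0, 12.5][g]
    return 1.0 / (1.0 + math.exp(alpha - 0.15 * x))

rng = random.Random(2)
population = []
for _ in range(100_000):  # the paper uses five million observations
    g = draw_genotype(rng)
    x = rng.uniform(0.0, 100.0)           # continuous environmental factor
    case = rng.random() < penetrance(g, x)
    population.append((g, x, case))

# second step: draw a balanced case-control data set from the population
cases = [r for r in population if r[2]]
controls = [r for r in population if not r[2]]
cc_data = rng.sample(cases, 500) + rng.sample(controls, 500)
```

Repeating the second step 100 times yields the replicated case-control data sets analyzed per situation.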

Artificial neural networks and logistic regression models are fitted to the data, i.e. separately to all 100 case-control data sets for each situation. A multilayer perceptron (MLP, see e.g.

For comparison purposes, logistic regression models are fitted to the same data sets. The genotype is coded in two ways: co-dominantly, counting the number of risk alleles, and via two dichotomous design variables, one representing the heterozygous and one representing the homozygous mutated genotype. Five different models are used: the null model, three main effect models – containing only one or both main effects – and the full model – containing both main effects and one or two interaction terms depending on the genotype coding. For both coding approaches, the best model is selected based on BIC.
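The two genotype codings and the BIC-based model choice can be sketched as follows; the log-likelihood values are hypothetical and only illustrate the selection step:

```python
import math

def codominant(genotype):
    # co-dominant coding: count of risk alleles (0, 1, 2) as one covariate
    return genotype

def design_variables(genotype):
    # two dummy variables: heterozygous and homozygous-mutated indicators
    return (1 if genotype == 1 else 0, 1 if genotype == 2 else 0)

def bic(loglik, n_params, n_obs):
    # Bayesian information criterion: smaller values indicate a better model
    return -2.0 * loglik + n_params * math.log(n_obs)

# model selection sketch: name -> (maximized log-likelihood, no. of parameters)
fits = {
    "null":       (-1386.0, 1),
    "gene only":  (-1370.2, 3),
    "env only":   (-1201.5, 2),
    "both mains": (-1195.3, 4),
    "full model": (-1190.8, 6),
}
best = min(fits, key=lambda m: bic(*fits[m], n_obs=2000))
```

With these toy numbers the environmental main effect model wins; in the simulation the same criterion is applied separately for each genotype coding.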

To assess the model fit of neural networks and logistic regression models, the mean prediction over the 100 data sets is compared to the theoretic risk model of a case-control data set. This theoretic risk model represents a perfectly drawn case-control data set since it reflects the probabilities of the given population and takes into account the changed prevalence in a balanced case-control data set. Mean absolute differences between the theoretic risk model and its predictions are calculated element-wise for an equidistant vector x′ = 0, 0.1, 0.2, …, 100 used as environmental factor, which yields the matrix with entries

d(x′, g) = (1/100) · Σ_{k=1}^{100} | f(x′, g) − f_{k}(x′, g) |, for x′ = 0, 0.1, 0.2, …, 100 and g = 0, 1, 2,

where f(x′, g) refers to the theoretic risk model of the case-control data set and f_{k}(x′, g) to the prediction of the model fitted on the k-th data set.
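The element-wise comparison can be sketched as follows, with `theoretic` and `fitted` standing in for the true penetrance function and the 100 fitted models:

```python
def mean_abs_diff(theoretic, fitted, grid, genotypes=(0, 1, 2)):
    # element-wise mean absolute difference between the theoretic risk model
    # and the predictions of the fitted models, one entry per (x', g)
    return {
        (x, g): sum(abs(theoretic(x, g) - f(x, g)) for f in fitted) / len(fitted)
        for g in genotypes
        for x in grid
    }

grid = [k / 10.0 for k in range(1001)]  # x' = 0, 0.1, ..., 100

# toy check with two hypothetical fitted models deviating by +/- 0.1
truth = lambda x, g: 0.5
fits = [lambda x, g: 0.6, lambda x, g: 0.4]
d = mean_abs_diff(truth, fits, grid)
total = sum(d.values())  # summed criterion as reported in the result tables
```

Summing the entries of this matrix gives the single goodness-of-fit number compared across methods in the tables below.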

Data generation and all analyses are done using R

Theoretic risk models

Two different types of theoretic risk models for gene-environment interactions are used, namely the models introduced by Amato et al.

Risk models by Amato et al.

Amato et al.

The four models are defined as follows:

• the genetic model: α_{1} ≤ α_{2} ≤ α_{3} and β_{1} = β_{2} = β_{3} = 0,

• the environmental model: α_{1} = α_{2} = α_{3} and β_{1} = β_{2} = β_{3} ≠ 0,

• the additive model: α_{1} ≤ α_{2} ≤ α_{3} and β_{1} = β_{2} = β_{3} ≠ 0,

• the interaction model: α_{1} = α_{2} = α_{3} and β_{1} ≤ β_{2} ≤ β_{3}.
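As an illustration, a sigmoid penetrance of this kind can be written as f_g(x) = c + z/(1 + exp(α_g + β_g·x)); the exact parameterization used by Amato et al. may differ, and the constants below are only meant to mimic an interaction-type scenario:

```python
import math

def penetrance(x, g, alpha, beta, c=0.05, z=0.3):
    # sigmoid penetrance; form and constants are illustrative assumptions:
    # c shifts the curve, z scales it, alpha[g]/beta[g] are genotype-specific
    return c + z / (1.0 + math.exp(alpha[g] + beta[g] * x))

# interaction-type scenario: equal intercepts, genotype-dependent slopes
alpha = [7.5, 7.5, 7.5]
beta = [-0.30, -0.15, -0.075]

# with negative slopes the risk increases with the environmental exposure
low = penetrance(10.0, 1, alpha, beta)
high = penetrance(90.0, 1, alpha, beta)
```

A steeper (more negative) slope moves the risk increase to lower exposure values, which is what distinguishes the genotypes in the interaction model.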

To be able to fix the prevalence

The values of α_{g} and β_{g} as well as the constants c and z are chosen per risk model and risk scenario as follows:

Risk models by Amato et al.

| **Risk model** | **Risk scenario** | **Constant values α_{g}** | **Constant values β_{g}** | **Constant values c, z** |
| --- | --- | --- | --- | --- |
| Genetic model | High risk | | β_{0} = β_{1} = β_{2} = 0 | |
| | Low risk | | β_{0} = β_{1} = β_{2} = 0 | |
| Environmental model | High risk | α_{0} = α_{1} = α_{2} = 7.5 | β_{0} = β_{1} = β_{2} = −0.15 | |
| | Low risk | α_{0} = α_{1} = α_{2} = 3.75 | β_{0} = β_{1} = β_{2} = −0.075 | |
| Additive model | High risk | | β_{0} = β_{1} = β_{2} = −0.15 | |
| | Low risk | | β_{0} = β_{1} = β_{2} = −0.075 | |
| Interaction model | High risk | α_{0} = α_{1} = α_{2} = 7.5 | β_{0} = 2 · β_{1}, β_{1} = −0.15, β_{2} = 0.5 · β_{1} | |
| | Low risk | α_{0} = α_{1} = α_{2} = 3.75 | β_{0} = 2 · β_{1}, β_{1} = −0.075, β_{2} = 0.5 · β_{1} | |

Risk models representing a masking effect of the genetic factor

Model 1
High risk (
Low risk (

Model 2
High risk (
Low risk (

Model 3
High risk (
Low risk (

Model 4
High risk (
Low risk (

Theoretic risk models by Amato et al.

**Theoretic risk models by Amato et al., high risk scenario.** The left part of each figure refers to the homozygous wild-type genotype, the middle one to the heterozygous, and the right one to the homozygous mutated genotype.

Risk models representing a masking effect of the genetic factor

In addition, we define four theoretic risk models representing four types of gene-environment interactions where the gene mainly has a masking effect. The kind of functional relationship between the environmental factor and the penetrance again depends on the genotype information. The four theoretic risk models are described in detail in the following:

1. The structure of the first risk model is given by the following penetrance function

2. The second risk model is defined by

3. In the third risk model, the penetrance function is given by

4. For the fourth risk model, the penetrance function is determined as follows:

In each of these four models,

Theoretic risk models representing a masking effect of the genetic factor, high risk scenario

**Theoretic risk models representing a masking effect of the genetic factor, high risk scenario.** The left part of each figure refers to the homozygous wild-type genotype, the middle one to the heterozygous, and the right one to the homozygous mutated genotype.

Real data application

To study the performance of a neural network in a real life situation, we applied this approach to a cross-sectional study dealing with a lifestyle-induced complex disease. This application serves as an example of the general practicability of our approach without describing the study from a subject-matter point of view. The common effect of an SNP and a continuous environmental factor on a binary outcome is investigated while controlling for the effect of one binary confounder. The data set includes 138 cases and 1599 controls. As in the simulation study, neural networks with up to five hidden neurons are each trained five times with randomly initialized weights drawn from a standard normal distribution, and the best neural network is chosen based on BIC. The analysis is done once using the whole data set and once stratified by the confounding factor. For the stratified analysis, 95% bootstrap percentile intervals are calculated using 100 bootstrap replications
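The percentile intervals can be sketched as follows; the data and the statistic (here a simple mean) are toy stand-ins for the stratified predictions:

```python
import random

def bootstrap_percentile_ci(data, statistic, n_boot=100, level=0.95, seed=7):
    # nonparametric bootstrap: resample the data with replacement, recompute
    # the statistic, and take empirical percentiles of the replicates
    rng = random.Random(seed)
    reps = sorted(
        statistic([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)        # ~2.5th percentile
    hi_idx = int((1 + level) / 2 * n_boot) - 1    # ~97.5th percentile
    return reps[lo_idx], reps[hi_idx]

# toy data standing in for a per-genotype risk prediction
sample = [0.08, 0.12, 0.10, 0.09, 0.14, 0.11, 0.07, 0.13, 0.10, 0.12]
ci = bootstrap_percentile_ci(sample, lambda s: sum(s) / len(s))
```

In the application, the resampled statistic is the neural network prediction itself, evaluated on the environmental grid, so that one interval is obtained per grid point and genotype.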

Results

Risk models by Amato et al.

A graphical comparison of the general modeling ability for neural networks and logistic regression models is shown in Figures

Graphical comparison of mean predictions

**Graphical comparison of mean predictions.** Risk models by Amato et al., high risk scenario, x′ = 0, 0.1, 0.2, …, 100 and

Graphical comparison of mean predictions

**Graphical comparison of mean predictions.** Risk models by Amato et al., low risk scenario, x′ = 0, 0.1, 0.2, …, 100 and

These results are also reflected by the sum of the mean absolute differences

Sum of mean absolute differences between theoretic and estimated penetrance function for 100 case-control data sets in the low and high risk scenario for different sample sizes. Bold numbers mark the best model fit comparing neural networks and logistic regression models. DV = design variables.

| Risk model | High risk: Neural network | High risk: Logistic regression | High risk: Logistic regression (DV) | Low risk: Neural network | Low risk: Logistic regression | Low risk: Logistic regression (DV) |
| --- | --- | --- | --- | --- | --- | --- |
| **n = 1000 + 1000** | | | | | | |
| Genetic model | 40.79 | **31.31** | 48.15 | 48.22 | **40.85** | 83.62 |
| Environmental model | **46.14** | 277.11 | 277.11 | **52.45** | 171.61 | 171.36 |
| Additive model | **45.13** | 256.52 | 260.10 | **47.99** | 163.19 | 189.92 |
| Interaction model | **119.77** | 345.77 | 247.93 | **132.47** | 225.61 | 194.37 |
| **n = 500 + 500** | | | | | | |
| Genetic model | 59.28 | **47.14** | 68.22 | **64.27** | 92.02 | 159.80 |
| Environmental model | **60.57** | 277.51 | 277.15 | **90.76** | 174.37 | 174.16 |
| Additive model | **56.10** | 268.11 | 297.62 | **80.66** | 190.25 | 242.34 |
| Interaction model | **138.91** | 344.50 | 268.75 | **153.56** | 233.16 | 210.98 |
| **n = 200 + 200** | | | | | | |
| Genetic model | 101.95 | **85.67** | 152.25 | **97.23** | 167.48 | 207.66 |
| Environmental model | **96.32** | 278.40 | 278.93 | **163.16** | 177.14 | 175.27 |
| Additive model | **96.16** | 329.55 | 374.17 | **177.24** | 246.06 | 292.39 |
| Interaction model | **168.90** | 349.88 | 316.01 | **207.81** | 256.22 | 291.88 |

If the sample size decreases, the modeling ability becomes worse for neural networks as well as for logistic regression models (see Table

Models representing a masking effect of the genetic factor

The general modeling ability for the risk models representing a masking effect of the genetic factor is shown in Figures

Graphical comparison of mean predictions

**Graphical comparison of mean predictions.** Risk models representing a masking effect of the genetic factor, high risk scenario, x′ = 0, 0.1, 0.2, …, 100 and

Graphical comparison of mean predictions

**Graphical comparison of mean predictions.** Risk models representing a masking effect of the genetic factor, low risk scenario, x′ = 0, 0.1, 0.2, …, 100 and

Comparing the sum of the mean absolute differences

Sum of mean absolute differences between theoretic and estimated penetrance function for 100 case-control data sets in the low and high risk scenario for different sample sizes. Bold numbers mark the best model fit comparing neural networks and logistic regression models. DV = design variables.

| Risk model | High risk: Neural network | High risk: Logistic regression | High risk: Logistic regression (DV) | Low risk: Neural network | Low risk: Logistic regression | Low risk: Logistic regression (DV) |
| --- | --- | --- | --- | --- | --- | --- |
| **n = 1000 + 1000** | | | | | | |
| Model 1 | **38.63** | 211.62 | 105.83 | **41.07** | 195.15 | 87.57 |
| Model 2 | **117.94** | 359.10 | 155.40 | **101.92** | 323.89 | 114.71 |
| Model 3 | **40.67** | 253.01 | 85.51 | **43.15** | 258.19 | 65.87 |
| Model 4 | 103.37 | 228.10 | **85.16** | 103.63 | 227.50 | **59.74** |
| **n = 500 + 500** | | | | | | |
| Model 1 | **54.58** | 219.39 | 136.26 | **70.40** | 207.97 | 140.74 |
| Model 2 | **144.35** | 363.36 | 176.74 | 183.28 | 327.58 | **143.06** |
| Model 3 | **60.98** | 261.86 | 110.93 | **66.25** | 278.61 | 114.68 |
| Model 4 | 143.62 | 235.44 | **102.13** | 115.59 | 237.14 | **81.13** |
| **n = 200 + 200** | | | | | | |
| Model 1 | **126.56** | 252.88 | 251.70 | **192.47** | 244.17 | 225.63 |
| Model 2 | 262.92 | 371.69 | **230.25** | 297.68 | 348.46 | **215.70** |
| Model 3 | **139.27** | 324.55 | 215.12 | **141.28** | 328.64 | 191.61 |
| Model 4 | 189.69 | 287.39 | **169.86** | 164.13 | 280.21 | **149.95** |

With decreasing sample sizes, the model fit again becomes worse and the variance increases (data not shown). If the sample size is 500 + 500 subjects, neural networks again have the best model fit for the first three risk models in the high risk scenario. In the low risk scenario, this is only true for the first and the third risk model. A sample size of just 200 + 200 subjects leads to a considerably worse model fit of neural networks. In this situation, logistic regression models with design variables coding the genotype have the best model fit for the second and fourth risk model in both risk scenarios. Neural networks still have the best model fit if the gene has a masking effect only.

Real data application

The results for the real data application are shown in Figure

Real data set application

**Real data set application.** Prediction of the neural network using the whole data set. Two lines per genotype result from the inclusion of a binary confounding factor in the analysis. 138 cases and 1599 controls.

Real data set application, stratified analysis

**Real data set application, stratified analysis.** Mean predictions of the neural network over 100 bootstrap replications (blue lines) and 95% bootstrap confidence intervals (red lines).

Discussion

In this paper, we studied the ability of neural networks and logistic regression models to capture different types of gene-environment interactions. Neural networks were able to predict the theoretic risk models in all sixteen investigated situations such that the prediction intervals contained the true underlying risk models in most situations; they were thus superior to logistic regression models. Logistic regression models without design variables completely failed to model the constant effects. Employing design variables led to a considerably better model fit only when average values over the 100 data sets were considered. Single predictions for one data set often had a misleading form and did not distinguish between linear and non-linear components, especially for the first two risk models. Nevertheless, for risk model 4, logistic regression models using design variables provided a better model fit than neural networks, as can be seen from the mean absolute differences, although the prediction interval did not include the whole true risk model. However, the reason for this finding remains unclear. The real data set application showed the general usability of neural networks in real life situations. Neural networks discovered different risk slopes for each genotype, which also became obvious from the corresponding bootstrap confidence intervals.

Neural networks do not use interaction terms. In our application, they mainly needed one or two hidden neurons if the environmental factor had an effect (risk models by

Logistic regression models belong to the class of generalized linear models and as such are limited in their modeling capacity to linearly separable data. In contrast, neural networks can adapt to any piecewise continuous function. Since linear and non-linear relationships can be modeled simultaneously, neural networks are a promising tool if little is known about the exact relationship between co-variables and a response variable or, especially, if a non-linear relationship is assumed.

In addition, we showed for simulated data assuming neither an association of the genetic nor an association of the environmental factor that neural networks also have a good model fit in this situation (see Figure

Mean prediction of the neural network

**Mean prediction of the neural network.** Risk model assuming no association, x′ = 0, 0.1, 0.2, …, 100 and

Thus, our results suggest that neural networks can be a valuable approach already in the situation of 500 cases and 500 controls. However, there are two main drawbacks of neural networks. First, the computing time needed to train them is very high. In our application, the analyses for one situation (100 replications, six network topologies each) sometimes took more than 30 hours. Second, neural networks are still considered a black-box approach since neither the network topology nor the trained weights have a direct interpretation. Thus, there is no established way for model selection and parameter testing. One possibility to estimate the effect of a co-variable is provided by the concept of generalized weights

We assumed the environmental factor to be uniformly distributed over the interval [0,100]. In practice, bell-shaped distributions for environmental factors might also be of interest. Here, it can be expected that a higher sample size is necessary to enable the statistical method to detect the true shape of the underlying risk function also at the margins. Additionally, we assumed a minor allele frequency of 30%. In a sensitivity analysis, we repeated the simulation study with a minor allele frequency of 5% (see Table

Sum of mean absolute differences between theoretic and estimated penetrance function for 100 case-control data sets in the low and high risk scenario for different sample sizes. Bold numbers mark the best model fit comparing neural networks and logistic regression models. DV = design variables. ^{∗}Predictions were calculated for all models that do not have unspecified parameters due to empty cells.

| Risk model | High risk: Neural network | High risk: Logistic regression | High risk: Logistic regression (DV) | Low risk: Neural network | Low risk: Logistic regression | Low risk: Logistic regression (DV) |
| --- | --- | --- | --- | --- | --- | --- |
| **n = 1000 + 1000** | | | | | | |
| Genetic model | **80.29** | 80.39 | 303.07^{∗} | **87.65** | 209.74 | 249.96 |
| Environmental model | **79.60** | 278.32 | 277.18 | **78.18** | 170.94 | 170.94 |
| Additive model | **74.67** | 369.57 | 443.10 | **92.18** | 303.98 | 348.50 |
| Interaction model | **180.02** | 415.60 | 541.02^{∗} | **191.77** | 327.44 | 481.62^{∗} |
| Model 1 | **113.62** | 244.87 | 375.43^{∗} | **179.23** | 226.03 | 355.59^{∗} |
| Model 2 | **232.75** | 389.70 | 472.47^{∗} | **318.57** | 346.57 | 460.08^{∗} |
| Model 3 | 253.00 | **230.12** | 232.20 | 256.38 | **253.67** | 254.80 |
| Model 4 | 133.91 | 126.27 | **97.92** | 138.28 | 132.11 | **93.04** |

Conclusions

To the best of our knowledge, neural networks have not been used for modeling gene-environment interactions so far. In other contexts, MLPs were clearly superior to logistic regression models

In practice, neural networks can be applied in case-control studies to investigate the common effect of two genetic factors or of one genetic and one environmental factor. Since the functional form of the model does not have to be specified for neural networks, it is not necessary to know whether the two involved factors indeed have an effect on the disease or whether an interaction between both factors is present. The prediction of a neural network generates insight into the kind of relationship between co-variables and disease, for example, whether the underlying relationship is non-linear or whether the relationships differ per genotype. Thus, although there is still need for further research regarding the interpretability of neural networks, they are already a valuable statistical tool, especially for exploratory analyses and/or when little is known about the functional relationship between risk factors and the investigated disease.

Appendix

Artificial neural networks

The general idea of a multilayer perceptron (MLP) is to approximate functional relationships between co-variables and response variable(s). It consists of neurons and synapses that are organized as a weighted directed graph. The neurons are arranged in layers, and subsequent layers are usually fully connected by synapses. Each synapse carries a weight indicating the effect of this synapse. A positive weight indicates an amplifying effect, a negative weight a repressing one. Neural networks have to be trained using a learning algorithm that adjusts the synaptic weights according to given data. The learning algorithm minimizes the deviation between predicted output and given response variable as measured by an error function.

Data pass through the MLP as signals. This process starts at the input layer, consisting of all co-variables and a constant neuron, and stops at the output layer, consisting of the response variable(s). Hidden neurons can be included in several layers between the input and output layer to increase the modeling flexibility. These hidden layers are neither directly observable nor directly controlled by the data. See Figure

A multilayer perceptron

**A multilayer perceptron.** An MLP with one hidden layer consisting of three hidden neurons.

An MLP with one hidden layer is able to fit any piecewise continuous function. Its output is calculated as

f(**x**) = σ( w_{0} + Σ_{j=1}^{J} w_{j} · σ( w_{0j} + Σ_{i=1}^{I} w_{ij} x_{i} ) ),

where w_{0}, w_{j}, and w_{ij} denote the synaptic weights, **x** = (x_{1}, …, x_{I}) the vector of co-variables, J the number of hidden neurons, and σ(·) the activation function, for example the logistic function.
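A minimal forward pass through such an MLP, assuming a logistic activation function and toy weights, can be sketched as:

```python
import math

def sigma(t):
    # logistic activation function
    return 1.0 / (1.0 + math.exp(-t))

def mlp(x, w_hidden, w_out):
    # one-hidden-layer MLP: each w_hidden[j] = (w_0j, w_1j, ..., w_Ij),
    # w_out = (w_0, w_1, ..., w_J); w_0 and w_0j belong to the constant neurons
    h = [sigma(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
         for w in w_hidden]
    return sigma(w_out[0] + sum(wj * hj for wj, hj in zip(w_out[1:], h)))

# toy weights: two co-variables (genotype, environmental factor),
# three hidden neurons, one output neuron
w_hidden = [(0.2, -0.5, 0.03), (-0.1, 0.4, -0.02), (0.05, 0.1, 0.01)]
w_out = (-0.3, 0.8, -0.6, 0.5)
p = mlp((1.0, 50.0), w_hidden, w_out)   # predicted disease probability
```

Because the output neuron applies the logistic function, the prediction can be read directly as a probability, which is what allows the comparison with logistic regression.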

To train neural networks according to the case-control data sets, resilient backpropagation
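The basic Rprop update rule (shown here in its simplest form, without the weight-backtracking used by some variants) adapts an individual step size per weight from the sign of the partial derivative; a sketch:

```python
def rprop_step(grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    # resilient backpropagation uses only the sign of the partial derivative:
    # the step size grows while the sign is stable and shrinks after a change
    if grad * prev_grad > 0:
        step = min(step * eta_plus, step_max)
    elif grad * prev_grad < 0:
        step = max(step * eta_minus, step_min)
    # move the weight against the gradient by the adapted step size
    delta = -step if grad > 0 else (step if grad < 0 else 0.0)
    return delta, step

delta, step = rprop_step(grad=1.0, prev_grad=1.0, step=0.1)
# stable sign: step grows to 0.12 and the weight is decreased by 0.12
```

Because only the sign of the gradient is used, the update is robust against the vanishing gradient magnitudes that plain backpropagation suffers from with sigmoid activations.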

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

FG planned and carried out the simulation study and drafted the manuscript. IP drafted the manuscript. KB planned the simulation study and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We gratefully acknowledge the financial support for this research by the grant PI 345/3-1 from the German Research Foundation (DFG).

We would like to thank two anonymous reviewers for their valuable remarks.