Dept. of Animal Sciences, University of Wisconsin, Madison, 53706, USA

Dept. of Dairy Science, University of Wisconsin, Madison, 53706, USA

Dept. of Biostatistics and Medical Informatics, University of Wisconsin, Madison, 53706, USA

Dept. of Animal Sciences, Biometry and Genetics Branch, University of Yuzuncu Yil, Van, 65080, Turkey

Abstract

Background

In the study of associations between genomic data and complex phenotypes there may be relationships that are not amenable to parametric statistical modeling. Such associations have been investigated mainly using single-marker and Bayesian linear regression models that differ in their prior distributions, but that assume additive inheritance while ignoring interactions and non-linearity. When interactions have been included in the model, their effects have entered linearly. There is a growing interest in non-parametric methods for predicting quantitative traits based on reproducing kernel Hilbert spaces regressions on markers and radial basis functions. Artificial neural networks (ANN) provide an alternative, because these act as universal approximators of complex functions and can capture non-linear relationships between predictors and responses, with the interplay among variables learned adaptively. ANNs are interesting candidates for analysis of traits affected by cryptic forms of gene action.

Results

We investigated various Bayesian ANN architectures for predicting phenotypes in two data sets: milk production in Jersey cows and grain yield of inbred wheat lines. For the Jerseys, predictor variables were derived from pedigree and molecular marker (35,798 single nucleotide polymorphisms, SNPs) information on 297 cows. The wheat data represented 599 lines, each genotyped with 1,279 markers. The ability to predict fat, milk and protein yield was low when pedigrees were used, but better when SNPs were employed, irrespective of the ANN trained. Predictive ability was even better in wheat, because the target trait was a line mean, as opposed to an individual phenotype in cows. Non-linear neural networks outperformed a linear model in predictive ability in both data sets, but more clearly in wheat.

Conclusion

Results suggest that neural networks may be useful for predicting complex traits from high-dimensional genomic information, a situation in which the number of unknowns exceeds sample size. ANNs can capture non-linearities adaptively, which may be valuable when accurate prediction of phenotypes is the main goal.

Background

Challenges in the study of associations between genomic variables (e.g., molecular markers) and complex phenotypes include the possible existence of cryptic relationships that may not be amenable to parametric statistical modeling, as well as the high dimensionality of the data, illustrated by the growing number of single nucleotide polymorphisms, now close to 10 million in humans.

There has been a growing interest in the use of non-parametric methods for prediction of quantitative traits based on reproducing kernel Hilbert spaces regressions on markers

In this study we investigated the performance of several ANN architectures trained with Bayesian regularization, a method for coping with the "small n, large p" situation in which the number of unknowns exceeds sample size.

Methods

For clarity, the methodology is presented first, as the main objective of the paper was to cast neural networks in a quantitative genetics predictive context. Subsequently, the two data sets used to illustrate how the Bayesian neural networks were run are described. As stated, the first data set consisted of milk, protein and fat yield in dairy cows; the second represented 599 lines of wheat, with mean grain yield as target trait.

Excursus: Feed-Forward Neural Networks

To illustrate, consider a network with three layers, as shown in Figure

Illustration of the neural networks used

**Illustration of the neural networks used**. In the Jersey data there were 297 elements of pedigree or genomic relationship matrices used as inputs; each input j is connected to each hidden neuron k through a connection strength w_{kj}, and the neuron emissions are combined linearly into the output.

Algebraically, the process can be represented as follows. Let y_{i} be the phenotype of individual i and **p**_{i} = {x_{ij}} (j = 1, 2, ..., p) its vector of inputs. Each of the k = 1, 2, ..., S neurons in the hidden layer computes an emission f_{k}(b_{k} + Σ_{j=1}^{p} w_{kj}x_{ij}), where b_{k} is a neuron-specific bias, the w_{kj} are connection strengths, and f_{k} is an activation function. The emissions are then weighted by coefficients w_{1}, w_{2}, ..., w_{S} and collected at the output. The link between the response variable (phenotype) and the inputs is provided by the model

y_{i} = g(b + Σ_{k=1}^{S} w_{k} f_{k}(b_{k} + Σ_{j=1}^{p} w_{kj}x_{ij})) + e_{i}, (1)

where e_{i} ~ N(0, σ^{2}) and σ^{2} is a variance parameter. If f_{k} and g are linear (identity) functions for all k, the model reduces to a standard linear regression on the inputs, with coefficients formed from products and sums of the w_{k}, b_{k} and w_{kj}.
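As a concrete illustration of equation (1), the forward pass of such a network can be sketched as follows; all names and numeric values are illustrative, assuming a tanh activation in the hidden layer and a linear (identity) collection at the output:

```python
import numpy as np

def ann_predict(x, b, w, B, W, f=np.tanh):
    """Single-hidden-layer network: y_hat = b + sum_k w_k * f(b_k + sum_j W[k, j] * x_j).

    x : (p,) input vector (e.g., relationship coefficients for one individual)
    b : scalar output bias
    w : (S,) output-layer weights, one per hidden neuron
    B : (S,) hidden-layer biases b_k
    W : (S, p) hidden-layer connection strengths w_kj
    f : activation function applied to each neuron's net input
    """
    emissions = f(B + W @ x)      # one emission per hidden neuron
    return b + w @ emissions      # linear collection at the output

# With a linear activation f(t) = t, the network collapses to a linear model:
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]])
b, w, B = 0.2, np.array([1.0, -0.5]), np.array([0.05, -0.05])
linear = ann_predict(x, b, w, B, W, f=lambda t: t)
# Equivalent single linear regression: intercept + beta'x
beta = w @ W
intercept = b + w @ B
assert np.isclose(linear, intercept + beta @ x)
```

The final assertion verifies the remark above: with identity activations, the network is exactly a linear regression whose coefficients are combinations of the connection strengths.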

Fisher's infinitesimal model viewed as a neural network

Let **u** ~ N(**0**, **A**σ^{2}_{u}) be a vector of additive genetic effects, where **A** = {a_{ij}} is the numerator (pedigree) relationship matrix and **A** = **CC'** its Cholesky decomposition. The phenotype vector **t** can then be written in three equivalent ways:

I) **t** = **u** + **e** = **Cz**σ_{u} + **e** = **Cu*** + **e**,

where **z** is a vector of independent standard normal deviates, **u*** = **z**σ_{u} ~ N(**0**, **I**σ^{2}_{u}), and **e** ~ N(**0**, **I**σ^{2}) is a residual vector, with σ^{2} interpretable as environmental variance.

II) **t** = **AA**^{-1}**u** + **e** = **Au**** + **e**,

where **u**** = **A**^{-1}**u** ~ N(**0**, **A**^{-1}σ^{2}_{u}).

III) **t** = **A**^{-1}**Au** + **e** = **A**^{-1}**u***** + **e**,

where **u***** = **Au** ~ N(**0**, **A**^{3}σ^{2}_{u}).
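The algebra behind these three formulations can be checked numerically. The sketch below uses a small, made-up Cholesky factor **C** and verifies that each reparameterization recovers the same **u**, and that the stated covariance matrices follow:

```python
import numpy as np

rng = np.random.default_rng(1)
# A toy numerator relationship matrix A = CC' via its Cholesky factor
C = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.25, 0.5, 1.0]])
A = C @ C.T
sigma2_u = 2.0

# u ~ N(0, A * sigma2_u), generated as C z sigma_u (formulation I)
u = C @ rng.normal(size=3) * np.sqrt(sigma2_u)

# II)  u** = A^{-1} u  so that  A u** = u
u_star2 = np.linalg.solve(A, u)
assert np.allclose(A @ u_star2, u)

# III) u*** = A u  so that  A^{-1} u*** = u
u_star3 = A @ u
assert np.allclose(np.linalg.solve(A, u_star3), u)

# Covariance algebra: Var(A u**) = A A^{-1} A sigma2_u = A sigma2_u
assert np.allclose(A @ np.linalg.inv(A) @ A.T, A)
# Var(u***) = A Var(u) A' = A^3 sigma2_u
assert np.allclose(A @ A @ A.T, np.linalg.matrix_power(A, 3))
```

The covariance checks confirm why the three models are observationally equivalent while distributing the relationship information differently between inputs and connection strengths.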

In each of these formulations, Fisher's model can be viewed as a neural network with a single neuron in the middle layer and linear activation functions throughout. No bias parameter appears; the inputs are rows of **C**, **A** or **A**^{-1}, with the strengths of the connections represented by the corresponding entries of **u***, **u**** and **u*****, respectively.

Is it possible to exploit knowledge of relationships in a fuller manner? Since a neural network is a universal approximator, the predictive performance of the classical infinitesimal linear model can be enhanced, at least potentially, by adopting a non-linear model on the relationships, say

y_{i} = g(b + Σ_{k=1}^{S} w_{k} f_{k}(b_{k} + Σ_{j} w_{kj}a_{ij})) + e_{i}. (2)

Here, the inputs are the entries a_{ij} of the relationship matrix **A**, and b_{k}, w_{kj} and f_{k} are as defined previously.

Given the availability of dense markers in humans and animals, an alternative or complementary source of inputs for equation (2) consists of the elements g_{ij} of a marker-based genomic relationship matrix **G**. Since **G** is proportional to **XX'**, where **X** is the incidence matrix of a linear regression model on markers, a linear network on **G** is equivalent to Bayesian ridge regression. Of course, nothing precludes using both pedigree-derived and marker-derived inputs in the construction of a neural network.
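The stated equivalence between a linear model on **G** ∝ **XX'** and ridge regression on markers can be verified numerically. In this sketch, `lam` is a hypothetical ridge parameter standing in for the ratio of variance components, and the marker codes are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 200                      # many markers, few individuals
X = rng.normal(size=(n, m))         # centered marker codes (simulated)
y = rng.normal(size=n)
lam = 1.5                           # ridge parameter (variance ratio)

# Ridge regression on markers: fitted values X beta_hat
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
fitted_markers = X @ beta_hat

# Equivalent formulation on the n x n matrix G proportional to XX':
# X (X'X + lam I)^{-1} X' y  =  G (G + lam I)^{-1} y  with G = XX'
G = X @ X.T
fitted_G = G @ np.linalg.solve(G + lam * np.eye(n), y)

assert np.allclose(fitted_markers, fitted_G)
```

The second form works with an n × n system rather than an m × m one, which is why relationship matrices make computation feasible when markers vastly outnumber individuals.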

Bayesian regularization

The objective in ANNs is to arrive at some configuration that fits the training data well but also has a reasonable ability to predict yet-to-be-seen observations. This can be achieved by placing constraints on the size of the network connection strengths, e.g., via shrinkage; the process is known as regularization. A natural way of attaining this compromise between goodness of fit and predictive ability is by means of Bayesian methods.

Conditionally on the variance parameters, Bayes theorem gives the posterior density of the connection strengths as

p(**w** | **y**, σ^{2}, σ^{2}_{w}) = p(**y** | **w**, σ^{2}) p(**w** | σ^{2}_{w}) / p(**y** | σ^{2}, σ^{2}_{w}), (4)

where **w** denotes all connection strength coefficients (including all neuron-specific biases); σ^{2} is the residual variance; and the prior **w** | σ^{2}_{w} ~ N(**0**, **I**σ^{2}_{w}) treats all connection strengths as independent Gaussian variables, with σ^{2}_{w} controlling the extent of shrinkage,

where the denominator is the marginal density of the data, that is

For a neural network with at least one non-linear activation function, the integral is expressible as

which does not have a closed form, because of the non-linearity.

Although a Bayesian neural network can be fitted using Markov chain Monte Carlo sampling, the computations are taxing because of the strong non-linearities present, coupled with the high dimensionality of **w**, as is the case with genomic data. An alternative approach is based on computing conditional posterior modes of connection strengths, given likelihood-based estimates of the variance parameters, i.e., as in best linear unbiased prediction (when viewed as a posterior mode) coupled with restricted maximum likelihood (where estimates of variances maximize a marginal likelihood). The conditional (given σ^{2} and σ^{2}_{w}) posterior mode of **w** is obtained from equation (4).

Let

F = βE_{D} + αE_{w},

where β = 1/(2σ^{2}), α = 1/(2σ^{2}_{w}), E_{D} is the residual sum of squares over the training data, and E_{w} = **w'w** is the sum of squared connection strengths. It follows that maximizing the conditional posterior density of **w** is equivalent to minimizing F = βE_{D} + αE_{w}; the minimizer is the value **w**^{MAP} that maximizes the posterior density, where **MAP** stands for "maximum a posteriori".

If the additive infinitesimal model is represented as a neural network, the coefficient of heritability is a function of σ^{2}_{w} and σ^{2}; a large α (small σ^{2}_{w}) relative to β places strong shrinkage on **w**, which produces a less wiggly output function.

Given α and β, the estimates **w** = **w**^{MAP} can be found via any non-linear maximization algorithm, as in, e.g., the threshold and survival analysis models of quantitative genetics.

Tuning parameters α and β

A standard procedure used in neural networks (and in the software employed here) infers α and β by maximizing the marginal likelihood of the data in equation (5); this corresponds to what is often known as empirical Bayes. Because (5) does not have a closed form (except in linear neural networks), the marginal likelihood is approximated using a Laplacian integration done in the vicinity of the current value w = w^{MAP}, which depends in turn on the values of the tuning parameters at which the expansion is made. This type of approach for non-linear mixed models has been used in animal breeding for almost two decades

The Laplacian approximation to the marginal density in equation (5) leads, after differentiation with respect to the tuning parameters, to the updates

α_{new} = γ / (2E_{w}(**w**^{MAP})) and β_{new} = (n - γ) / (2E_{D}(**w**^{MAP})),

where n is the number of observations, **H** is the Hessian matrix of F with respect to **w** (evaluated at **w**^{MAP}, obtained at the "old" values of the tuning parameters), and

γ = N - 2α tr(**H**^{-1})

is the effective number of parameters, with N the total number of connection strengths and biases.

These expressions, as well as (7), are similar to those that arise in maximum likelihood estimation of variance components

More details on computing procedures for neural networks are available elsewhere. Briefly, the algorithm iterates as follows: 1) initialize α, β and **w**; 2) take a step of the Levenberg-Marquardt algorithm to minimize F(**w**); 3) compute the effective number of parameters γ = N - 2α tr(**H**^{-1}); 4) compute α_{new} = γ/(2E_{w}) and β_{new} = (n - γ)/(2E_{D}); 5) repeat steps 2-4 until convergence.
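The iteration just outlined can be sketched for the special case of a linear network, where the MAP step has a closed form; in the non-linear case, step 2 would instead be a Levenberg-Marquardt step on the penalized sum of squares. Function and variable names are illustrative:

```python
import numpy as np

def bayesian_regularization_linear(X, y, n_iter=50):
    """Empirical-Bayes updates of (alpha, beta) for a linear network:
    iterate  w_MAP  ->  gamma = N - 2 alpha tr(H^{-1})
             alpha_new = gamma / (2 E_w),  beta_new = (n - gamma) / (2 E_D)."""
    n, p = X.shape
    alpha, beta = 1.0, 1.0
    for _ in range(n_iter):
        # MAP step (closed form in the linear case)
        w = np.linalg.solve(beta * X.T @ X + alpha * np.eye(p), beta * X.T @ y)
        # Hessian of F = beta*E_D + alpha*E_w with respect to w
        H = 2.0 * beta * X.T @ X + 2.0 * alpha * np.eye(p)
        gamma = p - 2.0 * alpha * np.trace(np.linalg.inv(H))
        E_w = w @ w
        E_D = np.sum((y - X @ w) ** 2)
        alpha, beta = gamma / (2.0 * E_w), (n - gamma) / (2.0 * E_D)
    return w, alpha, beta, gamma

rng = np.random.default_rng(42)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
w, alpha, beta, gamma = bayesian_regularization_linear(X, y)
assert 0 < gamma < p          # effective parameters never exceed the total
assert alpha > 0 and beta > 0
```

Since γ = Σ_i βλ_i/(βλ_i + α) over the eigenvalues λ_i of **X'X**, it always lies strictly between 0 and the number of weights, which is what makes it interpretable as a model-complexity measure.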

Neural Network Architectures Evaluated and Implementation

A prototype of the networks considered is shown in the Figure; the inputs were entries of either the pedigree-based (a_{ij}) or the genomic (g_{ij}) relationship matrix.

where the hyperbolic tangent activation is tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x}).

In models with 2-6 neurons the emissions were always assigned a hyperbolic tangent activation (the choice of function can be based on, e.g., cross-validation); these activations were summed and collected linearly as shown in Figure

where the output-layer weights combine the hidden-layer emissions, and a linear activation is applied at the output.

MATLAB's neural networks toolbox was used for all computations.

The neural networks were fitted to data in the training set, with the α and β parameters, connection strengths and biases modified iteratively. In the Jersey data, as parameters changed in the course of training, the predictive ability of the network was gauged in parallel in the validation set, which was expected to be similar in structure to the testing set, because they were randomly constructed. The same was done with the wheat data, except that there was no "intermediate" validation set. Once the mean squared error of prediction reached an optimal level, training stopped, and this led to the best estimates of the network coefficients. This estimated network was then used for predicting the testing set; predictive correlations (Pearson) and mean-squared errors were evaluated.

Before processing, MATLAB rescales all input and output variables such that they reside in the [-1, +1] range, to enhance numerical stability; this is done automatically using the "mapminmax" function, which applies the linear map y = (y_{max} - y_{min})(x - x_{min})/(x_{max} - x_{min}) + y_{min}. To illustrate, a variable with x_{min} = 3 and x_{max} = 6, rescaled to range between y_{min} = -1 and y_{max} = +1, has its two endpoints mapped to -1 and +1, with intermediate values mapped linearly; network outputs are back-transformed to the original scale.
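The rescaling and its inverse can be sketched as follows (a re-implementation of the mapminmax formula for illustration, not MATLAB's own code):

```python
import numpy as np

def mapminmax_apply(x, x_min, x_max, y_min=-1.0, y_max=1.0):
    """Linear rescaling in the style of MATLAB's mapminmax:
    y = (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min."""
    return (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min

def mapminmax_reverse(y, x_min, x_max, y_min=-1.0, y_max=1.0):
    """Back-transform network outputs to the original scale."""
    return (y - y_min) * (x_max - x_min) / (y_max - y_min) + x_min

x = np.array([3.0, 4.5, 6.0])                 # x_min = 3, x_max = 6
y = mapminmax_apply(x, x.min(), x.max())
assert np.allclose(y, [-1.0, 0.0, 1.0])       # endpoints map to -1 and +1
assert np.allclose(mapminmax_reverse(y, x.min(), x.max()), x)
```

Note that the minimum and maximum used for the map must come from the training set, so values outside the training range map outside [-1, +1].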

MATLAB uses the Levenberg-Marquardt algorithm (based on linearization) for computing the posterior modes in Bayesian regularization, and back-propagation is employed to minimize the penalized residual sum of squares. The maximum number of iterations (called epochs) in back-propagation was set to 1000, and iteration stopped earlier if the gradient of the objective function was below a suitable level or when there were obvious problems with the algorithm

Jersey cows data

Because of the high dimensionality of the genotypic data, the neural networks used either additive or genome-derived relationships among cows as inputs (instead of SNP genotype codes), to make computations feasible in MATLAB. The rationale for this is the representation of the infinitesimal model as a regression on a pedigree-based relationship matrix, or on a matrix proportional to genomic relationships, as argued in the preceding sections.

where x_{ij} = a_{ij} (pedigree-based) or x_{ij} = g_{ij} (genomic) denotes the j-th element of the input vector **p**_{i} for cow i.

The expected additive genetic relationship matrix **A** = {a_{ij}} was computed from the pedigree, and the genomic relationship matrix **G** = {g_{ij}} was constructed as follows: 1) Arrange the SNP genotype codes into a matrix **M**, with typical element m_{ij} for cow i and marker j. 2) Compute the observed frequency p_{j} of one of the alleles at each marker. 3) Form the matrix of expected codes **E**, with typical element e_{ij} = 2p_{j}. 4) Form the estimated genomic relationship matrix (assuming linkage equilibrium among markers) as **G** = **ZZ'** / (2Σ_{j}p_{j}(1 - p_{j})).

The matrix **Z** = **M** - **E** contains "centered" codes, such that the mean of the values in any of its columns is null; **Z** can be used as an incidence matrix in marker-assisted regression models.
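Putting the steps above together, the construction of **G** can be sketched as below; the genotype matrix `M` is a toy example, and allele frequencies are taken as observed column means divided by two (assuming 0/1/2 coding):

```python
import numpy as np

def genomic_relationship(M):
    """Genomic relationship matrix from 0/1/2 SNP codes.

    M : (n, m) matrix of genotype codes m_ij
    Returns G = ZZ' / (2 * sum_j p_j (1 - p_j)), where Z = M - E,
    E has typical element 2*p_j, and p_j is the observed allele frequency.
    Assumes linkage equilibrium among markers, as in the text.
    """
    p = M.mean(axis=0) / 2.0          # observed allele frequencies
    Z = M - 2.0 * p                   # centered codes: columns average zero
    denom = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / denom

M = np.array([[0, 1, 2, 1],
              [1, 1, 0, 2],
              [2, 1, 1, 0]], dtype=float)
G = genomic_relationship(M)
assert G.shape == (3, 3)
assert np.allclose(G, G.T)                                 # symmetric
assert np.allclose((M - 2 * (M.mean(0) / 2)).mean(0), 0)   # Z is centered
```

The resulting n × n matrix (297 × 297 for the Jerseys) is what the networks received as inputs, one row per cow.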

Wheat lines data

There were 599 wheat lines, each genotyped with 1,279 DArT (Diversity Array Technology) markers generated by Triticarte Pty. Ltd. (Canberra, Australia).

Results

Degree of complexity

The effective number of parameters (γ) associated with each of the networks examined in the Jersey data is presented in Table

Effective number of parameters (± standard errors), by trait, in Jerseys.^{1}

| **Network** | **Fat yield (pedigree)** | **Fat yield (genomic)** | **Milk yield (pedigree)** | **Milk yield (genomic)** | **Protein yield (pedigree)** | **Protein yield (genomic)** |
|---|---|---|---|---|---|---|
| Linear | 123 ± 5.6 | 166 ± 2.0 | 124 ± 7.6 | 162 ± 2.9 | 118 ± 8.5 | 151 ± 4.5 |
| 1 neuron | 91 ± 4.9 | 142 ± 2.0 | 93 ± 5.8 | 166 ± 2.0 | 91 ± 10.3 | 144 ± 2.5 |
| 2 neurons | 104 ± 5.8 | 128 ± 7.6 | 122 ± 6.5 | 145 ± 7.8 | 114 ± 8.0 | 136 ± 8.0 |
| 3 neurons | 107 ± 5.8 | 132 ± 5.7 | 123 ± 5.1 | 129 ± 6.0 | 126 ± 6.9 | 141 ± 4.9 |
| 4 neurons | 108 ± 5.8 | 129 ± 4.7 | 112 ± 4.7 | 131 ± 5.8 | 129 ± 5.4 | 138 ± 6.0 |
| 5 neurons | 106 ± 4.9 | 127 ± 4.9 | 118 ± 4.8 | 132 ± 5.4 | 131 ± 4.9 | 138 ± 5.6 |
| 6 neurons | 114 ± 3.3 | 128 ± 7.5 | 122 ± 5.1 | 132 ± 5.6 | 136 ± 4.6 | 137 ± 5.0 |

^{1 }Results are averages of 20 runs based on random partitions of the data

Effective number of parameters obtained from different network architectures in the Jersey data

**Effective number of parameters obtained from different network architectures in the Jersey data**. Results shown are averages of 20 independent runs. "Linear" denotes a 1-neuron model with linear activation functions throughout.

Effective number of parameters, predictive correlations, and mean squared errors of prediction: wheat.^{1}

| **Criterion** | **Linear** | **1 neuron** | **2 neurons** | **3 neurons** | **4 neurons** |
|---|---|---|---|---|---|
| Effective number of parameters | 299 ± 5.5 | 260 ± 6.1 | 253 ± 5.9 | 238 ± 5.5 | 220 ± 2.8 |
| Correlation in testing set | 0.48 ± 0.03 | 0.54 ± 0.03 | 0.56 ± 0.02 | 0.57 ± 0.02 | 0.59 ± 0.02 |
| Mean squared error in testing set | 0.99 ± 0.04 | 0.77 ± 0.03 | 0.74 ± 0.03 | 0.71 ± 0.02 | 0.72 ± 0.02 |

^{1} The training-test partitions were random and repeated 50 times; entries are means ± standard errors.

The effective number of parameters behaved differentially with respect to model architecture and this depended on the input variables used. When using pedigrees in the Jersey data, the hyperbolic tangent activation function in the 1-neuron model reduced γ drastically, relative to the linear model (1 neuron with linear activation throughout). Then, an increment in number of neurons from 2 to 6 increased model complexity relative to that of the 1 neuron model with non-linear activation, but not beyond that attained with the linear model, save for protein yield. For this trait, γ was 118 for the linear model, and ranged from 126 to 136 in models with 3 through 6 neurons. When genomic relationships were used as inputs, γ was largest for the linear model for fat and protein yield, and for the 1-neuron model with a non-linear activation function in the case of milk yield. In wheat, the effective number of parameters decreased as architectures became more complex. There was large variation among runs in effective number of parameters for both data sets, but there was not a clear pattern in the variability.

Predictive performance

Results pertaining to predictive ability evaluated in the testing sets are shown in Table

Prediction mean squared errors (± standard errors) by trait: Jerseys.^{1}

| **Network** | **Fat yield (pedigree)** | **Fat yield (genomic)** | **Milk yield (pedigree)** | **Milk yield (genomic)** | **Protein yield (pedigree)** | **Protein yield (genomic)** |
|---|---|---|---|---|---|---|
| Linear | 1.19 ± 0.07 | 0.86 ± 0.05 | 1.09 ± 0.05 | 0.88 ± 0.04 | 1.00 ± 0.04 | 0.75 ± 0.07 |
| 1 neuron | 1.01 ± 0.04 | 0.74 ± 0.03 | 0.99 ± 0.04 | 0.81 ± 0.03 | 0.97 ± 0.04 | 0.71 ± 0.04 |
| 2 neurons | 0.93 ± 0.05 | 0.70 ± 0.03 | 0.96 ± 0.05 | 0.76 ± 0.04 | 1.02 ± 0.04 | 0.72 ± 0.04 |
| 3 neurons | 0.92 ± 0.04 | 0.71 ± 0.03 | 0.98 ± 0.02 | 0.78 ± 0.04 | 0.96 ± 0.06 | 0.80 ± 0.04 |
| 4 neurons | 0.99 ± 0.04 | 0.84 ± 0.04 | 0.98 ± 0.04 | 0.72 ± 0.04 | 0.90 ± 0.06 | 0.70 ± 0.03 |
| 5 neurons | 0.99 ± 0.04 | 0.86 ± 0.04 | 1.00 ± 0.05 | 0.80 ± 0.04 | 0.93 ± 0.04 | 0.77 ± 0.04 |
| 6 neurons | 0.95 ± 0.03 | 0.77 ± 0.04 | 1.02 ± 0.05 | 0.79 ± 0.03 | 0.95 ± 0.03 | 0.76 ± 0.05 |

^{1} Results are averages of 20 runs based on random partitions of the data.

Correlation coefficients (± standard errors) in the Jersey testing data set, by trait.^{1}

| **Network** | **Fat yield (pedigree)** | **Milk yield (pedigree)** | **Protein yield (pedigree)** | **Fat yield (genomic)** | **Milk yield (genomic)** | **Protein yield (genomic)** |
|---|---|---|---|---|---|---|
| Linear | 0.11 ± 0.04 | 0.07 ± 0.03 | 0.02 ± 0.02 | 0.43 ± 0.02 | 0.42 ± 0.03 | 0.44 ± 0.02 |
| 1 neuron | 0.23 ± 0.03 | 0.10 ± 0.03 | 0.09 ± 0.02 | 0.51 ± 0.02 | 0.45 ± 0.02 | 0.44 ± 0.02 |
| 2 neurons | 0.22 ± 0.03 | 0.08 ± 0.01 | 0.08 ± 0.03 | 0.49 ± 0.02 | 0.46 ± 0.03 | 0.51 ± 0.02 |
| 3 neurons | 0.22 ± 0.02 | 0.13 ± 0.02 | 0.10 ± 0.03 | 0.53 ± 0.02 | 0.52 ± 0.02 | 0.47 ± 0.02 |
| 4 neurons | 0.20 ± 0.02 | 0.09 ± 0.02 | 0.14 ± 0.02 | 0.45 ± 0.03 | 0.52 ± 0.02 | 0.47 ± 0.03 |
| 5 neurons | 0.23 ± 0.02 | 0.13 ± 0.02 | 0.15 ± 0.02 | 0.42 ± 0.03 | 0.50 ± 0.02 | 0.47 ± 0.02 |
| 6 neurons | 0.27 ± 0.02 | 0.10 ± 0.03 | 0.11 ± 0.02 | 0.48 ± 0.04 | 0.54 ± 0.02 | 0.50 ± 0.03 |

^{1} Results are averages of 20 runs based on random partitions of the data.

Prediction mean squared errors in the Jersey testing set (vertical axis) by network

**Prediction mean squared errors in the Jersey testing set (vertical axis) by network**. Results are averages of 20 independent runs. "Linear" denotes a 1-neuron model with linear activation functions throughout.

Correlations between predictions and observations in the Jersey testing data set for the network considered

**Correlations between predictions and observations in the Jersey testing data set for the network considered**. Results shown are averages of 20 independent runs. "Linear" denotes a 1-neuron model with linear activation functions throughout.

The predictive correlations in wheat (Table

In the Jerseys, the large variability in mean squared error among runs (Table

Shrinkage

The distribution of connection strengths in a network gives an indication of the extent of regularization attained. Typically, weight values decrease as model complexity grows, in the same manner that estimates of marker effects become smaller in Bayesian regression models as the number of predictors increases.

Distribution of connection strengths (w_{kj})

**Distribution of connection strengths (w_{kj})**. The linear model has a single-neuron architecture with linear activation functions. a) Fat yield using pedigree relationships: linear model (above) and 6 neurons (below). b) Milk yield using pedigree relationships: linear model (above) and 6 neurons (below). c) Protein yield using pedigree relationships: linear model (above) and 5 neurons (below). d) Fat yield using genomic relationships: linear model (above) and 3 neurons (below). e) Milk yield using genomic relationships: linear model (above) and 6 neurons (below). f) Protein yield using genomic relationships: linear model (above) and 2 neurons (below).

Discussion

We studied models for predicting fat, milk and protein yield in cows, using pedigree and genomic relationship information as inputs, and wheat yield, using molecular markers as predictor variables. This was done using Bayesian regularized neural networks, and predictions were benchmarked against those from a linear neural network, which is a Bayesian ridge regression model. In the wheat data, the comparison was supplemented with results obtained by our group using RKHS or support vector methods. Different network architectures were explored by varying the number of neurons, using a hyperbolic tangent sigmoid activation function in the hidden layer plus a linear activation function in the output layer. This combination has been shown to work well when extrapolating beyond the range of the training data

The Levenberg-Marquardt algorithm, as implemented in MATLAB, was adopted to optimize weights and biases, as this procedure has been effective elsewhere

Because Bayesian neural networks reduce the effective number of weights relative to what would be obtained without regularization, this helps to prevent over-fitting

Our results with ANN for wheat are at least as good as those obtained with the same data in two other studies. Crossa et al.

A question of importance in animal and plant breeding is how an estimated "breeding value", i.e., an estimate of the total additive genetic effect of an individual, can be arrived at from an ANN output. There are at least two ways in which this can be done. One is by posing architectures with a neuron in which the inputs enter in a strictly linear manner, followed by a linear activation function on this neuron; the remaining neurons in the architecture, receiving the same inputs, would be treated non-linearly. A second one is obtained by observing that the infinitesimal model can be written as a linear combination of the inputs, so that a linear component can be extracted from the fitted network.

Let **p**_{i} = {a_{ij}} be the vector of inputs for individual i. A component of the network output that is linear in these inputs can then be extracted; the so-defined breeding value of individual i is thus a function of the estimated connection strengths and of its relationship coefficients.

Another issue is that of assessing the importance of an input relative to that of others. For example, in a linear regression model on markers, one could use a point estimate of the substitution effect or its "studentized" value (i.e., the point estimate divided by the corresponding posterior standard deviation), or some measure that involves estimates of substitution effects and of allelic frequencies. A discussion of some measures of relative importance of inputs in an ANN is in

Conclusion

Non-linear neural networks tended to outperform benchmark linear models in predictive ability, and clearly so in the wheat data. Bayesian regularization allowed estimation of all connection strengths even when the number of unknowns exceeded sample size.

In summary, predictive ability seemed to be enhanced by use of Bayesian neural networks. Due to small sample sizes no claim is made about superiority of any specific non-linear architecture. As it has been observed in many studies, the superiority of one predictive model over another depends on the species, trait and environment, and the same will surely hold for ANNs.

Abbreviations

ANN: artificial neural network; BR: Bayesian regularization; BRANN: Bayesian regularization artificial neural network; LASSO: least absolute shrinkage and selection operator; MAP: maximum a posteriori; NN: neural network; RKHS: reproducing kernel Hilbert spaces regression; SNP: single nucleotide polymorphism; SSE: sum of squared errors.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DG conceived, drafted and wrote the manuscript; HO conceived, carried out the study, performed computations and wrote a part of the manuscript; KAW and GJMR helped to conceive and coordinate the study, provided critical insights and revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

Research was supported by the Wisconsin Agriculture Experiment Station and by grants from Aviagen, Ltd., Newbridge, Scotland, and Igenity/Merial, Duluth, Georgia, USA.