Department of Industrial and Information Engineering, University of Pavia, Pavia, 27100, Italy

IRCCS Fondazione Salvatore Maugeri, Pavia, 27100, Italy

ICB, Weill Cornell Medical College, New York, USA

Abstract

Background

Genome Wide Association Studies represent powerful approaches aimed at disentangling the genetic and molecular mechanisms underlying complex traits. The usual "one-SNP-at-a-time" testing strategy cannot capture the multi-factorial nature of these disorders. We propose a Hierarchical Naïve Bayes classification model that accounts for associations among SNPs characterized by Linkage Disequilibrium. Validation shows that our model reaches classification performances superior to those obtained by the standard Naïve Bayes classifier on both simulated and real datasets.

Methods

In the Hierarchical Naïve Bayes implemented here, the SNPs mapping to the same region of Linkage Disequilibrium are considered as "details" or "replicates" of the locus, each contributing to the overall effect of the region on the phenotype. A latent variable for each block, which models the "population" of correlated SNPs, can then be used to summarize the available information. Classification is thus performed relying on the conditional probability distributions of the latent variables and on the available SNPs data.

Results

The developed methodology has been tested on simulated datasets, each composed of 300 cases, 300 controls and a variable number of SNPs. Our approach has also been applied to two real datasets on the genetic bases of Type 1 Diabetes and Type 2 Diabetes, generated by the Wellcome Trust Case Control Consortium.

Conclusions

The approach proposed in this paper, called Hierarchical Naïve Bayes, allows classifying examples for which genetic information on structurally correlated SNPs is available. It improves on the Naïve Bayes performance by properly handling the within-loci variability.

Background

In the last few years, the advent of massive genotyping technologies has allowed researchers to define individual genetic characteristics on a whole-genome scale. These advances boosted the diffusion of Genome Wide Association Studies (GWASs), transforming them from expensive instruments of investigation into relatively affordable, popular and powerful research tools. For this reason, they have been extensively applied to the study of the most prevalent disorders.

As a matter of fact, most of the common diseases (e.g. diabetes mellitus, obesity, arterial hypertension, etc.) belong to the category of complex traits.

To date, from the statistical viewpoint, the main limitation to the full exploitation of GWAS results is the lack of appropriate multivariate tools that can replace the usual univariate testing strategies commonly used during the discovery phase of a GWAS. In standard univariate analyses, rules for defining statistically significant associations are usually based on over-conservative significance thresholds, imposed to minimize the probability of false positive associations. The main drawback of these approaches is that they tend to discard potentially informative signals carried by genetic loci with small effects on the trait.

In this context, multivariate models could overcome the limitations of the usual "one-SNP-at-a-time" testing strategies, offering the possibility of exploring and integrating the huge amount of information deriving both from whole genome screenings and from clinical/phenotypic measurements.

Beside logistic regression (LR), which represents the most common approach for building multivariate models from SNPs data

Recently, Lee

Multivariate models, however, can hardly be learned from GWAS data due to the so-called "curse of dimensionality": the number of genotyped markers greatly exceeds the number of available samples.

Bayesian methods, and in particular Bayesian Hierarchical Models (BHMs), represent a promising framework for deriving information from large sets of variables by exploiting available prior knowledge.

In our paper, we exploit the capability of such models to use knowledge about the correlation structure of the variables. Chromosome regions, represented by sequences of nearby SNPs, are often characterized by strong pairwise correlation, making the available information redundant and difficult to analyze. Hierarchical (multilevel) models provide a way of pooling the information of correlated variables without assuming that they can be modelled as a unique variable.

BHMs have already been applied in a variety of biomedical contexts. They have been proposed as a fundamental tool to analyze next generation genomics data.

In the context of GWASs, we propose a Hierarchical Naïve Bayes (HNB) classification model that allows capturing the uncertainty of the information deriving from a set of genetic markers that are functionally/structurally correlated, and using this information to classify new examples. SNPs that do not fall within such regions, as well as clinically relevant variables (e.g., gender, smoking, therapies, candidate markers), can also be included in the model (Figure

Graphical representation of a genome region

**Graphical representation of a genome region**.

The following sections describe the main methodological aspects of the implemented algorithm, as well as the results obtained on both simulated datasets and two real GWASs on the genetic bases of Type 1 Diabetes (T1D) and Type 2 Diabetes (T2D), generated by the Wellcome Trust Case Control Consortium (WTCCC).

Methods

The Hierarchical Naïve Bayes classifier (HNB) is an extension of the well-known Naïve Bayes classifier (NB). NB assumes that, given a class variable c, the feature variables f_1, ..., f_nf are conditionally independent, so that the joint likelihood factorizes into a product of univariate terms.

HNB assumes that the measurements are stochastic variables with a hierarchical structure in terms of their probability distributions. We suppose that, for each variable, we can collect a number of replicate measurements for every example, each regarded as a realization drawn from an example-specific distribution.

The hierarchical structure of the data represented with plates notation

**The hierarchical structure of the data represented with plates notation**.

In a Bayesian framework, the classification step is therefore performed by finding the class with the highest posterior probability. Thanks to the conditional independence assumptions of the hierarchical model described above, the marginal likelihood factorizes over the variables, so that the contribution of each variable X_k can be computed separately.

Many replicates are available for each example. Each example is characterized by an individual vector of parameters θ, drawn from a common population distribution:

where Ω_θ denotes the parameter space of θ.

The learning problem therefore consists in estimating the population parameters from the available data.

From the computational viewpoint, this allows us to compute the marginal likelihood of each variable separately when performing classification, and to learn a collection of independent univariate models. In the following we show how HNB deals with the classification and learning problems when the variables are discrete with multinomial distribution.
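As an illustration of this factorization, the decision rule can be sketched as follows. This is a simplified rendering of ours, not the authors' implementation: it plugs in point estimates of each block's class-conditional genotype distribution instead of integrating over the latent parameters, and all names (`model`, `classify`, etc.) are our own.

```python
from math import log

def block_log_likelihood(block_genotypes, theta):
    # Replicate SNPs within one LD block are treated as draws from the
    # block's class-conditional genotype distribution theta = (aa, aA, AA).
    return sum(log(theta[g]) for g in block_genotypes)

def classify(example_blocks, model, priors):
    # example_blocks: one list of genotype codes (0/1/2) per LD block.
    # model[c][b]: estimated genotype distribution of block b under class c.
    # Factors are combined in log-space to avoid numerical underflow.
    scores = {}
    for c, prior in priors.items():
        scores[c] = log(prior) + sum(
            block_log_likelihood(blk, model[c][b])
            for b, blk in enumerate(example_blocks))
    return max(scores, key=scores.get)
```

Each LD block contributes one likelihood factor computed from its replicate SNPs, mirroring the per-variable decomposition described above.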

Hierarchical Naïve Bayes for discrete variables

In a SNPs based case-control GWAS, the individual-level information is represented by genotype configurations (aa/aA/AA). For the sake of readability, we omit the dependence of the vectors on the class.

We also assume that the relationship between the data x_i of the i-th example and its individual parameters θ_i is multinomial.

Therefore, each X_i is a discrete variable whose states occur with probabilities collected in the parameter vector θ_i,

with probability density:

where 0 < θ_j < 1 and the θ_j sum to one over the states j = 1, ..., S.

Classification

As described in the previous section, the classification problem requires the computation of the marginal likelihood (1). We assume that an estimate of the population parameters is available from the learning step:

where the integration is performed over the parameter space Ω_θ.

The marginal likelihood can be thus computed as:

The NB approach allows exploiting this equation for each variable in the problem at hand, and then applying equation (2) to perform the classification. The marginal likelihood, however, requires an estimate of the population parameters.

Learning with collapsing

The task of learning the population parameters can be performed by resorting to approximate techniques. Herein we describe a strategy previously presented by

We suppose that a data set D = {x_1, ..., x_N} is available, where each x_i = (x_i1, ..., x_iS) collects the S replicate measurements of the i-th example.

The collapsing strategy assumes that the replicates of each example can be pooled and described by a single parameter vector θ_i.

Such an assumption can be justified by the calculation of the first and second moments of P(X*|ξ), which are computed by approximating the distribution of the parameters.

The Maximum Likelihood (ML) estimate of the parameters can then be derived from the pooled counts.

Within this framework we can also provide a Bayesian estimate of the population parameters, by assuming a Dirichlet prior with hyperparameters γ_1, ..., γ_S on the parameters θ_j.

After collapsing, we may derive the posterior distribution of the parameters.

In this setting, the parameter vector retains a Dirichlet distribution, with hyperparameters updated by the observed counts.
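The resulting conjugate update can be sketched numerically. This is a minimal illustration of the standard Dirichlet-multinomial update under the collapsing assumption; the function names and the example counts are ours, not the paper's.

```python
def dirichlet_posterior(counts, gamma):
    # Conjugate update: posterior hyperparameters gamma_j' = gamma_j + n_j,
    # where n_j are the pooled multinomial counts after collapsing.
    return [g + n for g, n in zip(gamma, counts)]

def posterior_mean(gamma_post):
    # Bayesian point estimate of the multinomial parameters.
    total = sum(gamma_post)
    return [g / total for g in gamma_post]

# Genotype counts (aa, aA, AA) pooled over the replicate SNPs of one block,
# smoothed with a uniform Dirichlet prior gamma = (1, 1, 1).
post = dirichlet_posterior([12, 6, 2], [1, 1, 1])   # -> [13, 7, 3]
means = posterior_mean(post)
```

The prior hyperparameters act as pseudo-counts, so states never observed in the sample still receive non-zero probability.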

Building the model

The HBN machinery can be conveniently exploited to build a multivariate model for SNPs coming from a GWAS. In presence of regions in which non-random association of alleles at two or more loci or Linkage Disequilibrium (LD) is observed

Figure

The hierarchical structure of the data represented with the plates notation using SNPs data

**The hierarchical structure of the data represented with the plates notation using SNPs data**.

Results

Datasets simulation

A total of 9 independent datasets, each composed of 300 cases, 300 controls and approximately 34,000 SNPs (representing the whole chromosome 22), have been simulated by the Hapgen software, according to three disease-model scenarios:

• heterozygote/homozygote Genotype Relative Risk (GRR) = 1.5/3.0;

• GRR = 2.0/4.0;

• GRR = 3.0/6.0.

Three simulated datasets have been generated according to each scenario, by imposing Minor Allele Frequency (MAF) ≥ 0.05.
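The MAF constraint can be sketched as follows. This is our own minimal illustration (function names are ours), with genotypes coded as counts of one allele per individual:

```python
def minor_allele_freq(genotypes):
    # Genotypes coded as counts of one allele (0, 1 or 2 copies);
    # the MAF is the frequency of the rarer of the two alleles.
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)

def filter_by_maf(snp_columns, threshold=0.05):
    # Keep only SNPs satisfying the simulation constraint MAF >= threshold.
    return [snp for snp in snp_columns if minor_allele_freq(snp) >= threshold]
```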

Experimental datasets

The experimental case control datasets were represented by two genome-wide scans on T1D and T2D generated by the WTCCC consortium.

Genotyped samples underwent a preliminary phase of data quality control (QC), which comprised the removal of cases and controls showing: i) missing data fraction > 3%; ii) heterozygous genotype fraction > 0.3; iii) discordances or gaps between phenotype and laboratory information; iv) non-European ancestry; v) 1^{st}/2^{nd }degree relatives; vi) duplicated samples. Analogously, SNP QC consisted in removing markers characterized by: i) study-wise missing data proportion > 5%; ii) Hardy-Weinberg Equilibrium test p-value < 5.7 × 10^{-7}; iii) 1 df Trend Test/2 df General Test p-value < 5.7 × 10^{-7 }comparing allele and genotype frequencies between control groups; iv) bad clustering quality.

For a more detailed description of sample selection, genotyping procedures and quality control filters applied, the reader may refer to

Data pre-processing

Both simulated and experimental datasets underwent a preliminary phase of feature selection and variable filtering aimed at i) reducing the space of the hypotheses to be tested and ii) isolating chromosome regions characterized by strong LD.

The main steps of the datasets preparation are reported below:

1. Each whole dataset has been split randomly into a screening set (70% of the whole dataset) and a replication set (the remaining 30%). The sampling procedure has been performed with stratification, so that each set contained the same proportion of cases and controls.

2. On each screening set:

a. Select the top 500 most significant markers, based on the results of univariate Pearson

b. Define chromosome regions characterized by the presence of nearby SNPs showing pairwise r^{2 }above a given threshold, set to r^{2 }= 0.6 (SNPs in moderate-to-strong LD) and 0.8 (SNPs in strong LD) respectively.

i. Group markers localized within the same LD-block and build the latent variables.

ii. Use the remaining SNPs falling outside the LD-blocks as covariates.

c. Split the whole screening set into 10 folds of equal sample size, each with a case/control ratio of 1, according to the 10 Folds Cross Validation procedure (10 Folds CV).

3. Apply the LD-based SNPs grouping schema learnt on the screening set to the corresponding replication set.
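Step 2b above (grouping nearby SNPs by pairwise r^{2}) can be sketched as follows. This is a simplified greedy illustration of ours based on the squared genotype correlation (a common composite approximation of LD when haplotype phase is unknown); the exact block definition used in the paper may differ.

```python
def r2(snp_a, snp_b):
    # Squared Pearson correlation between two genotype vectors (0/1/2 coding).
    n = len(snp_a)
    ma, mb = sum(snp_a) / n, sum(snp_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(snp_a, snp_b)) / n
    va = sum((a - ma) ** 2 for a in snp_a) / n
    vb = sum((b - mb) ** 2 for b in snp_b) / n
    return 0.0 if va == 0 or vb == 0 else cov * cov / (va * vb)

def ld_blocks(snps, thr=0.8):
    # Greedy grouping of *adjacent* SNPs: extend the current block while the
    # next SNP is in LD (r^2 >= thr) with the last SNP already in the block.
    blocks, current = [], [0]
    for j in range(1, len(snps)):
        if r2(snps[current[-1]], snps[j]) >= thr:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks
```

SNPs left in singleton blocks correspond to the markers used as covariates in step 2b-ii.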

Both screening and replication sets have been employed to evaluate the generalization performances of the HNB algorithm and to compare them with those obtained by the standard NB classifier on the same datasets.

Results from simulated datasets

The HNB algorithm has been validated on simulated datasets, which underwent the pre-processing phases described in the previous sections.

Descriptive analyses of the simulated datasets revealed that the number of blocks to be analyzed increased with the stringency of the r^{2 }threshold imposed for defining regions of correlation, while the median number of SNPs within each block decreased. This is because SNPs linked by strong correlation (r^{2 }≥ 0.8) are generally confined to small and fragmented regions, due to structural recombination events. Table

Characteristics of the simulated datasets.

| **sim** | **GRR** | **B (thr. 0.60)** | **SNPs/B (thr. 0.60)** | **r^{2 }(thr. 0.60)** | **B (thr. 0.80)** | **SNPs/B (thr. 0.80)** | **r^{2 }(thr. 0.80)** |
|---|---|---|---|---|---|---|---|
| 1 | 1.5/3.0 | 43 | 5.0 [6.50] | 0.94 [0.13] | 63 | 3 [5.50] | 0.97 [0.06] |
| 2 | 1.5/3.0 | 36 | 6.5 [11.50] | 0.95 [0.08] | 55 | 4 [6.00] | 0.98 [0.05] |
| 3 | 1.5/3.0 | 58 | 3.5 [5.00] | 0.97 [0.07] | 76 | 3 [3.00] | 0.98 [0.06] |
| 4 | 2.0/4.0 | 24 | 8.5 [29.50] | 0.97 [0.07] | 67 | 4 [4.00] | 0.98 [0.06] |
| 5 | 2.0/4.0 | 34 | 4.5 [14.00] | 0.95 [0.17] | 61 | 3 [6.00] | 0.98 [0.09] |
| 6 | 2.0/4.0 | 39 | 5.0 [6.50] | 0.97 [0.19] | 70 | 4 [3.75] | 0.99 [0.05] |
| 7 | 3.0/6.0 | 22 | 9.0 [28.25] | 0.96 [0.07] | 49 | 5 [6.00] | 0.98 [0.07] |
| 8 | 3.0/6.0 | 45 | 5.0 [10.00] | 0.98 [0.10] | 80 | 3 [3.00] | 0.98 [0.06] |
| 9 | 3.0/6.0 | 34 | 8.5 [14.50] | 0.93 [0.11] | 72 | 3 [3.00] | 0.96 [0.09] |

GRR, heterozygote/homozygote Genotype Relative Risk (GRR); B, number of blocks; SNPs/B, median number [Interquartile Range (IQR)] of SNPs within each block; r^{2}, median [Interquartile Range (IQR)] pairwise r^{2 }within each block. The described parameters are reported for blocks defined using thresholds of LD corresponding to r^{2 }≥ 0.6 and 0.8 respectively.

The generalization performances of the two algorithms have been evaluated by comparing the Classification Accuracy (CA) and the Area Under the Curve (AUC) of the two models, estimated both by 10 Folds CV procedures and by testing the models learnt on each screening set on the corresponding independent replication set, when SNPs in moderate (r^{2 }≥ 0.6) or strong (r^{2 }≥ 0.8) pairwise LD are analyzed.
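The two metrics can be computed in a few lines of code. This is a generic sketch of ours, not tied to the authors' implementation; the AUC is obtained via its rank-based (Mann-Whitney) interpretation.

```python
def accuracy(labels, predicted):
    # Classification Accuracy: fraction of correctly classified examples.
    return sum(l == p for l, p in zip(labels, predicted)) / len(labels)

def auc(labels, scores):
    # Rank-based AUC (Mann-Whitney statistic): probability that a randomly
    # chosen case (label 1) scores higher than a randomly chosen control
    # (label 0); ties count one half.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```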

Results from the analysis of simulated datasets

| **sim** | **GRR** | **LD thr.** | **Model** | **CA (10 Folds CV)** | **AUC (10 Folds CV)** | **CA (Ind. Test)** | **AUC (Ind. Test)** |
|---|---|---|---|---|---|---|---|
| 1 | 1.5/3.0 | r^{2 }≥ 0.6 | HNB | 0.85 [0.81-0.87] | 0.92 [0.91-0.95] | 0.64 | 0.66 |
| 1 | 1.5/3.0 | r^{2 }≥ 0.6 | NB | 0.80 [0.78-0.82] | 0.90 [0.89-0.90] | 0.69 | 0.70 |
| 1 | 1.5/3.0 | r^{2 }≥ 0.8 | HNB | 0.85 [0.81-0.89] | 0.93 [0.90-0.95] | 0.63 | 0.68 |
| 1 | 1.5/3.0 | r^{2 }≥ 0.8 | NB | 0.80 [0.78-0.82] | 0.90 [0.89-0.90] | 0.69 | 0.70 |
| 2 | 1.5/3.0 | r^{2 }≥ 0.6 | HNB | 0.87 [0.83-0.93] | 0.94 [0.89-0.98] | 0.63 | 0.68 |
| 2 | 1.5/3.0 | r^{2 }≥ 0.6 | NB | 0.83 [0.80-0.83] | 0.87 [0.84-0.90] | 0.59 | 0.63 |
| 2 | 1.5/3.0 | r^{2 }≥ 0.8 | HNB | 0.85 [0.80-0.87] | 0.92 [0.88-0.94] | 0.65 | 0.70 |
| 2 | 1.5/3.0 | r^{2 }≥ 0.8 | NB | 0.83 [0.80-0.83] | 0.87 [0.84-0.90] | 0.59 | 0.63 |
| 3 | 1.5/3.0 | r^{2 }≥ 0.6 | HNB | 0.73 [0.70-0.77] | 0.82 [0.76-0.85] | 0.65 | 0.72 |
| 3 | 1.5/3.0 | r^{2 }≥ 0.6 | NB | 0.78 [0.69-0.80] | 0.86 [0.77-0.94] | 0.68 | 0.75 |
| 3 | 1.5/3.0 | r^{2 }≥ 0.8 | HNB | 0.77 [0.70-0.80] | 0.85 [0.80-0.88] | 0.71 | 0.75 |
| 3 | 1.5/3.0 | r^{2 }≥ 0.8 | NB | 0.78 [0.69-0.80] | 0.86 [0.77-0.94] | 0.68 | 0.75 |
| 4 | 2.0/4.0 | r^{2 }≥ 0.6 | HNB | 0.78 [0.72-0.84] | 0.85 [0.80-0.89] | 0.74 | 0.80 |
| 4 | 2.0/4.0 | r^{2 }≥ 0.6 | NB | 0.72 [0.64-0.81] | 0.76 [0.72-0.86] | 0.71 | 0.75 |
| 4 | 2.0/4.0 | r^{2 }≥ 0.8 | HNB | 0.72 [0.64-0.81] | 0.77 [0.71-0.88] | 0.70 | 0.76 |
| 4 | 2.0/4.0 | r^{2 }≥ 0.8 | NB | 0.72 [0.64-0.81] | 0.76 [0.72-0.86] | 0.71 | 0.75 |
| 5 | 2.0/4.0 | r^{2 }≥ 0.6 | HNB | 0.82 [0.77-0.83] | 0.89 [0.83-0.92] | 0.73 | 0.80 |
| 5 | 2.0/4.0 | r^{2 }≥ 0.6 | NB | 0.78 [0.73-0.80] | 0.84 [0.77-0.85] | 0.76 | 0.83 |
| 5 | 2.0/4.0 | r^{2 }≥ 0.8 | HNB | 0.82 [0.78-0.83] | 0.88 [0.84-0.90] | 0.76 | 0.86 |
| 5 | 2.0/4.0 | r^{2 }≥ 0.8 | NB | 0.78 [0.73-0.80] | 0.84 [0.77-0.85] | 0.76 | 0.83 |
| 6 | 2.0/4.0 | r^{2 }≥ 0.6 | HNB | 0.77 [0.73-0.80] | 0.85 [0.83-0.87] | 0.71 | 0.79 |
| 6 | 2.0/4.0 | r^{2 }≥ 0.6 | NB | 0.75 [0.68-0.77] | 0.80 [0.76-0.82] | 0.66 | 0.71 |
| 6 | 2.0/4.0 | r^{2 }≥ 0.8 | HNB | 0.73 [0.67-0.77] | 0.80 [0.79-0.82] | 0.65 | 0.72 |
| 6 | 2.0/4.0 | r^{2 }≥ 0.8 | NB | 0.75 [0.68-0.77] | 0.79 [0.76-0.82] | 0.66 | 0.71 |
| 7 | 3.0/6.0 | r^{2 }≥ 0.6 | HNB | 0.83 [0.81-0.83] | 0.91 [0.87-0.93] | 0.76 | 0.84 |
| 7 | 3.0/6.0 | r^{2 }≥ 0.6 | NB | 0.80 [0.77-0.83] | 0.85 [0.83-0.88] | 0.81 | 0.87 |
| 7 | 3.0/6.0 | r^{2 }≥ 0.8 | HNB | 0.83 [0.80-0.86] | 0.94 [0.93-0.94] | 0.82 | 0.91 |
| 7 | 3.0/6.0 | r^{2 }≥ 0.8 | NB | 0.80 [0.77-0.83] | 0.85 [0.83-0.88] | 0.81 | 0.87 |
| 8 | 3.0/6.0 | r^{2 }≥ 0.6 | HNB | 0.83 [0.78-0.87] | 0.91 [0.89-0.94] | 0.78 | 0.83 |
| 8 | 3.0/6.0 | r^{2 }≥ 0.6 | NB | 0.82 [0.80-0.86] | 0.87 [0.82-0.94] | 0.81 | 0.85 |
| 8 | 3.0/6.0 | r^{2 }≥ 0.8 | HNB | 0.82 [0.77-0.86] | 0.90 [0.85-0.94] | 0.78 | 0.86 |
| 8 | 3.0/6.0 | r^{2 }≥ 0.8 | NB | 0.82 [0.78-0.86] | 0.87 [0.82-0.94] | 0.81 | 0.85 |
| 9 | 3.0/6.0 | r^{2 }≥ 0.6 | HNB | 0.92 [0.87-0.93] | 0.96 [0.94-0.98] | 0.86 | 0.92 |
| 9 | 3.0/6.0 | r^{2 }≥ 0.6 | NB | 0.83 [0.83-0.87] | 0.92 [0.92-0.95] | 0.84 | 0.86 |
| 9 | 3.0/6.0 | r^{2 }≥ 0.8 | HNB | 0.87 [0.87-0.92] | 0.96 [0.93-0.97] | 0.89 | 0.92 |
| 9 | 3.0/6.0 | r^{2 }≥ 0.8 | NB | 0.83 [0.83-0.87] | 0.92 [0.92-0.95] | 0.84 | 0.86 |

CA, median Classification Accuracy [25th-75th percentiles of the distribution]; AUC, median Area Under the Curve [25th-75th percentiles of the distribution]. The 25th-75th percentiles are reported for results deriving from 10 Folds CV.

Majority Classifier CA and AUC for 10 Folds CV and Independent test sets: 0.50

No significant variation in terms of CA and AUC has been observed as a function of the different genotype relative risks imposed for data simulation (p > 0.05); thus, CA and AUC estimates from different simulations have been pooled and used for evaluating the differences in classification performances between HNB and NB.

Results show that the median CA and AUC obtained by the HNB over the single results are higher than those reached by the standard NB for both LD thresholds evaluated. The one-tailed Wilcoxon signed rank test has been used to assess the significance of these differences.

Results from the Wilcoxon signed rank test showed that:

• The distribution of the AUC values estimated by the HNB over the complete set of simulations was significantly higher than the corresponding distribution of AUC estimated by the standard NB when r^{2 }≥ 0.8 was imposed as threshold for defining LD-regions (AUC from 10 Folds CV: p < 0.05; AUC from independent replication set: p < 0.05).

• The HNB algorithm reached CA and AUC estimates significantly higher than those obtained by the majority classifier:

○ by comparing the distribution of CA and AUC obtained by the HNB with those generated by the majority classifier on the corresponding folds (maj. CA = 0.50, maj. AUC = 0.50) for each screening set according to both LD thresholds (p < 0.01);

○ by comparing the distribution of CA and AUC estimated by the HNB over the 9 independent test sets with the corresponding distribution of CA and AUC obtained by the majority classifier (maj. CA = 0.50, maj. AUC = 0.50) on the corresponding test set according to both LD thresholds (p < 0.01).
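A minimal illustration of the paired test statistic used above (our own sketch, with names of our choosing): it returns the sum W+ of the ranks of the positive differences, which is then compared against the null distribution of the signed rank statistic to obtain the one-tailed p-value.

```python
def wilcoxon_signed_rank(x, y):
    # One-tailed Wilcoxon signed rank statistic for paired samples
    # (H1: x tends to be larger than y). Zero differences are dropped;
    # tied |differences| receive average ranks. Larger W+ favours H1.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    return sum(r for d, r in zip(diffs, ranks) if d > 0)
```

In practice the paired fold-wise CA (or AUC) values of HNB and NB would be passed as `x` and `y`; a ready-made implementation with exact p-values is available in standard statistical libraries.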

Hierarchical Naïve Bayes for Type 1 and Type 2 Diabetes prediction

The HNB algorithm has been evaluated on two real genome-wide datasets aimed at identifying the genetic bases of T1D and T2D respectively. The analyzed datasets have been generated by the WTCCC. Data pre-processing and LD-based SNPs grouping (imposing r^{2 }≥ 0.8 as threshold) have been performed as reported in the Methods section; SNPs that did not fall within conserved regions have been used as covariates.

The generalization performances of the proposed approach and of the NB have been estimated i) by 10 Folds CV performed on each screening set and ii) by learning the models on the whole screening set and then testing the CA and AUC on the two corresponding replication cohorts.

Results are reported in Table

Results obtained on the T1D and T2D datasets

| **Study** | **Model** | **CA (10 Folds CV)** | **AUC (10 Folds CV)** | **CA (Independent Test)** | **AUC (Independent Test)** |
|---|---|---|---|---|---|
| T1D | HNB | 0.70 [0.67-0.73] | 0.80 [0.78-0.82] | 0.71 | 0.79 |
| T1D | NB | 0.70 [0.67-0.72] | 0.79 [0.76-0.81] | 0.68 | 0.78 |
| T2D | HNB | 0.83 [0.81-0.85] | 0.92 [0.89-0.93] | 0.57 | 0.57 |
| T2D | NB | 0.81 [0.80-0.84] | 0.90 [0.89-0.92] | 0.55 | 0.56 |

CA, median Classification Accuracy [25th-75th percentiles of the distribution]; AUC, median Area Under the Curve [25th-75th percentiles of the distribution]. The 25th-75th percentiles are reported for results deriving from 10 Folds CV. The described parameters are reported for blocks defined using an LD threshold of r^{2 }≥ 0.8.

Majority Classifier CA and AUC for 10 Folds CV and Independent test sets: 0.50.

Discussion

The proposed approach, called Hierarchical Naïve Bayes, represents an innovative strategy for exploiting correlated information from genome-wide datasets. The human genome is typically characterized by local patterns of strong LD that define blocks of SNPs showing low recombination rates. In this scenario, the HNB represents a suitable way of deriving genetic information with respect to standard multivariate models, since it is able to account for the structural correlations existing between markers. These characteristics allow the HNB to overcome the limitations of the standard NB algorithm, whose over-simplistic assumptions of independence between attributes are rarely respected in the context of GWAS data. The results obtained by the HNB on both simulated and real datasets show that the proposed approach achieves classification performances generally higher than or equal to those obtained by multivariate models based on the standard NB. In particular, the HNB represents a suitable alternative to the standard NB when analyzing genome regions characterized by strong LD, a typical condition in which the independence assumptions of the NB are dramatically violated.

Notably, even if the results obtained by the 10 Folds CV procedures are prone to overfitting for both simulated and real datasets, since the preliminary filtering phase heavily exploits the screening set for feature selection and block determination, the results obtained on the replication sets are free from these limitations. These observations confirm that accounting for the structural correlation between markers offers a substantial gain in generalization capability with respect to the standard NB approach, which does not consider the structure of the human genome.

Many research groups used the publicly available WTCCC datasets and private case/control cohorts on T1D and T2D for testing the predictive performances of several machine learning algorithms. As an example, Wei

Lower CA and AUC estimates are generally obtained from the T2D datasets. As an example, van Hoek

The performances obtained by the HNB on the independent test sets are generally comparable to those reported by the other research groups for both T1D and T2D cited in this section. However, a direct comparison of the performances obtained by the HNB on the real datasets with those obtained by previously published approaches on the same WTCCC cohorts can hardly be interpreted, due to differences in the sample size of the control population (the analyzed dataset does not include the 1958 British Birth Cohort of controls, generated by the WTCCC and commonly used as reference population along with the UK Blood Service cohort). Further, the lack of covariates for T1D and T2D cases and controls (e.g., BMI, smoking history, etc.) limited the possibility to integrate genetic and clinical information, a key step for a deeper comprehension of complex trait diseases. Thus, the availability of GWAS datasets complete with detailed phenotype and clinical information will allow testing the HNB in a more realistic scenario. Besides these considerations, the proposed approach can be further improved to take into account also functional correlations, by using, for example, the Tree Augmented Naïve Bayes (TAN) approach on the latent variables, thus combining the two strategies

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AM carried out the molecular genetic studies, performed the statistical analysis and drafted the paper. NB carried out software tools development and integrations, participated in study design and drafted the manuscript. RB conceived the study, participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We are grateful to Andrea Demartini for the implementation of the HNB algorithm. The research was supported by the Innovative Medicine Initiative under grant agreement n° IMI/115006 (the SUMMIT consortium).

This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from

This article has been published as part of