Department of Information Engineering, University of Padova, 35131 Padova, Italy

Abstract

Background

Multifactorial diseases arise from complex patterns of interaction between a set of genetic traits and the environment. To fully capture the genetic biomarkers that jointly explain the heritability component of a disease, all SNPs from a genome-wide association study should therefore be analyzed simultaneously.

Results

In this paper, we present Bag of Naïve Bayes (BoNB), an algorithm for genetic biomarker selection and subject classification from the simultaneous analysis of genome-wide SNP data. BoNB is based on the Naïve Bayes classification framework, enriched by three main features: bootstrap aggregating of an ensemble of Naïve Bayes classifiers, a novel strategy for ranking and selecting the attributes used by each classifier in the ensemble, and a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process. BoNB is tested on the Wellcome Trust Case-Control study on Type 1 Diabetes and its performance is compared with that of both a standard Naïve Bayes algorithm and HyperLASSO, a state-of-the-art penalized logistic regression algorithm for simultaneous genome-wide data analysis.

Conclusions

The significantly higher classification accuracy obtained by BoNB and the significance of the biomarkers identified from the Type 1 Diabetes dataset demonstrate the effectiveness of BoNB as an algorithm for both classification and biomarker selection from genome-wide SNP data.

Availability

Source code of the BoNB algorithm is released under the GNU General Public Licence and is available at

Background

In the past few years, the hereditary component of complex multifactorial diseases has started to be explored through the novel paradigm of Genome-Wide Association Studies (GWASs). A GWAS searches for patterns of genetic variation, in the form of Single Nucleotide Polymorphisms (SNPs), between a population of affected individuals (cases) and a healthy population (controls). The objective of a GWAS is twofold: on the one hand, one searches for the set of SNPs that best explains the hereditary component of the disease (

Further applications of GWASs include searching for the genetic predisposition to complex traits, such as height

The extremely large numbers involved in a GWAS (millions of SNPs measured for thousands of individuals) have led the vast majority of studies to rely upon single, univariate SNP association tests

In the literature, the few approaches to multivariate SNP analysis on a genome-wide scale mainly rely on two methodological frameworks: penalized logistic regression

All methods for the simultaneous analysis of the whole SNP set have to cope with

In this work, we present Bag of Naïve Bayes (BoNB), an algorithm for classification and genetic biomarker selection from the simultaneous analysis of genome-wide SNP data. Our algorithm is based on Naïve Bayes (NB) classification

Three strategies are exploited in BoNB to tailor the Naïve Bayes framework to genome-wide SNP data analysis: (a) a bagging of Naïve Bayes classifiers, to improve the robustness of the predictions, (b) a novel strategy for ranking and selecting the attributes used by each bagged classifier, to enforce attribute independence, and (c) a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process.

BoNB is tested on the WTCCC case-control study on Type 1 Diabetes

Results

Algorithm

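Given a dataset **X**, consisting of n observations (subjects) described by p attributes (SNPs), and a vector **Y** of class labels, one for each observation (case/control), a Naïve Bayes Classifier (NBC) assigns to each observation the class with the highest posterior probability given the attribute values:

$$\hat{c} = \arg\max_{c_k} \Pr(c_k \mid x_1 \ldots x_p) = \arg\max_{c_k} \Pr(c_k) \prod_{i=1}^{p} \Pr(x_i \mid c_k) \qquad (1)$$

where x_{1} . . . x_{p} are the values of the p attributes for the observation and c_{k} ranges over the classes.

The classification rule of Equation (1) states that the probability of a subject being in class c_{k}, given the attribute values x_{1} . . . x_{p}, is proportional to the prior probability Pr(c_{k}) of the class times the product of the conditional probabilities Pr(x_{i} | c_{k}) of each attribute given the class; the factorization holds under the assumption that the attributes x_{1} . . . x_{p} are conditionally independent given the class.

For categorical attributes, such as SNPs, probability distributions Pr(c_{k}) and Pr(x_{i} | c_{k}) are estimated from the training data, as detailed in Methods.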
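As an illustration of the Naïve Bayes classification rule for SNP genotypes, the following minimal Python sketch estimates the class priors and the per-SNP conditional probability tables and evaluates the posterior in log space. It is not the authors' C++ implementation: the function names, the 0/1/2 genotype coding with `None` for missing values and the Laplace smoothing constant `alpha` are assumptions made for the example.

```python
import math

def train_nbc(X, y, n_values=3, alpha=1.0):
    """Estimate log-priors and log conditional probability tables.

    X: list of genotype vectors coded 0/1/2 (None marks a missing genotype).
    y: list of class labels. alpha is an assumed Laplace smoothing constant.
    """
    n, p = len(X), len(X[0])
    log_prior, log_lik = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / n)
        tables = []
        for i in range(p):
            counts = [alpha] * n_values
            for x in rows:
                if x[i] is not None:  # missing genotypes do not contribute
                    counts[x[i]] += 1
            total = sum(counts)
            tables.append([math.log(v / total) for v in counts])
        log_lik[c] = tables
    return log_prior, log_lik

def predict_proba(model, x):
    """Posterior class probabilities for one subject, computed in log space."""
    log_prior, log_lik = model
    scores = {}
    for c, lp in log_prior.items():
        s = lp
        for i, v in enumerate(x):
            if v is not None:  # skip missing genotypes at prediction time
                s += log_lik[c][i][v]
        scores[c] = s
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {c: math.exp(s - m) / z for c, s in scores.items()}

# Toy usage: two SNPs, two cases (label 1) and two controls (label 0).
model = train_nbc([[0, 2], [0, 1], [2, 2], [2, 1]], [1, 1, 0, 0])
posterior = predict_proba(model, [0, 2])
```

Working with sums of logarithms rather than products of probabilities keeps the computation stable when many attributes are used.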

Our algorithm, Bag of Naïve Bayes (BoNB), consists of an ensemble of Naïve Bayes Classifiers, trained on GWAS data with the procedure known as Bootstrap Aggregating, or Bagging.

Given a training dataset **X**, the Bagging procedure starts by computing a set {**X**^{(1)} . . . **X**^{(B)}} of B Bootstrap replicates of **X**, each one obtained by sampling the observations of **X** with replacement. A Naïve Bayes Classifier NBC^{(b)} is then trained on each Bootstrap sample **X**^{(b)}. Class probabilities of unseen subjects, drawn from an independent test set, are then obtained by averaging the output class probabilities computed by each NBC^{(b)} (see the schematics of the BoNB algorithm).
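The Bagging step can be sketched in a few lines of Python. The sketch assumes that each Bootstrap replicate has the same size as the training set, which is the usual convention but is not stated explicitly here; the function names are hypothetical and each classifier is represented by a plain function returning the probability of the case class.

```python
import random

def bootstrap_replicate(n, rng):
    """Return the indices of one Bootstrap sample and of its Out-of-Bag set."""
    sample = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
    chosen = set(sample)
    oob = [i for i in range(n) if i not in chosen]  # observations never drawn
    return sample, oob

def bagged_probability(classifiers, x):
    """Average the case probability returned by each classifier for subject x."""
    return sum(clf(x) for clf in classifiers) / len(classifiers)

rng = random.Random(0)
sample, oob = bootstrap_replicate(10, rng)
```

The Out-of-Bag indices are returned together with the sample because they are the observations that a classifier trained on the replicate has never seen.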

**Schematics of the BoNB algorithm**: B Bootstrap replicates {**X**^{(1)} . . . **X**^{(B)}} are drawn from a GWAS training dataset **X**.

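Given the binary nature of the case/control classification problem and the frequent imbalance between the number of cases and controls in a GWAS, we decided to rely on the Matthews Correlation Coefficient (MCC) as a measure of classification performance:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.

The MCC is often preferred to standard classification accuracy when the two classes are unbalanced.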
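The difference between MCC and accuracy on unbalanced data can be checked with a short Python sketch. The example applies both measures to a majority classifier on a dataset with the class sizes of the WTCCC T1D study used later in the paper (1963 cases, 2938 controls); returning an MCC of 0 when the denominator vanishes is a common convention and is an assumption here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the confusion matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified subjects."""
    return (tp + tn) / (tp + tn + fp + fn)

# Majority classifier: every subject is predicted as a control.
majority_acc = accuracy(tp=0, tn=2938, fp=0, fn=1963)
majority_mcc = mcc(tp=0, tn=2938, fp=0, fn=1963)
```

The majority classifier reaches an accuracy of about 0.6 while carrying no information on the disease, whereas its MCC is zero.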

The conditional independence assumption underlying the Naïve Bayes classification rule (Equation (1)) is unlikely to hold if all the SNPs of a GWAS are exploited as attributes, because of genetic linkage. Moreover, computing Equation (1) for the whole SNP set can be computationally cumbersome and can lead to numerical and overfitting problems.

We thus developed a procedure for selecting a good set of independent SNPs for each NBC^{(b)}: the procedure consists of a ranking step followed by an attribute selection step. In the ranking step, each SNP is given a score, according to its ability in discriminating the subjects in the bootstrap sample **X**^{(b)}. The score is thus defined as the MCC of a Naïve Bayes Classifier, trained and tested on the same set **X**^{(b)}, with the SNP as a single attribute (a precise mathematical description of the Naïve Bayes attribute score is given in Methods). SNPs are then ranked in decreasing order of score.

In the attribute selection step, SNPs are iteratively extracted from the top of the ranked list and added as attributes of NBC^{(b)}. Each time a SNP is included as an attribute, the procedure removes from the ranked list all the SNPs that are both close to it on the genome (distance < 1 Mb) and correlated with it (r^{2} > θ, where r^{2} is the squared correlation between the two SNPs and θ is a user-defined threshold).

Rather than including one SNP at a time, uncorrelated SNPs are added in groups of exponentially increasing size, starting from one SNP and doubling the size at each new addition. New SNPs are added as long as the generalization ability of NBC^{(b) }increases: to estimate the generalization ability, we test each NBC^{(b) }on the corresponding Out-of-Bag sample OOB^{(b)}, consisting of all the observations left out from **X **when sampling **X**^{(b)}, and measure the MCC of the prediction. The exponential increase in the number of added attributes allows BoNB to reach the adequate size for the attribute set of each NBC in a logarithmic number of steps.
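The attribute selection step can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the released implementation: `position`, `r2` and `evaluate` are hypothetical inputs standing for the SNP map, the squared correlation between two SNPs and the MCC measured on the Out-of-Bag sample, and the baseline is taken to be the MCC of the classifier with no attributes.

```python
def select_attributes(ranked, position, r2, evaluate, theta=0.1, max_dist=1_000_000):
    """Greedy selection of uncorrelated SNPs in groups of doubling size.

    ranked:   SNP identifiers in decreasing order of score.
    position: dict mapping a SNP to (chromosome, base-pair position).
    r2:       function (snp_a, snp_b) -> squared correlation.
    evaluate: function (list of SNPs) -> MCC on the Out-of-Bag sample.
    """
    def close(a, b):
        (chr_a, pos_a), (chr_b, pos_b) = position[a], position[b]
        return chr_a == chr_b and abs(pos_a - pos_b) < max_dist

    candidates = list(ranked)
    selected = []
    best = evaluate(selected)  # assumed baseline: classifier with no attributes
    group_size = 1
    while candidates:
        group = []
        while candidates and len(group) < group_size:
            snp = candidates.pop(0)
            group.append(snp)
            # Drop the SNPs that are both close to and correlated with snp.
            candidates = [c for c in candidates
                          if not (close(c, snp) and r2(c, snp) > theta)]
        score = evaluate(selected + group)
        if score <= best:  # stop as soon as the Out-of-Bag MCC stops increasing
            break
        selected += group
        best = score
        group_size *= 2  # groups of 1, 2, 4, ... SNPs
    return selected
```

Because the group size doubles at each iteration, a classifier that ends up with m attributes is evaluated on its Out-of-Bag sample only about log2(m) times.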

The attribute selection procedure, iterated for the

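For the second objective of GWASs, biomarker selection, we adapted for BoNB a procedure originally designed for the Random Forests bagged classifier: for each SNP and for each NBC^{(b)} that includes it as an attribute, we randomly permute the genotype of the SNP among the subjects, test NBC^{(b)} on its corresponding OOB^{(b)} and record the relative decrease in MCC due to the permutation. We define such a measure as the Marginal Utility (MU) of the SNP.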

For each SNP, the permutation procedure returns a list of values of MU, one value for each NBC that included the SNP: we test for MUs significantly greater than zero with a one-tailed Wilcoxon signed rank test, selecting as biomarkers the SNPs for which the p-value of the test is lower than 0.05.
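The two quantities involved in biomarker selection can be sketched in Python with the standard library only. The first function computes the Marginal Utility as the relative decrease in MCC; the second is an exact one-tailed Wilcoxon signed-rank test written for this example (zeros discarded, average ranks for ties), and may differ in its handling of ties and zeros from the routine actually used by the authors. All names are hypothetical.

```python
def marginal_utility(mcc_original, mcc_permuted):
    """Relative decrease in MCC caused by permuting the genotype of one SNP."""
    if mcc_original == 0:
        return 0.0  # assumed convention for an uninformative classifier
    return (mcc_original - mcc_permuted) / mcc_original

def signed_rank_p_greater(values):
    """Exact one-tailed p-value of the signed-rank test for values > 0."""
    x = [v for v in values if v != 0]  # zeros carry no sign information
    n = len(x)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(x[i]))
    ranks2 = [0] * n  # twice the average rank, so that all ranks are integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(x[order[j + 1]]) == abs(x[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks2, x) if v > 0)
    # Null distribution of W+: each rank enters the sum with probability 1/2.
    counts = {0: 1}
    for r in ranks2:
        nxt = {}
        for s, c in counts.items():
            nxt[s] = nxt.get(s, 0) + c
            nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    tail = sum(c for s, c in counts.items() if s >= w_plus)
    return tail / 2 ** n

p_value = signed_rank_p_greater([0.1, 0.2, 0.3, 0.4, 0.5])
```

A SNP would then be selected as a biomarker when the p-value computed on its list of MU values falls below 0.05.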

The following pseudocode summarizes the training phase and the biomarker selection phase of the BoNB algorithm:

BoNB(**X**, B, θ)

// Training

1 **for** b = 1 **to** B

2 [**X**^{(b)}, OOB^{(b)}] = bootstrap replicate from **X**

3 **for** s = 1 **to** p

4 Compute the contingency table for SNP s in **X**^{(b)}

5 Compute the Naïve Bayes attribute score of s

6 L^{(b)} = list of SNPs in decreasing order of score

7 Initialize NBC^{(b)} as a Naïve Bayes Classifier with no attributes

8 Extract m^{(b)} = 1 SNPs from the top of L^{(b)}, excluding from future additions all SNPs at distance < 1 Mb and with r^{2} > θ from the extracted ones

9 **while** MCC of NBC^{(b)}, tested on OOB^{(b)} with the new attributes, increases

10 Add the new attributes to NBC^{(b)}

11 Update m^{(b)} = 2 m^{(b)}

12 Extract m^{(b)} SNPs from the top of L^{(b)}, excluding from future additions all SNPs at distance < 1 Mb and with r^{2} > θ from the extracted ones

// Biomarker selection

13 **for** s **in** all SNPs selected by at least 5% of the NBCs

14 **for** b **in** all NBCs that selected s

15 Permute the genotype of s and test NBC^{(b)} on OOB^{(b)}

16 Record the Marginal Utility (MU) of s

17 Select as biomarkers the SNPs with MU significantly larger than zero.

For analyzing the computational complexity of BoNB, one can start by noting that, for each

The attribute selection step (lines 7-12), executed for each ^{2 }between SNPs and test of NBC^{(b) }on OOB^{(b)}. If we define ^{(b)}, and its computational complexity is thus expressed by the following summation:

where

For the complexity of the biomarker selection phase of BoNB, we define

Testing

BoNB was tested on the WTCCC case-control study on Type 1 Diabetes

We excluded a small number of subjects according to the sample exclusion lists provided by the WTCCC. In addition, we excluded a SNP if (i) it is on the SNP exclusion list provided by the WTCCC; (ii) it has a poor cluster plot as defined by the WTCCC. The resulting dataset consists of 458376 SNPs, measured for 1963 cases and 2938 controls.

The number B of Bootstrap replicates was set to 200 and the threshold θ on r^{2} for uncorrelated SNPs was set to 0.1. Please see Methods for an analysis of how performance is affected by variations of the parameters B and θ.

Independent train-test set pairs for assessing the classification performance of BoNB were obtained by repeatedly sub-sampling at random 90% of the dataset for training and 10% for testing. The procedure was iterated 10 times and classification performance was assessed with the MCC of the predictions on the test sets. The list of selected biomarkers, on the other hand, was computed on the whole dataset.
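The repeated random sub-sampling scheme can be sketched as follows. The sketch assumes a plain, unstratified split, since the text does not say whether the case/control proportions were preserved in each split; the function name and the seed are hypothetical.

```python
import random

def random_subsampling(n, n_repeats=10, train_fraction=0.9, seed=0):
    """Yield (train, test) index lists for repeated random sub-sampling."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n)
    for _ in range(n_repeats):
        indices = list(range(n))
        rng.shuffle(indices)  # a fresh random permutation for every repeat
        yield indices[:n_train], indices[n_train:]

# 1963 cases + 2938 controls = 4901 subjects in the filtered dataset.
splits = list(random_subsampling(4901))
```

Unlike the folds of a cross-validation, the test sets of different repeats may overlap, since each repeat draws its own random partition.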

Classification performance was compared with that of a standard Naïve Bayes Classifier, trained on all the SNPs that reached the significance threshold of 5 × 10^{-7} on a χ^{2} test of association with a general genetic model, and with that of HyperLASSO, a logistic regression method for the simultaneous analysis of all SNPs in a genome-wide association study.

On the experimental dataset, BoNB reached an MCC of 0.55 ± 0.03 (mean ± standard deviation), significantly higher than those reached by both the standard Naïve Bayes Classifier (0.31 ± 0.05, Wilcoxon signed-rank p-value 0.002) and HyperLASSO (0.46 ± 0.03, p-value 0.002). The box plots of MCC and classification accuracy for the three methods are shown in the figure below.

**Box plots of MCC (left panel) and classification accuracy (right panel) of the standard Naïve Bayes classifier, HyperLASSO and BoNB on ten random subsamplings of the WTCCC T1D dataset**. The dashed lines represent the classification performance of a majority classifier.

To further analyze the behaviour of the three methods at different levels of the output function (

**Precision vs Recall curve (left panel) and Receiver Operating Characteristic (right panel) of the standard Naïve Bayes classifier, HyperLASSO and BoNB on a random subsampling of the WTCCC T1D dataset**.

For biomarker selection, we ran BoNB on the whole dataset and compared its results with the biomarkers identified by HyperLASSO and by the general 2df χ^{2} test of association.

SNPs selected as attributes for at least 5% of the Naïve Bayes Classifiers by BoNB on the WTCCC T1D dataset, with

| SNP | Chr | Gene | Relation | %NBCs | MU (median) |
|---|---|---|---|---|---|
| **rs6679677** | **1** | **RSBN1** | **downstream** | **7** | **0.033** |
| rs9266774 | 6 | MICA | upstream | 5.5 | 0.011 |
| **rs805301** | **6** | **BAT3** | **intron** | **17.5** | **0.043** |
| **rs492899** | **6** | **SKIV2L** | **intron** | **8.5** | **0.025** |
| **rs9273363** | **6** | **HLA-DQB1** | **downstream** | **100** | **0.835** |
| **rs9275418** | **6** | **HLA-DQB1** | **upstream** | **80** | **0.160** |
| **rs6936863** | **6** | **HLA-DQA2** | **upstream** | **8** | **0.08** |
| rs9784858 | 6 | TAP2 | intron | 5 | 0.008 |
| **rs3101942** | **6** | **LOC100294145** | **exon** | **21.5** | **0.045** |

First column: dbSNP RS ID. Second column: SNP chromosome. Third and fourth column: annotated gene and relation with the SNP. Fifth column: percentage of Naïve Bayes Classifiers that included the SNP as attribute. Sixth column: median of the marginal utility of the SNP. SNPs selected as genetic biomarkers by the permutation procedure are marked in bold.

Compared to the 394 SNPs that reached the significance level on the 2

HyperLASSO selected 8 SNPs, all in the MHC region of chromosome 6: 4 of the SNPs are among the biomarkers selected by BoNB, thus suggesting a certain coherence between the two algorithms and providing further confidence on the identified biomarkers.

Implementation

BoNB is implemented in C++ and relies only on standard libraries, thus being fully portable across operating systems. On the WTCCC case-control study on Type 1 Diabetes, BoNB takes approximately 50 minutes for training 200 NBCs and selecting the biomarkers on a 3.00 GHz Intel Xeon Processor E5450. A careful allocation strategy makes BoNB occupy around 600 MB of RAM for the WTCCC dataset, allowing it to be easily run on a desktop computer.

Discussion

In this paper, we presented a novel algorithm for classification and biomarker selection from genome-wide SNP data. The algorithm, Bag of Naïve Bayes (BoNB), is based on the Naïve Bayes classification framework, enriched by three main features: bootstrap aggregating of an ensemble of Naïve Bayes classifiers, a novel strategy for ranking and selecting the attributes used by each classifier and a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process.

The effectiveness of BoNB was demonstrated by applying it to the WTCCC case-control study on Type 1 Diabetes: BoNB outperforms two state-of-the-art algorithms, namely a Naïve Bayes Classifier and HyperLASSO, in terms of classification performance, and all the genetic biomarkers identified by BoNB are meaningful for Type 1 Diabetes.

Learning an ensemble of classifiers from a bootstrap sample of the original dataset provides BoNB with two main advantages: on the one hand, it guarantees a higher generalization ability by increasing the stability of the learning process

Two features of the Naïve Bayes Classifier, chosen as the building block of the BoNB algorithm, make it rather appealing for genome-wide data analysis: on the one hand, conditional probability table analysis does not assume a pre-specified model of genetic effect; on the other hand, missing values are seamlessly handled by both the learning and the classification procedure.

The idea of bagging Naïve Bayes classifiers has already been proposed in the Random Naïve Bayes algorithm of Prinzie and Van der Poel

Our approach to attribute selection, consisting of a univariate ranking step followed by a multivariate selection step, favours informative attributes without the need to pre-select fixed sets of attributes or to define cut-offs on the strength of the association with the disease: attributes are added to the classifiers as long as their combined effect on the generalization ability increases.

To provide the reader with further insight into the Naïve Bayes attribute score, exploited in BoNB for univariate attribute ranking, we studied it against the 2df χ^{2} statistic of association for all the SNPs in the Wellcome Trust Case-Control Study on Type 1 Diabetes (see the figure below).

**Naïve Bayes attribute score vs χ^{2} statistic for all SNPs in the WTCCC T1D dataset**.

As can be seen from the figure, the two measures are in a strong monotonic relation for the majority of SNPs; when used as ranking criteria, they are thus expected to return similar ranked lists.

Major exceptions are the points plotted along the two axes of the figure. Along the vertical axis lie the SNPs for which the χ^{2} test cannot be run, because at least one of the entries in the SNP contingency table has fewer than 5 elements. Along the horizontal axis, on the other hand, lie the SNPs that, when used to train a Naïve Bayes Classifier, lead to a majority classifier.

Analyzing the extreme behaviours of the two scoring measures provides the key for understanding the main difference between them: while χ^{2} is designed to capture a difference in SNP frequencies from the frequencies expected under no association between the SNP and the disease, the Naïve Bayes attribute score is meant to select good univariate predictors.

For this reason, the Naïve Bayes attribute score is not very sensitive to small variations of contingency table entries with few or zero elements and thus does not require a minimum number of elements per entry. On the other hand, it does not reward SNPs with even large differences in frequencies from the case of no association if one of the two classes is consistently over-represented, since such SNPs cannot be effective as univariate predictors in the dataset under analysis.
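The contrast between the two measures can be reproduced on a made-up contingency table in which controls outnumber cases for every genotype. The Python sketch below computes the 2df χ^{2} statistic and the class that a single-SNP Naïve Bayes Classifier would predict for each genotype; it assumes maximum-likelihood estimates without smoothing, under which comparing the two posteriors reduces to comparing the case and control counts of the genotype. The counts and function names are hypothetical.

```python
def chi2_2df(case_counts, control_counts):
    """Pearson chi-square statistic of a 2 x 3 genotype contingency table."""
    n_ca, n_co = sum(case_counts), sum(control_counts)
    n = n_ca + n_co
    stat = 0.0
    for a, b in zip(case_counts, control_counts):
        column_sum = a + b
        for observed, row_sum in ((a, n_ca), (b, n_co)):
            expected = column_sum * row_sum / n
            stat += (observed - expected) ** 2 / expected
    return stat

def predicted_classes(case_counts, control_counts):
    """Class predicted for each genotype by an unsmoothed single-SNP NBC."""
    return ["case" if a > b else "control"
            for a, b in zip(case_counts, control_counts)]

cases = [10, 90, 400]        # hypothetical counts for genotypes 0, 1, 2
controls = [100, 400, 1000]
statistic = chi2_2df(cases, controls)
predictions = predicted_classes(cases, controls)
```

On this table the χ^{2} statistic is well above 5.99, the critical value for 2 degrees of freedom at the 0.05 level, yet every genotype is assigned to the control class, so the SNP behaves as a majority classifier and its attribute score is zero.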

Conclusion

The analysis of genome-wide SNP data for multifactorial diseases mainly suffers from two intertwined problems: on the one hand, multifactorial diseases are caused by complex patterns of interaction between multiple genetic traits and the environment; on the other hand, genetic linkage confounds the search for genetic biomarkers, because of the non-random association between the true genetic causes and the SNPs in genomic regions close to them. The algorithm we proposed, Bag of Naïve Bayes, proved effective in tackling both of these problems: the simultaneous analysis of all SNPs on a genome-wide scale can capture the sets of SNPs with the strongest joint effect on the disease, while the novel procedure for attribute ranking and selection enforces attribute independence, thus discriminating causal SNPs from nearby weaker signals.

Apart from genome-wide association studies, BoNB can also be applied, with minor modifications, to the analysis of SNP data from case/control exome sequencing experiments

Methods

Naïve Bayes Algorithm

The Naïve Bayes Algorithm

for each input attribute _{i}_{ij }_{k }

In addition, the algorithm must estimate the parameters of the

where |

The only tunable parameter of the Naïve Bayes Algorithm is the

Naïve Bayes attribute score

In the BoNB algorithm, SNPs are ranked as candidate attributes of NBC^{(b)} according to their ability to discriminate the subjects in the bootstrap sample **X**^{(b)}; this is estimated for each SNP as the MCC of a Naïve Bayes Classifier, trained and tested on **X**^{(b)}, with the SNP as the single attribute. The rationale for such a measure is to give a higher rank to SNPs that guarantee a lower training error on **X**^{(b)} when used as attributes.
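The verbal definition of the score can be turned into a short Python sketch that works directly on the genotype counts of cases and controls. It is an illustration under two assumptions that are not taken from the paper: probabilities are estimated by maximum likelihood without smoothing, so that a genotype is predicted as case exactly when it counts more cases than controls, and ties are resolved in favour of the control class.

```python
import math

def attribute_score(case_counts, control_counts):
    """MCC of a single-SNP classifier trained and tested on the same sample."""
    tp = tn = fp = fn = 0
    for a, b in zip(case_counts, control_counts):
        if a > b:   # subjects with this genotype are predicted as cases
            tp += a
            fp += b
        else:       # subjects with this genotype are predicted as controls
            fn += a
            tn += b
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: a majority classifier receives a score of 0.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Hypothetical counts for genotypes 0, 1 and 2.
score = attribute_score([300, 500, 200], [100, 400, 500])
```

Since the score depends on the data only through the six entries of the contingency table, it can be computed for every SNP in a single pass over the bootstrap sample.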

For a more formal definition of the score, we start by defining the elements of the contingency table for the SNP as in Table

and the three corresponding indicator functions _{0}, _{1 }and _{2}, returning 1 if the inequality holds and 0 otherwise. The three inequalities determine the behaviour of the Naïve Bayes Classifier in classifying unseen subjects according to their genotype, by comparing the posterior probabilities of the two classes.

When the same set **X**^{(b) }is used both for training and for testing, the Naïve Bayes attribute score

where XOR (·,·) is the boolean operator returning 1 if exactly one of the operands is equal to 1, and 0 otherwise.

Contingency table of a SNP, with the genotype codes 0 for the homozygous pair of minor alleles, 1 for the heterozygous pair and 2 for the homozygous pair of major alleles.

genotype

0

1

2

cases

_{ca}

controls

_{co}

_{0}

_{1}

_{2}

Each element in the contingency table reports the number of subjects with the corresponding genotype and phenotype. _{0}, _{1 }and _{2 }are the column sums, _{ca }_{co }

HyperLASSO algorithm

The HyperLASSO algorithm

Like all other penalized logistic regression approaches, the HyperLASSO algorithm has a tunable parameter

The HyperLASSO algorithm has an element of stochasticity, namely the order in which model parameters are updated in the model selection procedure, and is designed to carry out multiple runs with different orderings and report the best scoring model. For our analysis, we set the number of runs to 10, resulting in approximately 60 hours for processing the entire WTCCC T1D dataset on a 3.00 GHz Intel Xeon Processor E5450.

Effect of parameters variation on the performance of BoNB

The BoNB algorithm exposes two parameters to the user: the number of Bootstrap replicates and Naïve Bayes Classifiers, B, and the threshold θ on the squared correlation r^{2} between SNPs.

Figure ^{-4}).

**Box plots of the MCC obtained by BoNB on ten random subsamplings of the WTCCC T1D dataset, for B = 200 and θ ranging from 0.02 to 0.5 (left panel) and for θ = 0.1 and B ranging from 50 to 500 (right panel)**.

Concerning the number of Bootstrap replicates

Given the consistency among the results for higher values of

List of abbreviations

SNP: Single Nucleotide Polymorphism; GWAS: Genome-Wide Association Study; NBC: Naïve Bayes Classifier; OOB: Out-of-Bag; MCC: Matthews Correlation Coefficient; MU: Marginal Utility.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FS designed and implemented the BoNB algorithm, carried out its performance analysis and drafted the manuscript. ET carried out the performance analysis of the algorithms for the comparison. BDC participated in the design of both the algorithm and the performance analysis and helped to draft the manuscript. GMT helped to design the performance analysis and to draft the manuscript. CC coordinated the study and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The research was supported by the European Union's Seventh Framework Programme (FP7/2007-2013) for the Innovative Medicine Initiative under grant agreement n. IMI/115006 (the SUMMIT consortium).

This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from

This article has been published as part of