Department of Bioinformatics and Computational Biology, UT MD Anderson Cancer Center, Houston, Texas 77030, USA

Department of Biostatistics, UT MD Anderson Cancer Center, Houston, Texas 77030, USA

Department of Computer Science, Rice University, Houston, Texas 77005, USA

Abstract

Background

Considerable progress has been made on algorithms for learning the structure of Bayesian networks from data. Model averaging by using bootstrap replicates with feature selection by thresholding is a widely used solution for learning features with high confidence. Yet, in the context of limited data many questions remain unanswered. What scoring functions are most effective for model averaging? Does the bias arising from the discreteness of the bootstrap significantly affect learning performance? Is it better to pick the single best network or to average multiple networks learnt from each bootstrap resample? How should thresholds for learning statistically significant features be selected?

Results

The best scoring functions are the Dirichlet Prior Scoring Metric (DPSM) with small values of the prior weight λ.

Conclusions

For small data sets, our approach performs significantly better than previously published methods.

Introduction

In the last ten years there has been a great deal of research published on learning Bayesian networks from data.

In this paper, our goal is to understand the interactions between the choice of bootstrap method and scoring function, the size of the training data, the number of bootstrap replicates, and the choice of feature-selection threshold on averaged models. Unlike linear regression, for which closed-form expressions for the performance of model-averaging procedures can be derived, no such analysis is available for Bayesian networks, so we proceed empirically.

The variance in the structure of learned networks comes in part from the data, which is usually a single realization from some unknown true model. This is depicted schematically in the figure below.

Bayesian network search space

**Bayesian network search space** The search for Bayesian networks guided by data. Contours represent likelihood of structures given data.

When the amount of data is small relative to the size of the model, the posterior probability of the model given the data is not sharply peaked around a single model. Rather than commit to a single structure, we can estimate the confidence of individual structural features, such as edges, by counting how often they appear among high-scoring networks,

or we can compute Bayesian posterior probabilities of features,

$$P(f \mid D) = \sum_{G} f(G)\, P(G \mid D),$$

where we weight each feature $f$ in proportion to the posterior probability of the structure $G$ in which it occurs.

For large networks, the direct estimation of these posterior probabilities is intractable, since the number of possible structures grows super-exponentially with the number of variables.

A different approach to estimating feature probabilities is to sample different sections of the structure space by perturbing the data set, for example by bootstrap resampling.

The practical application of this algorithm, particularly in the context of limited data, requires answers to many questions. Specifically:

• How many bootstrap resamples are required for the feature probability estimates to converge?

• Which scoring functions are most effective for model averaging?

• Does the bias in bagged Bayesian networks arising from the discreteness of the bootstrap method significantly affect learning performance?

• How do we select thresholds for learning statistically significant features over averaged models? Does the choice of threshold depend on the scoring function, problem domain, sample size, and bootstrap resampling technique?

• Should a single high-scoring structure be learned from each bootstrap replicate (as shown in Algorithm 1), or an averaged ensemble of high-scoring structures (double averaging)?

Bayesian model averaging strategies

**Bayesian model averaging strategies** On the left is the standard protocol, which bags the single best network learnt from each resample; on the right is double averaging, which first averages multiple high-scoring networks within each resample.

The paper is organized as follows. In the next section, we review the problem of learning the structure of Bayesian networks from data, and describe the state of the art in scoring functions, search algorithms, and bootstrapping techniques. In the following section, we present an extensive experimental study that explores the space of model-averaging strategies in the context of the ALARM network. We then describe our technique for automatically selecting a probability threshold for feature selection. Our threshold selection criterion is designed to minimize the number of false positive features in bagged models. We conclude with a characterization of an effective model-averaging strategy in contexts with limited data.

Learning the structure of Bayesian networks

Bayesian networks

Bayesian networks are a compact, graphical representation of multivariate joint probability distributions on a set of variables. The graph structure reflects conditional independence relationships between the random variables. Formally, a Bayesian network for a set of random variables $\mathbf{X} = \{X_1, \dots, X_n\}$ is a pair $(G, \Theta)$, where $G$ is a directed acyclic graph whose nodes correspond to the variables $X_1, \dots, X_n$, and $\Theta$ is the set of conditional probability distributions $P(X_i \mid \mathbf{U}_i)$, with $\mathbf{U}_i$ denoting the parents of $X_i$ in the graph. The network represents the joint distribution through the factorization

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathbf{U}_i).$$
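The factorization above can be made concrete with a minimal sketch. The two-node network (Rain → WetGrass) and its probability tables below are illustrative assumptions, not taken from the paper.

```python
# Toy Bayesian network Rain -> WetGrass: the joint distribution is the
# product of P(Rain) and the conditional table P(WetGrass | Rain).
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {
    (True, True): 0.9,   # P(Wet=T | Rain=T)
    (False, True): 0.1,
    (True, False): 0.3,  # P(Wet=T | Rain=F)
    (False, False): 0.7,
}

def joint(rain: bool, wet: bool) -> float:
    """P(Rain, Wet) = P(Rain) * P(Wet | Rain), per the factorization."""
    return p_rain[rain] * p_wet_given_rain[(wet, rain)]

# The joint distribution must sum to 1 over all assignments.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
```

With more variables the same product runs over every node and its parents, which is what makes the representation compact.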

Learning Bayesian networks from data: scoring functions

The problem of learning a network from data is normally posed as an optimization problem: given a data set $D$ drawn from an unknown distribution $\mathbf{P}(\mathbf{X})$, find a network $G$ that maximizes a scoring function with respect to $D$.

The prior over network structures has two components: a discrete probability distribution over graph structures and, for each structure, a continuous distribution over its parameters.

The posterior probability of a structure given the data follows from Bayes' rule, $P(G \mid D) \propto P(D \mid G)\,P(G)$, where the marginal likelihood $P(D \mid G)$ integrates over the parameter prior.

The scoring function obtained under uniform parameter priors is the K2 (Uniform Prior Scoring) metric:

$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!,$$

where $r_i$ is the number of values of $X_i$, $q_i$ is the number of configurations of the parents of $X_i$, $N_{ijk}$ is the number of cases in $D$ in which $X_i$ takes its $k^{th}$ value while its parents take their $j^{th}$ configuration, and $N_{ij} = \sum_k N_{ijk}$.

A generalization of the above scoring metric uses Dirichlet priors on the parameters, yielding the Dirichlet Prior Scoring Metric (DPSM).

In addition, the parameter modularity assumption is made: the prior on the parameters for node $X_i$ depends only on $X_i$ and its parents, not on the rest of the structure.

The parameter prior for each node $X_i$ is a Dirichlet distribution with hyperparameters $\alpha_{ijk}$.

The scoring function with these assumptions reduces to

$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})},$$

where $\alpha_{ijk}$ is the Dirichlet hyperparameter for the $k^{th}$ value of $X_i$ under the $j^{th}$ parent configuration, and $\alpha_{ij} = \sum_k \alpha_{ijk}$.
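As a hedged illustration of how a Dirichlet-prior score of this form is computed, the following sketch evaluates the log marginal likelihood of a single node from its counts. The uniform, BDeu-style split of the hyperparameter mass `alpha` is an assumption; the paper's DPSM prior may allocate it differently.

```python
# Log marginal likelihood of one node under a Dirichlet prior: the
# local term of a Bayesian Dirichlet score, computed in log space.
from math import lgamma

def log_local_score(counts, alpha=1.0):
    """counts[j][k] = N_ijk: cases where the parents take their j-th
    configuration and the node takes its k-th value."""
    q = len(counts)           # number of parent configurations
    r = len(counts[0])        # number of node values
    a_jk = alpha / (q * r)    # per-cell pseudocount alpha_ijk (assumed split)
    a_j = alpha / q           # per-configuration total alpha_ij
    score = 0.0
    for row in counts:
        n_j = sum(row)
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for n_jk in row:
            score += lgamma(a_jk + n_jk) - lgamma(a_jk)
    return score

# Binary node with one binary parent: counts for two parent configs.
s = log_local_score([[8, 2], [3, 7]])
```

Because the full score is a product over nodes, the log score of a network is just the sum of such local terms, which is what makes greedy local search efficient.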

There is another family of scoring functions based on maximizing the likelihood function of the data given the structure, penalized by model complexity,

$$\mathrm{Score}(G : D) = \log P(D \mid \hat{\Theta}, G) - \mathrm{Pen}(G, N),$$

where $\hat{\Theta}$ is the maximum-likelihood estimate of the parameters. The log-likelihood term decomposes over the nodes as

$$\log P(D \mid \hat{\Theta}, G) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}},$$

where $r_i$, $q_i$, $N_{ijk}$, and $N_{ij}$ are defined as before for node $X_i$.

The MDL, or Minimum Description Length, scoring function has its origins in data compression and information theory. The best model for a data set is the one that minimizes the combined description length of the model (graph plus parameters) and of the data encoded with the help of the model.

Encoding the graph requires, for each node $X_i$, describing the size and identity of its parent set $\mathbf{U}_i$ among the variables in $\mathbf{X}$. The description length of $\Theta$ is the product of the total number of independent parameters for specifying each conditional probability table, $\sum_{i=1}^{n} (r_i - 1)\, q_i$, and the number of bits, $\tfrac{1}{2}\log N$, needed to encode each of these parameters for a data set of size $N$.

Finally, the number of bits needed to encode the data given the model is the negative log-likelihood, $-\log P(D \mid \hat{\Theta}, G)$.

Thus

$$\mathrm{MDL}(G : D) = \mathrm{DL}(G) + \frac{\log N}{2} \sum_{i=1}^{n} (r_i - 1)\, q_i \;-\; \log P(D \mid \hat{\Theta}, G).$$

Minimizing the MDL scoring function yields a model with the shortest description length. If we ignore the description length of the graph, minimizing MDL is equivalent to maximizing the BIC score.
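The MDL/BIC trade-off can be sketched for a single node as the maximum-likelihood log-likelihood minus the parameter-description penalty. The counts below are illustrative, not from the paper.

```python
# BIC/MDL-style local score: log-likelihood at the ML estimates minus
# the parameter-description penalty (r - 1) * q * log(N) / 2.
from math import log

def bic_local(counts):
    """counts[j][k] = N_ijk; returns log-likelihood minus MDL penalty."""
    q, r = len(counts), len(counts[0])
    n_total = sum(sum(row) for row in counts)
    loglik = 0.0
    for row in counts:
        n_j = sum(row)
        for n_jk in row:
            if n_jk > 0:
                loglik += n_jk * log(n_jk / n_j)  # N_ijk * log(theta-hat)
    penalty = (r - 1) * q * log(n_total) / 2
    return loglik - penalty
```

Here `bic_local([[3, 2], [2, 3]])` (a binary node with one weakly informative binary parent) scores below `bic_local([[5, 5]])` (the same node with no parent), illustrating how the penalty discourages spurious edges.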

To summarize, scoring functions belong to one of two families. Bayesian scoring functions, of which K2, DPSM, and BDe are examples, represent the posterior probability of a structure given the data; penalized-likelihood scoring functions, such as MDL/BIC and AIC, trade the fit to the data against model complexity.

Learning Bayesian networks from data: local search algorithms

The problem of learning a network which maximizes a given scoring function is NP-hard, so heuristic local search procedures are used in practice.

The local search algorithm used in our experiments is greedy hill-climbing with randomized restarts, combined with Friedman’s sparse candidate algorithm. The search is restarted 25 times; another 25 searches are then performed, each starting from a network chosen randomly from all of those seen during the first 25 searches. Network features are determined from the highest scoring network(s) visited over all searches.
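The restart pattern described above can be sketched as follows. The additive toy score and four-node edge space are stand-ins: a real search would score candidate structures against data with one of the metrics above and enforce acyclicity.

```python
# Greedy hill-climbing with random restarts over edge sets (sketch).
import random

NODES = range(4)
CANDIDATE_EDGES = [(a, b) for a in NODES for b in NODES if a != b]
TRUE_EDGES = {(0, 1), (1, 2), (2, 3)}  # hypothetical target structure

def score(edges):
    # Toy additive score: +1 per true edge recovered, -1 per spurious edge.
    return len(edges & TRUE_EDGES) - len(edges - TRUE_EDGES)

def hill_climb(start):
    current = set(start)
    while True:
        # Neighbors differ from the current edge set by one add/remove.
        best = max((current ^ {e} for e in CANDIDATE_EDGES), key=score)
        if score(best) <= score(current):
            return current  # local maximum reached
        current = best

def restart_search(n_restarts=25, seed=0):
    rng = random.Random(seed)
    best = set()
    for _ in range(n_restarts):
        start = {e for e in CANDIDATE_EDGES if rng.random() < 0.2}
        found = hill_climb(start)
        if score(found) > score(best):
            best = found
    return best
```

With a real, non-additive score the restarts matter: individual climbs can stall in local maxima, and the best network over all restarts is reported.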

Bootstrap and Bayesian bootstrap

The bootstrap is a general tool for assessing the statistical accuracy of models learned from data. Suppose we have a data set $D = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$, where each $\mathbf{x}^{(j)}$ is a vector of size $n$ containing values for the variables $X_1, \dots, X_n$. An ordinary bootstrap resample is formed by drawing $N$ examples from $D$ uniformly at random with replacement.

Each example in $D$ therefore appears in a resample an integer number of times (possibly zero), and these integer weights enter the scoring function through the counts $N_{ijk}$.

An alternative approach is to use the Bayesian bootstrap. Here we assume that the data are drawn from some unknown distribution $\mathbf{P}(\mathbf{X})$, and that we have no specific priors on that distribution. The uninformative prior on the weight assigned to each example then yields, for each resample, continuous weights drawn from a uniform Dirichlet distribution, in place of the integer counts of the ordinary bootstrap.
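The contrast between the two resampling schemes can be sketched as follows; drawing Dirichlet(1, …, 1) weights by normalizing exponential draws is a standard construction, not code from the paper.

```python
# Ordinary bootstrap weights are integer multinomial counts; Bayesian
# bootstrap weights are continuous Dirichlet(1, ..., 1) draws, which
# avoid the discreteness discussed above.
import random

def ordinary_bootstrap_weights(n, rng):
    counts = [0] * n
    for _ in range(n):                 # draw n examples with replacement
        counts[rng.randrange(n)] += 1
    return counts                      # non-negative integers summing to n

def bayesian_bootstrap_weights(n, rng):
    # Dirichlet(1, ..., 1) via normalized exponential draws.
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]  # positive reals summing to 1

rng = random.Random(42)
w_ord = ordinary_bootstrap_weights(5, rng)
w_bay = bayesian_bootstrap_weights(5, rng)
```

Under the Bayesian bootstrap every example receives a strictly positive weight, so no observation is ever entirely absent from a resample.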

We now turn to the question of the number of bootstrap resamples needed to form accurate estimates of structure feature probabilities.

Estimating a bound on the number of bootstrap resamples

The probability of a feature is estimated as the fraction of the $B$ bootstrap resamples whose learnt network contains it. Treating the per-resample indicators as independent Bernoulli trials with success probability $p$, the estimate has variance

$$\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{B} \qquad (3)$$

where $B$ is the number of bootstrap resamples.

For feature probabilities close to 0 or 1, the variance is small and relatively few resamples suffice.

In practice, we are most concerned about feature probabilities close to the cutoff threshold. For feature probabilities near $p = 0.5$, the variance is largest, and reducing it to acceptable levels requires hundreds to thousands of resamples.

Even for a large number of resamples, estimates for such borderline features therefore converge slowly.

In practice, our resample feature probabilities are sometimes not exactly 0 or 1, so we expect the above to be slight overestimates.
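Under the binomial variance $p(1-p)/B$ of equation 3, the number of resamples needed for a target variance can be computed directly; this sketch simply inverts that formula, and the target values are illustrative.

```python
# Number of bootstrap resamples B needed so that the binomial variance
# p * (1 - p) / B of a bagged feature probability falls below a target.
from math import ceil

def resamples_needed(p, target_variance):
    """Smallest B with p * (1 - p) / B <= target_variance."""
    return ceil(p * (1 - p) / target_variance)

# A feature probability near 0.5 is the worst case.
b_tight = resamples_needed(0.5, 1e-4)  # 2500 resamples
b_loose = resamples_needed(0.5, 1e-3)  # 250 resamples
```

The worst case $p = 0.5$ is exactly the regime near the cutoff threshold that matters most for feature selection.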

ALARM network simulation study

To understand how different bagging methods, scoring metrics, and their parameters affect learning performance, we conducted an extensive simulation study using the well known ALARM network. This network contains 37 discrete nodes, with two to four values per node, and 46 edges. We drew a master dataset containing 100,000 data points from the ALARM network’s joint probability distribution.

To evaluate the effect of different strategies and parameters on learning performance, we extract a subset from that master dataset and apply the network learning strategy being tested. In all cases, the core learning strategy is the greedy hill-climbing algorithm with random restarts described above. We compute posterior probabilities of network features, such as edges, by determining how often each such feature occurs in the best single network learnt from each bootstrap resample. A slight variation is double averaging, which first computes feature probabilities for each resample by (Bayesian) averaging of multiple high-scoring networks learnt for that resample, and then averages these probabilities across resamples. In either case, the bagging process produces estimates of posterior probabilities for each network feature of interest.
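The bagged edge-frequency protocol just described can be sketched as follows. `learn_best_network` stands in for the hill-climbing search, and `toy_learner` is a hypothetical placeholder so the example runs end to end.

```python
# Bagging sketch: learn one best network per bootstrap resample, then
# average edge indicators across resamples to get edge frequencies.
import random
from collections import Counter

def bagged_edge_frequencies(data, learn_best_network, n_resamples, rng):
    """Return {edge: fraction of resamples whose network contains it}."""
    counts = Counter()
    for _ in range(n_resamples):
        # Ordinary bootstrap: draw |data| examples with replacement.
        resample = [rng.choice(data) for _ in data]
        for edge in learn_best_network(resample):
            counts[edge] += 1
    return {edge: c / n_resamples for edge, c in counts.items()}

def toy_learner(resample):
    # Hypothetical stand-in: declare edge A -> B when the two fields
    # agree in most examples of the resample.
    agree = sum(1 for a, b in resample if a == b)
    return {("A", "B")} if agree > len(resample) / 2 else set()

data = [(1, 1), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
freqs = bagged_edge_frequencies(data, toy_learner, 200, random.Random(7))
```

Double averaging would replace the single-network indicator inside the loop with an average over several top-scoring networks for that resample.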

Number of bootstrap resamples required

To estimate the number of bootstrap resamples required, we performed 60 independent Bayesian bootstrap bagging analyses of a single dataset containing 250 data points. Each analysis used identical data, but a different seed for the pseudo-random number generator. To illustrate the convergence of the different bootstrap runs, we choose the edge between PULMEMBOLUS and PAP as a representative example. The figure below shows the convergence of these estimates. To reduce the variance of the estimates to acceptable levels (10^–3 to 10^–4), we need between 500 and 2,500 resamples. Most experimental studies in the literature use 200 bootstrap samples, a figure suggested in earlier work, which is fewer than our analysis indicates is needed.

Bootstrap convergence

**Bootstrap convergence** This figure shows the convergence of 60 independent bootstrap estimates for the probability of the edge between PULMEMBOLUS and PAP in the ALARM network being present for a single dataset containing 250 data points. The horizontal axis (log scale) is the number of bootstrap resamples. The left-hand vertical axis is the cumulative proportion of top networks that include the edge between PULMEMBOLUS and PAP. Each of the 60 independent bootstrap bagging analyses is plotted as a faint grey line. Overlapping lines are darker in proportion to the number of lines involved. The average across all 60 estimates is plotted as a solid blue line, and the dotted blue lines indicate plus or minus three estimated standard deviations from the average. The right-hand vertical axis (log scale) is the variance of the estimates. The solid red line is the sample variance. The dotted red line is the theoretical estimate according to equation 3.

Effect of bagging on learning performance

The figure below shows the effect of bagging on learning performance.

Effect of bagging on learning performance

**Effect of bagging on learning performance** This figure shows the effect of bagging on learning performance for the ALARM network. The horizontal axis is the number of false positives and the vertical axis is the number of false negatives. The red curve is computed by simple averaging of the top networks learnt by the search procedure from the original data, while the yellow curve is derived from the same networks weighted by their probability with respect to the data. The blue curve is the simple average of the single best network learnt by the search procedure for each of 2500 bootstrap resamples. The results are averages over 60 independent data sets each containing 250 data points.

Effect of double bagging

The learning procedure, with its multiple restarts, often produces several top networks with very similar high scores. Since the learning procedure accounts for by far the greatest computational effort, it might be possible to improve learning performance at little extra cost by deriving a composite probability for each feature from several of these top-scoring networks from each resample. We compared bagging using only the best network to bagging of feature probabilities estimated by a simple average and by a Bayesian average of multiple, distinct high-scoring networks from each bootstrap resample. This averaging over the top networks occurs prior to bagging, as shown on the right-hand side of the Bayesian model averaging strategies figure.

Effect of double bagging

**Effect of double bagging** This figure shows the effect of double bagging on learning performance for the ALARM network. The horizontal axis is the average number of bad edges detected (false positives), and the vertical axis is the average number of good edges not detected (false negatives). The blue curve is the simple average of the single best network learnt by the search procedure for each of the 2500 bootstrap resamples. The magenta curve is the simple average over 2500 bootstrap resamples of the simple average of the best ten networks learnt by the search procedure for each resample. The green curve is the simple average over 2500 bootstrap resamples of the Bayesian average of the best networks learnt by the search procedure for each resample. Each curve is the average of 60 independent data sets each containing 250 data points.

Effect of bias-correction strategies on model accuracy

To determine the impact of the bias introduced by discrete bootstrap resampling, and the effect of changes to the scoring function that account for it, we compared ordinary bagging, ordinary bagging with a bias-corrected scoring function, and Bayesian bagging.

Effect of bias-correction strategies

**Effect of bias-correction strategies** This figure shows the effect of bias correction strategies on learning performance for the ALARM network. In the upper panel, the horizontal axis is the number of false positives and the vertical axis is the number of false negatives. In the lower panel, the horizontal axis is the probability threshold above which an edge is deemed to be present, and the vertical axis is the total number of errors (false positives plus false negatives). The results are averages over 60 independent data sets each containing 250 data points. The DPSM scoring metric with small λ was used.

Below we offer a possible explanation for this result. Ordinary bagging and Bayesian bagging appear to have very similar performance, with Bayesian bagging outperforming ordinary bagging in terms of the number of structural errors for thresholds above 0.72. Since this threshold range is likely to be the range of greatest interest, especially in the context of limited data, we believe that Bayesian bagging is a better choice for structure learning of Bayesian networks.

Effect of local scoring function

The figure below shows the effect of the scoring metric on learning performance.

Effect of local scoring function

**Effect of local scoring function** This figure shows the effect of the scoring metric on learning performance for the ALARM network. The horizontal axis is the number of false positives and the vertical axis is the number of false negatives. The results are averages over 60 independent data sets each containing either 250 (top panel) or 1500 (bottom panel) data points. Bayesian bagging was used.

Effect of λ for the DPSM scoring function

The figure below shows the effect of the DPSM prior weight λ on learning performance.

Effect of λ for the DPSM scoring function

**Effect of λ for DPSM scoring function** This figure shows the effect of λ on learning performance for the ALARM network using the DPSM scoring function.

As λ decreases, learning performance improves.

Our experimental results suggest that when data is limited, promiscuous scoring functions (such as DPSM with small λ) combined with model averaging yield the best learning performance.

Effect of training set size

The figure below shows the effect of training set size on learning performance.

Effect of training set size

**Effect of training set size** This figure shows the effect of training set size on learning performance for the ALARM network. The horizontal axis is the number of false positives and the vertical axis is the number of false negatives. The results are averages over 60 independent data sets at each training set size. The DPSM scoring metric with small λ was used.

Comparison to other methods

The figure below compares the learning performance of our method with previously published approaches.

Comparison to other methods

**Comparison to other methods** This figure compares the learning performance of our method (in blue) with the methods presented by Friedman and Koller (the top two panels on the left-hand side of the corresponding figure in their paper).

Threshold selection using permutation testing

**Threshold selection using permutation testing** Estimating the number of incorrect edges from the number of edges found in permuted data sets for the ALARM network. The horizontal axis is the frequency threshold above which an edge is said to be present. The vertical axis is the number of edges. The dashed blue line is the number of correct edges. The dashed red line is the number of incorrect edges. The faint purple lines are the number of edges found for each of 60 independent permuted data sets. Overlapping lines are darker in proportion to the number of lines involved. The solid red line is the average number of edges found in the permuted data sets.

Threshold selection using permutation testing

When we attempt to learn a network structure using the above bagging procedure, the edge frequencies obtained will range between 0 and 1. We need a global threshold on edge frequencies to determine whether or not an edge is to be included in the final model. The optimal threshold to use is not constant, but depends on the specific dataset being analyzed. Consequently, in practice, when learning a network model from data, both the true model and the optimal threshold to use are unknown. We need a method to estimate the threshold that minimizes the number of structural errors made (false positives and false negatives).

Given randomly permuted (null) data, our bagging process will compute edge frequencies. Our hypothesis is that the edge frequencies of incorrect edges are similar to the edge frequencies obtained from randomly permuted data. Consequently, the number of edges found in a randomly permuted data set that are above a particular threshold will be indicative of the number of incorrect edges above that threshold in the network learnt from the unpermuted data. Our permutation test determines the likely number of incorrect edges above a particular threshold by averaging the number of edges above that threshold across 60 random permutations of the data. If the data is in fact random, this is clearly a reasonable estimate. For non-random data, it appears that the number of edges obtained from the permuted data overestimates the number of incorrect edges (see the permutation-testing figure above), making the resulting threshold estimates conservative.

To select a threshold, we compute the difference between the total number of edges found and the estimated number of incorrect edges found by the permutation test. A simple threshold, however, does not discriminate between edges that are far above the threshold (and very unlikely to be incorrect) and those just above it (and much more likely to be incorrect). Consequently, for an edge frequency threshold $f$, let $T_f$ denote the total number of edges found above $f$ and $P_f$ the average number of edges found above $f$ in the permuted data sets. We take the confidence of an edge found at threshold $f$ to be the local fraction of additional edges estimated to be correct,

$$c(f) \approx \frac{\Delta (T_f - P_f)}{\Delta T_f},$$

with the added constraint that it monotonically decreases as the threshold decreases. Since neither the total number of edges nor the number of permuted edges is a smooth function of the threshold, we approximate a smooth result by averaging finite differences over a range of widths. The figure below illustrates the resulting confidence curve.
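A simplified sketch of permutation-based confidence estimation follows. It uses a cumulative ratio of permuted to real edge counts with a monotonicity constraint, rather than finite-difference smoothing, so it should be read as illustrative only; the edge frequencies below are made up.

```python
# Permutation-based edge confidence (simplified cumulative variant):
# compare how many edges exceed a threshold in the real run versus,
# on average, in permuted runs, forcing monotonicity in the threshold.
def edge_confidences(real_freqs, permuted_freq_lists, thresholds):
    """Return {threshold: estimated confidence an edge above it is real}."""
    n_perm = len(permuted_freq_lists)
    conf, best_so_far = {}, 0.0
    for t in sorted(thresholds):  # ascending, so confidence never drops
        real_above = sum(1 for f in real_freqs if f >= t)
        perm_above = sum(
            1 for fs in permuted_freq_lists for f in fs if f >= t
        ) / n_perm
        c = (1.0 - perm_above / real_above) if real_above else 1.0
        best_so_far = max(best_so_far, c, 0.0)
        conf[t] = best_so_far
    return conf

real_freqs = [0.9, 0.85, 0.8, 0.3, 0.2]       # from the real data
permuted = [[0.3, 0.25], [0.2], [0.35, 0.1]]  # one list per permuted run
conf = edge_confidences(real_freqs, permuted, [0.1, 0.5, 0.8])
```

High thresholds, where no permuted edges survive, receive confidence 1; lower thresholds are discounted by the rate at which permuted (null) edges appear.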

Edge confidence estimation for the ALARM network

**Edge confidence estimation for the ALARM network** The horizontal axis is the frequency threshold above which an edge is said to be present. The left-hand vertical axis is the number of edges. The solid blue line is the total number of edges. The faint purple lines are the number of edges found for each of 60 independent permuted data sets. The solid red line is the average number of edges found in the permuted data sets. The solid yellow line is the difference between the total number of edges and the average number of edges in the permuted datasets. The right-hand vertical axis is the confidence that an edge is real. The solid green line is the confidence associated with an edge found at the threshold concerned.

INSURANCE network simulation study

To determine the generality of the above results we performed a similar study using the INSURANCE network. As was the case for the ALARM network, learning performance using DPSM improves as λ decreases.

Application to biological dataset

The figure below shows the consensus Bayesian network obtained by applying our method to a glioblastoma (GBM) dataset.

Consensus Bayesian Network of a Glioblastoma Dataset

**Consensus Bayesian Network of a Glioblastoma Dataset** The consensus Bayesian network obtained by applying our method to the glioblastoma dataset. The color of each edge indicates the frequency with which that edge occurred in the bagged resamples.

We use bagged gene shaving to reduce the gene expression data to a small number of cluster signatures, which, together with clinical covariates, serve as the variables for network learning.

According to our analysis, survival is linked most closely with two clinical covariates, age and gender, and two gene clusters, a cluster of interferon induced genes and a cluster of growth inhibition genes.

Conclusions and future work

We explored the space of model-averaging strategies in contexts with limited data, with the goal of robustly learning large Bayesian networks. Our main results are as follows.

1. Between 500 and 2,500 bootstrap resamples are required to reduce the variance of feature probability estimates to acceptable levels; the commonly used figure of 200 resamples is too few.

2. Bagging over bootstrap resamples yields more accurate structures than averaging the top networks learnt from the original data alone.

3. Double averaging, which combines multiple high-scoring networks from each resample before bagging, improves performance at little additional computational cost.

4. The bias arising from the discreteness of the bootstrap has little practical effect; Bayesian bagging performs slightly better than ordinary bagging at the high thresholds of greatest interest.

5. DPSM with small λ is the most effective scoring function when data is limited.

6. Probability thresholds for feature selection can be chosen automatically by permutation testing, which estimates the number of incorrect edges above a given threshold.

One of the open questions is a theoretical characterization of the relationship between model complexity and sample size for Bayesian networks. In this paper, we characterized this relationship empirically for two well-known networks of moderate complexity. Our model averaging strategy learns more accurate models than other approaches when the amount of data is limited relative to the complexity of the model being learned. In future work we plan to explore more networks in biological applications, and refine our protocol for learning high confidence models with limited data.

List of abbreviations used

AIC: Akaike Information Criterion; BDe: Bayesian Dirichlet metric; BIC: Bayesian Information Criterion; DPSM: Dirichlet Prior Scoring Metric; GBM: Glioblastoma Multiforme; MCMC: Markov Chain Monte Carlo; MDL: Minimum Description Length; TCGA: The Cancer Genome Atlas; UPSM: Uniform Prior Scoring Metric.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BB and DS conceived of the study, and participated in its design and coordination and helped to draft the manuscript. KA consulted on statistical considerations. All authors read and approved the final manuscript.

Acknowledgements

This research was supported in part by NIH/NCI grant R01 CA133996 and the Gulf Coast Center for Computational Cancer Research.
