School of Mathematical Sciences, Monash University, Victoria 3800, Australia

Abstract

Background

Many problems in bioinformatics involve classification based on features such as sequence, structure or morphology. Given multiple classifiers, two crucial questions arise: how does their performance compare, and how can they best be combined to produce a better classifier? A classifier can be evaluated in terms of sensitivity and specificity using benchmark, or gold standard, data, that is, data for which the true classification is known. However, a gold standard is not always available. Here we demonstrate that a Bayesian model for comparing medical diagnostics without a gold standard can be successfully applied in the bioinformatics domain, to genomic scale data sets. We present a new implementation, which unlike previous implementations is applicable to any number of classifiers. We apply this model, for the first time, to the problem of finding the globally optimal logical combination of classifiers.

Results

We compared three classifiers of protein subcellular localisation, and evaluated our estimates of sensitivity and specificity against estimates obtained using a gold standard. The method overestimated sensitivity and specificity with only a small discrepancy, and correctly ranked the classifiers. Diagnostic tests for swine flu were then compared on a small data set. Lastly, classifiers for a genome-wide association study of macular degeneration with 541094 SNPs were analysed. In all cases, run times were feasible, and results precise. The optimal logical combination of classifiers was also determined for all three data sets. Code and data are available from

Conclusions

The examples demonstrate that the methods are suitable for both small and large data sets, applicable to a wide range of bioinformatics classification problems, and robust to dependence between classifiers. In all three test cases, the globally optimal logical combination of the classifiers was found to be their union, according to three of the four ranking criteria. We propose as a general rule of thumb that the union of classifiers will be close to optimal.

Background

A common problem arising in bioinformatics is to classify experimental results into two categories, according to the presence or absence of some property of interest. Such classification problems are widespread and diverse. For example, in genome-wide association studies (GWAS), genotype data is collected at SNP or other marker loci across the entire genome for a large number of cases and controls (as in

Despite the diversity of these applications, each can be reduced to binary classification, e.g. disease-associated or non-associated, trafficked or not, infected or parasite-free, etc. Whatever the specific context, it is important to quantify the accuracy of the classifier, in order to assess the level of confidence one should place in its predictions, and so that alternative classifiers can be compared and ranked. Classifiers can be assessed in terms of their sensitivity (the proportion of true positives that are correctly classified) and specificity (the proportion of true negatives that are correctly classified).
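When a gold standard is available, sensitivity and specificity can be computed directly from the confusion counts. A minimal sketch in Python (the function and variable names here are ours, for illustration only):

```python
def sensitivity_specificity(predicted, truth):
    """Compute sensitivity and specificity of binary predictions
    against gold-standard labels (1 = positive, 0 = negative)."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Example with six individuals: two of three true positives detected,
# two of three true negatives correctly rejected.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0])
```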

Ideally, there will be a gold standard data set, for which the true classifications are known, against which each classifier can be evaluated. In practice, however, a gold standard is often unavailable.

The intuition here is that the extent to which competing classifiers agree or disagree provides information about the reliability of each classifier. In the absence of a gold standard, all that is known is the imperfect binary classifications, which can be organised into a matrix such as that shown in the table below.

| **Individual** | **Classifier 1** | **Classifier 2** | **…** | **Classifier K** |
|---|---|---|---|---|
| 1 | 0 | 1 | … | 0 |
| 2 | 1 | 0 | … | 1 |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| N | 1 | 1 | … | 1 |

A variety of techniques have been proposed in the medical statistics literature for comparing diagnostic tests in the absence of a gold standard. A spectrum of approaches has been developed to suit specific variations of the problem: some assume log-linear error models, for example, while others assume normally distributed errors. Approaches also differ in methodology, for example maximum likelihood versus Bayesian inference. The assumptions vary as well, with some approaches requiring data from two or more populations with different prevalences of the disease (for example

Once classifiers have been compared, the question naturally arises how to combine them to form a new classifier that is better than any of the constituents. A simple method is to take a consensus, that is, to classify an individual as positive if most of the component classifiers ‘vote’ for a positive classification, and classify an individual as negative otherwise. A weighted consensus, in which the vote of some classifiers counts more than others, is also possible. But what is the optimal way to combine classifiers? This problem has been extensively studied (see

Results and discussion

The model was loaded into WinBUGS, and run with three test data sets: protein subcellular localisation, to test performance in the presence of a gold standard; swine flu diagnostics, to test performance with a small data set; and classification of SNPs in macular degeneration, to test performance with a large data set. Test data are provided in Additional file

**Supplementary Material.** All supplementary material is contained in the file ‘comparing-binary-classifiers-180712-bmc-supp-v3.doc’.


Evaluation with gold standard data: classifying protein subcellular localisation

We first evaluated our method relative to a gold standard. Protein sequences for the Arabidopsis proteome were obtained in FASTA format from the website of

The output of each classifier was converted to a sequence of 0s and 1s, indicating which proteins were localised to the chloroplast region (1) and which were not (0). For each parameter of the model, summary statistics for its marginal posterior distribution were obtained (Additional file


**Protein sub-cellular localisation results.** Density plots of model variables for the chloroplast localisation data. Vertical lines show gold standard sensitivity, specificity or proportion. **A**, **B**, **C**: the sensitivity of the AA, DP and NCC classifiers, respectively. **D**, **E**, **F**: the specificity of the AA, DP and NCC classifiers, respectively. **G**: estimated proportion of proteins localised to chloroplasts.

Our inferred mean posterior sensitivities were typically greater than 2 standard deviations above gold standard estimates, but the latter were nevertheless within the range of values obtained. A similar statement applies to specificities, and the prevalence of chloroplast localisation. Importantly, the classifiers were ranked in the correct order of sensitivity and specificity. That is, Classifiers 2, 3 and 1 had increasing sensitivity and decreasing specificity (see Additional file

Application to a small data set - Swine Flu

We then tested our method on data for the diagnosis of swine flu in patients, where the data set is very small and no gold standard is available. The data contains the diagnosis of

Density plots were produced using the last 5000 iterations of the time series, as shown in the figure below. NPA (C_{1}) scores higher than NS (C_{2}) according to all four ranking criteria.


**Swine flu results.** Density plots of model variables for the swine flu data. **A:** Sensitivity of the NPA classifier. **B:** Sensitivity of the NS classifier. **C:** Specificity of the NPA classifier. **D:** Specificity of the NS classifier. **E:** Prevalence of the disease.

Application to a large data set - SNP classification

To test the method’s performance on a large data set, data from a genome-wide association study was analysed. This study identified SNPs associated with age-related macular degeneration, according to

The parameters of the model were again initialised to 0.5, and the model run for 10000 iterations. Summary statistics were produced for each parameter (Additional file

The summary statistics and density plots show smaller standard deviations than for the swine flu data, indicating greater confidence in the model parameter estimates, which can be attributed to the larger data set. Our method found that the PLINK classifier had significantly higher sensitivity than the other two classifiers, and slightly higher specificity as well. However, all three had high specificities, a consequence of classifying such a small proportion of the data as positive.

The difference between the results for the different thresholds is unexpected and informative. Our expectation was that using higher thresholds would increase the sensitivities of all three classifiers and decrease the specificities. Moreover, ideally the method should obtain roughly the same estimate of disease prevalence regardless of the thresholds, since the underlying population is the same. Instead, using the higher thresholds resulted in lower estimates of both sensitivity and specificity, but only slightly (compare tables and density plots in Additional file

Notably however,

Inference of the best combination of classifiers

The R-script described in the Methods was used to invoke the WinBUGS model from R (using R2WinBUGS), and the model was rerun for all three test cases for a burn-in of 1000 iterations. Then, for each case the model was run for a further 1000 iterations, and at every iteration an estimate of the sensitivity and specificity was calculated for all possible logical combinations of the classifiers. Only the last 500 iterations were used for the following analyses.

To determine which logical combination of the classifiers performed best, we applied four ranking criteria, based on (1) the product, (2) the sum of squares, (3) the sum of absolute values, and (4) the minimum of the sensitivity and specificity. Summary statistics for sensitivities and specificities of all logical combinations of the swine flu classifiers, and selected simple combinations of the chloroplast localisation and SNP classifiers, are shown in the tables below.

Chloroplast localisation data: selected logical combinations of the three classifiers.

| **2^{K}-bit code** | **Combination** | **Sens. mean** | **Sens. median** | **Sens. SD** | **Spec. mean** | **Spec. median** | **Spec. SD** |
|---|---|---|---|---|---|---|---|
| 2 | C_{1}∧C_{2}∧C_{3} | 0.311 | 0.307 | 0.094 | 1.000 | 1.000 | 0.000 |
| 4 | C_{2}∧C_{3} | 0.394 | 0.384 | 0.096 | 0.999 | 0.999 | 0.001 |
| 6 | C_{1}∧C_{3} | 0.553 | 0.544 | 0.108 | 0.997 | 0.997 | 0.001 |
| 18 | C_{1}∧C_{2} | 0.436 | 0.433 | 0.101 | 0.998 | 0.998 | 0.001 |
| 24 | At least two classifiers | 0.762 | 0.772 | 0.089 | 0.994 | 0.995 | 0.003 |
| 64 | C_{2}∨C_{3} | 0.867 | 0.870 | 0.058 | 0.932 | 0.933 | 0.021 |
| 96^{†} | C_{1}∨C_{3} | 0.934 | 0.939 | 0.037 | 0.894 | 0.895 | 0.025 |
| 120 | C_{1}∨C_{2} | 0.900 | 0.910 | 0.053 | 0.904 | 0.907 | 0.025 |
| 128^{‡} | C_{1}∨C_{2}∨C_{3} | 0.969 | 0.975 | 0.021 | 0.868 | 0.869 | 0.029 |

SD: standard deviation. ^{‡}Optimal combination using ranking criteria 1, 2 and 3. ^{†}Optimal combination using ranking criterion 4.

Swine flu data: all logical combinations of the two classifiers.

| **2^{K}-bit code** | **Combination** | **Sens. mean** | **Sens. median** | **Sens. SD** | **Spec. mean** | **Spec. median** | **Spec. SD** |
|---|---|---|---|---|---|---|---|
| 1 | All results are negative | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 |
| 2 | C_{1}∧C_{2} | 0.626 | 0.633 | 0.162 | 0.991 | 0.994 | 0.010 |
| 3 | ¬C_{1}∧C_{2} | 0.116 | 0.101 | 0.088 | 0.938 | 0.947 | 0.047 |
| 4 | C_{2} | 0.742 | 0.760 | 0.152 | 0.928 | 0.939 | 0.054 |
| 5 | C_{1}∧¬C_{2} | 0.214 | 0.201 | 0.127 | 0.879 | 0.885 | 0.072 |
| 6^{†} | C_{1} | 0.840 | 0.859 | 0.119 | 0.870 | 0.875 | 0.078 |
| 7 | (¬C_{1}∧C_{2})∨(C_{1}∧¬C_{2}) | 0.330 | 0.340 | 0.124 | 0.817 | 0.821 | 0.079 |
| 8^{‡} | C_{1}∨C_{2} | 0.957 | 0.974 | 0.050 | 0.808 | 0.813 | 0.087 |
| 9 | ¬(C_{1}∨C_{2}) | 0.043 | 0.026 | 0.050 | 0.192 | 0.187 | 0.087 |
| 10 | (C_{1}∧C_{2})∨¬(C_{1}∨C_{2}) | 0.670 | 0.660 | 0.124 | 0.183 | 0.179 | 0.079 |
| 11 | ¬C_{1} | 0.160 | 0.141 | 0.119 | 0.130 | 0.125 | 0.078 |
| 12 | ¬C_{1}∨C_{2} | 0.786 | 0.799 | 0.127 | 0.121 | 0.115 | 0.072 |
| 13 | ¬C_{2} | 0.258 | 0.240 | 0.152 | 0.072 | 0.061 | 0.054 |
| 14 | C_{1}∨¬C_{2} | 0.884 | 0.899 | 0.088 | 0.062 | 0.053 | 0.047 |
| 15 | ¬(C_{1}∧C_{2}) | 0.374 | 0.367 | 0.162 | 0.009 | 0.006 | 0.010 |
| 16 | All results are positive | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |

SD: standard deviation. ^{‡}Optimal combination using ranking criteria 1, 2 and 3. ^{†}Optimal combination using ranking criterion 4.

SNP data: selected logical combinations of the three classifiers, at two thresholds.

**∼ 1000 positives**

| **2^{K}-bit code** | **Combination** | **Sens. mean** | **Sens. median** | **Sens. SD** | **Spec. mean** | **Spec. median** | **Spec. SD** |
|---|---|---|---|---|---|---|---|
| 2 | C_{1}∧C_{2}∧C_{3} | 0.083 | 0.082 | 0.017 | 1.000 | 1.000 | 0.0000 |
| 4 | C_{2}∧C_{3} | 0.098 | 0.097 | 0.018 | 1.000 | 1.000 | 0.0000 |
| 6 | C_{1}∧C_{3} | 0.144 | 0.143 | 0.020 | 1.000 | 1.000 | 0.0000 |
| 18 | C_{1}∧C_{2} | 0.483 | 0.481 | 0.053 | 1.000 | 1.000 | 0.0000 |
| 24 | At least two classifiers | 0.559 | 0.558 | 0.053 | 1.000 | 1.000 | 0.0000 |
| 64 | C_{2}∨C_{3} | 0.645 | 0.647 | 0.047 | 0.997 | 0.997 | 0.0001 |
| 96 | C_{1}∨C_{3} | 0.869 | 0.869 | 0.035 | 0.997 | 0.997 | 0.0001 |
| 120 | C_{1}∨C_{2} | 0.932 | 0.933 | 0.021 | 0.998 | 0.998 | 0.0001 |
| 128^{‡} | C_{1}∨C_{2}∨C_{3} | 0.943 | 0.944 | 0.018 | 0.996 | 0.996 | 0.0002 |

**∼ 5000 positives**

| **2^{K}-bit code** | **Combination** | **Sens. mean** | **Sens. median** | **Sens. SD** | **Spec. mean** | **Spec. median** | **Spec. SD** |
|---|---|---|---|---|---|---|---|
| 2 | C_{1}∧C_{2}∧C_{3} | 0.065 | 0.066 | 0.007 | 1.000 | 1.000 | 0.0000 |
| 4 | C_{2}∧C_{3} | 0.079 | 0.079 | 0.007 | 1.000 | 1.000 | 0.0000 |
| 6 | C_{1}∧C_{3} | 0.123 | 0.123 | 0.008 | 1.000 | 1.000 | 0.0000 |
| 18 | C_{1}∧C_{2} | 0.439 | 0.440 | 0.025 | 1.000 | 1.000 | 0.0000 |
| 24 | At least two classifiers | 0.510 | 0.511 | 0.026 | 1.000 | 1.000 | 0.0000 |
| 64 | C_{2}∨C_{3} | 0.601 | 0.601 | 0.024 | 0.987 | 0.987 | 0.0002 |
| 96 | C_{1}∨C_{3} | 0.850 | 0.852 | 0.018 | 0.991 | 0.991 | 0.0004 |
| 120 | C_{1}∨C_{2} | 0.918 | 0.919 | 0.011 | 0.994 | 0.994 | 0.0004 |
| 128^{‡} | C_{1}∨C_{2}∨C_{3} | 0.930 | 0.931 | 0.010 | 0.986 | 0.986 | 0.0004 |

SD: standard deviation. ^{‡}Optimal combination using ranking criteria 1, 2, 3 and 4.

For the two cases with fewer than 500 data points (swine flu and subcellular localisation), the most frequently optimal logical combination was selected in at most a proportion 0.526 of the Gibbs sampler iterations. For the SNP data, with more than 500000 data points, the optimal classifier was the same at every iteration. This is expected, as more data should increase the confidence with which the optimal classifier can be identified.

Posterior density plots for the sensitivity and specificity of all possible logical combinations of the swine flu classifiers are presented in the Additional file. Interestingly, ranking criterion 4 identifies the NPA classifier alone (C_{1}) as best in a majority of MCMC iterations, yet the average criterion 4 score is slightly higher for the union of the NPA and NS classifiers (C_{1}∨C_{2}). We note that the standard deviations of the ranking scores are quite large relative to the differences between scores, suggesting that combinations of classifiers other than the one identified as ‘best’ remain plausible candidates. Nevertheless, it is clear that the union of all classifiers ranks well, if not best, for all data sets and every ranking criterion examined here.

Run times

Run times for the various data sets are shown in Table

| **Data** | **No. subjects** | **WinBUGS** | **WinBUGS (R)** | **R** |
|---|---|---|---|---|
| Swine flu | 48 | 625 s | 0 s | 1 s |
| Chloroplasts | 357 | 624 s | 1 s | 12 s |
| SNP | 541094 | 8 hrs | 2917 s | 12 s |

The run times of the 2000-iteration runs from the previous sub-section comprised an R component and a WinBUGS (called from R) component, shown in the last two columns. WinBUGS apparently runs faster when called from R. Although the run time of the R combination algorithm (second sub-section of the Methods) necessarily grows rapidly with the number of classifiers, it was negligible for the three data sets considered here.

Conclusions

The method presented in this paper addresses two significant problems with ubiquitous applications in bioinformatics: comparing binary classifiers in the absence of a gold standard, and identifying the optimal logical combination of such classifiers. Using Bayesian models developed for evaluating medical diagnostic tests, we present the first applications of these models in the bioinformatics domain and demonstrate their feasibility and utility for comparing classifiers on genomic scale data sets. A new, concise and highly efficient implementation of these models was developed in WinBUGS, and is the first freely available implementation applicable to an arbitrary number of classifiers. To identify the optimal logical combination of classifiers, we developed an entirely new algorithm and again demonstrated its feasibility for genomic scale data sets. The algorithm is the first to employ the above-mentioned Bayesian models to evaluate logical combinations of classifiers, and indeed apparently the first to systematically evaluate all logical combinations. It is implemented in R and is freely available.

The methods were evaluated on a protein subcellular localisation data set for which a gold standard data set was available for the purpose of comparison. Some discrepancy in the estimates of sensitivity and specificity was expected, because a key assumption of the model - conditional independence of the classifiers - is often violated in practice. However, we found that the discrepancy was in most cases small and, more importantly, that the method was able to correctly rank the classifiers.

In all of our examples, a simple union of the classifiers was found to be optimal according to three out of four alternative ranking criteria (and in some cases also by the fourth). While this finding is unlikely to be general, we propose as a rule of thumb that the union of classifiers is likely to be close to the optimal logical combination.

Methods

Estimating sensitivities and specificities

In this section, we describe a Bayesian model that combines features of the models of

Following

The conditional dependencies of the model are illustrated in the figure below. Let Y_{kn} be the outcome of Classifier k applied to individual n, with Y_{kn}=1 indicating a positive result and Y_{kn}=0 indicating a negative result. These outcomes are modelled as independent Bernoulli trials, conditional on the true classification of each individual (that is, the classifiers are assumed to be conditionally independent). Let T_{n} denote the true classification of individual n, and let p_{k} and q_{k} denote the true positive and false positive rates of Classifier k.


**Conditional dependencies of the model.** The dependencies of parameters in the model. T_{n} is the true classification of individual n; p_{k} and q_{k} are the probabilities of a true positive and a false positive (respectively) for classifier k; Y_{kn} is the classification of individual n by classifier k.

Let the proportion of the population under study that has the feature of interest (the prevalence) be π. Then each T_{n} is a Bernoulli trial with success probability π, and conditional on T_{n}, the Y_{kn} are the outcomes of Bernoulli trials with success probability p_{k} if T_{n}=1 and q_{k} if T_{n}=0.

This part of the model would be the same regardless of whether one adopted a Bayesian approach. What makes a model Bayesian is the specification of prior probabilities for the parameters of interest. Here we assign uniform priors on the interval [0,1] for each of the parameters π, p_{k} and q_{k}. This means that, prior to observing the data, all possible values between 0 and 1 are considered equally likely. In addition, the model applies the constraints p_{k}≥q_{k}, since a classifier would be better discarded if it were more likely to classify an individual as positive when that individual is actually negative. Note that this inequality introduces dependence between p_{k} and q_{k} for a single classifier, but does not violate the conditional independence of distinct classifiers.
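Because the full conditionals implied by this likelihood-and-prior specification are standard Bernoulli and Beta distributions, a Gibbs sampler for the model is easy to sketch. The following Python sketch is an illustrative re-implementation under our notation (T_{n}, p_{k}, q_{k}, π), not the paper's WinBUGS code; here the constraint p_{k}≥q_{k} is enforced by simple rejection, which is one of several possibilities.

```python
import random

def gibbs(Y, iters=2000, seed=1):
    """Gibbs sampler for the latent-class model:
    T_n ~ Bernoulli(pi); Y[k][n] | T_n ~ Bernoulli(p_k if T_n else q_k),
    with Uniform(0,1) priors on pi, p_k, q_k and the constraint p_k >= q_k."""
    rng = random.Random(seed)
    K, N = len(Y), len(Y[0])
    pi, p, q = 0.5, [0.7] * K, [0.3] * K
    T = [rng.random() < 0.5 for _ in range(N)]
    samples = []
    for _ in range(iters):
        # 1. Sample each true classification T_n from its full conditional.
        for n in range(N):
            a, b = pi, 1.0 - pi
            for k in range(K):
                a *= p[k] if Y[k][n] else 1.0 - p[k]
                b *= q[k] if Y[k][n] else 1.0 - q[k]
            T[n] = rng.random() < a / (a + b)
        npos = sum(T)
        # 2. Sample prevalence: pi ~ Beta(1 + positives, 1 + negatives).
        pi = rng.betavariate(1 + npos, 1 + N - npos)
        # 3. Sample p_k, q_k from their Beta full conditionals,
        #    rejecting draws that violate p_k >= q_k.
        for k in range(K):
            tp = sum(1 for n in range(N) if T[n] and Y[k][n])
            fp = sum(1 for n in range(N) if not T[n] and Y[k][n])
            while True:
                pk = rng.betavariate(1 + tp, 1 + npos - tp)
                qk = rng.betavariate(1 + fp, 1 + (N - npos) - fp)
                if pk >= qk:
                    p[k], q[k] = pk, qk
                    break
        samples.append((pi, list(p), list(q)))
    return samples
```

Posterior means computed over the later samples estimate the sensitivity p_{k} and false positive rate q_{k} (specificity = 1 − q_{k}) of each classifier, and the prevalence π.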

Having defined the likelihood and prior probabilities, the model is straightforward to implement with the freely available Bayesian software package WinBUGS (

Inferring the best combination of classifiers

Another important goal is to decide how best to combine a collection of classifiers using the logical operators AND (∧), OR (∨) and NOT (¬). As applied to classifiers, the complement (¬C) classifies an individual as positive precisely when C classifies it as negative; conjunctions and disjunctions are defined similarly, by applying the corresponding logical operation to the classifiers' outcomes.

The sensitivity and specificity of any logical combination of classifiers can be calculated from the sensitivities and specificities of the constituent classifiers, if one assumes conditional independence. Here we propose to evaluate the sensitivity and specificity of every possible logical combination, using estimates for the constituents obtained using the method of the previous sub-section. This can therefore be done in the absence of a gold standard.

One problem is that there are infinitely many semantically correct ways to arrange the symbols C_{1},…,C_{K} and the logical operators. However, any logical combination of C_{1},…,C_{K} can be reformulated as a disjoint union of intersections of the form (A_{1}∧A_{2}∧…∧A_{K}), where each A_{k} is either C_{k} or ¬C_{k}.

The sensitivity and specificity of a logical combination of classifiers can be built up from the following primitive rules applied to the canonical forms. First, the sensitivity (SENS) and specificity (SPEC) of the complement of a classifier C are given by:

SENS(¬C) = 1 − SENS(C)

and

SPEC(¬C) = 1 − SPEC(C).

Second, the sensitivity of an intersection of two classifiers is given by:

SENS(C_{1}∧C_{2}) = SENS(C_{1}) × SENS(C_{2})

and the specificity by:

SPEC(C_{1}∧C_{2}) = 1 − (1 − SPEC(C_{1}))(1 − SPEC(C_{2})).

Here we have freely used the assumption of conditional independence.

To systematically evaluate the sensitivity and specificity of all intersections of the form (A_{1}∧…∧A_{K}) mentioned above, we assign a K-bit code to each intersection, with the k-th bit equal to 0 if A_{k}=C_{k} and 1 if A_{k}=¬C_{k}. There are 2^{K} such intersections, and we compute their sensitivities and specificities in the order indicated by their bit codes. Let the sensitivity and specificity of the intersection with bit code j be denoted SENS_K(j) and SPEC_K(j), for j = 0,…,2^{K}−1.

It remains to compute the sensitivity and specificity of any disjoint union of these 2^{K} intersections. A disjoint union of two classifiers A and B has sensitivity:

SENS(A∨B) = SENS(A) + SENS(B)

and specificity:

SPEC(A∨B) = SPEC(A) + SPEC(B) − 1,

since for disjoint classifiers both the true positive rates and the false positive rates add.

These two rules suffice to calculate the sensitivity and specificity of any disjoint union, but again the 2^{K} intersections must be processed systematically. We assign a second, 2^{K}-bit code to each disjoint union, with bit j set to 1 if the intersection with K-bit code j is included in the union. For example, with K=2 the 2^{K}-bit code 0011 (bits 0 and 1 set) represents (C_{1}∧C_{2})∨(¬C_{1}∧C_{2}), which is equivalent to C_{2}. Denoting the sensitivity and specificity of the disjoint union with 2^{K}-bit code c by SENS_2K(c) and SPEC_2K(c), we have:

SENS_2K(c) = Σ_{j∈c} SENS_K(j)

and

SPEC_2K(c) = 1 − Σ_{j∈c} (1 − SPEC_K(j)),

where j∈c indicates that bit j of c is set.
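The two-level bit-code scheme can be sketched in Python as follows (an illustrative implementation under the stated conditional-independence assumption, using the convention that bit k of an intersection's K-bit code is set when classifier k is complemented):

```python
def intersection_sens_spec(sens, spec, j):
    """SENS_K(j) and the false positive rate of the intersection
    A_1 & ... & A_K whose K-bit code is j (bit k set: A_k = not C_k).
    Assumes conditional independence of the classifiers."""
    s, f = 1.0, 1.0  # P(positive | truly positive), P(positive | truly negative)
    for k in range(len(sens)):
        if (j >> k) & 1:          # A_k = not C_k
            s *= 1.0 - sens[k]
            f *= spec[k]
        else:                     # A_k = C_k
            s *= sens[k]
            f *= 1.0 - spec[k]
    return s, f

def union_sens_spec(sens, spec, code):
    """SENS_2K(code) and SPEC_2K(code) of the disjoint union whose
    2^K-bit code is `code`: bit j set means intersection j is included."""
    S = F = 0.0
    for j in range(2 ** len(sens)):
        if (code >> j) & 1:
            s, f = intersection_sens_spec(sens, spec, j)
            S += s  # sensitivities of disjoint parts add
            F += f  # false positive rates of disjoint parts add
    return S, 1.0 - F

def all_combinations(sens, spec):
    """Sensitivity and specificity of all 2**(2**K) logical combinations."""
    return {c: union_sens_spec(sens, spec, c)
            for c in range(2 ** (2 ** len(sens)))}
```

For example, with K=2 the union code 0b0111 covers the intersections C_{1}∧C_{2}, ¬C_{1}∧C_{2} and C_{1}∧¬C_{2}, i.e. C_{1}∨C_{2}; the combination codes reported in the tables above appear to be these bitmasks offset by one.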

We use the model from the previous sub-section to determine the optimal logical combination as follows. Each iteration of the Gibbs sampler produces an estimate of the sensitivity and specificity of each of the K constituent classifiers. From these, the sensitivity and specificity of every logical combination can be computed using the rules above, and the combination that is optimal according to a given ranking criterion identified.

The optimal logical combination, thus determined, may differ from one iteration of the Gibbs sampler to the next. We therefore estimate the probability that any given logical combination is optimal as the proportion of Gibbs sampler iterations in which that combination was optimal. The overall best combination is then the one that is found to be optimal in the greatest proportion of iterations.
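The per-iteration vote over combinations can be sketched as follows (function names are ours; the four criteria are those listed in the Results: product, sum of squares, sum of absolute values, and minimum of sensitivity and specificity):

```python
from collections import Counter

def criteria(sens, spec):
    """The four ranking criteria applied to a (sensitivity, specificity) pair:
    (1) product, (2) sum of squares, (3) sum of absolute values, (4) minimum."""
    return (sens * spec, sens**2 + spec**2, sens + spec, min(sens, spec))

def best_combination(per_iteration, criterion=0):
    """per_iteration: list of dicts, one per Gibbs iteration, mapping a
    combination code to its (sensitivity, specificity) estimate for that
    iteration. Returns the code that was optimal under the chosen criterion
    in the greatest proportion of iterations."""
    votes = Counter(
        max(combos, key=lambda c: criteria(*combos[c])[criterion])
        for combos in per_iteration
    )
    return votes.most_common(1)[0][0]
```

A toy run with two hypothetical combinations (codes 6 and 8) illustrates how criterion 4 (minimum) can favour a different combination than criterion 1 (product), mirroring the swine flu results above.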

Note that this method exhaustively enumerates all ways in which classifiers can be combined, if all that is known about each individual is the classifications (i.e. only data of the form illustrated in the classification matrix above).

Note that enumeration of all possible logical combinations of classifiers necessarily requires computational time proportional to the number of such combinations, 2^{2^{K}}, and is therefore only feasible for modest numbers of classifiers K.
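To make this growth concrete, a quick stand-alone calculation of the number of Boolean functions of K inputs:

```python
# The number of distinct logical combinations of K binary classifiers
# equals the number of Boolean functions of K inputs: 2**(2**K).
counts = {K: 2 ** (2 ** K) for K in range(1, 6)}
print(counts)  # {1: 4, 2: 16, 3: 256, 4: 65536, 5: 4294967296}
```

So exhaustive enumeration is comfortable for the two or three classifiers considered in this paper, but already strained by five.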

R code implementing this method is available in Additional file

System and Implementation

The model was implemented in WinBUGS v1.7, the freely available Microsoft^{®} Windows-based Bayesian analysis package, and run on a Dell^{TM} Optiplex^{TM} 980 computer with a quad-core 3.33 GHz Intel^{®} Core^{TM} i5 processor. To combine classifiers, output from the WinBUGS runs was loaded into the statistical package R

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JMK conceived the idea of the methods, algorithms and applications, developed them on paper, wrote the WinBUGS code, and worked extensively on all parts of the text; CMD wrote the R code, ran all of the computational experiments, produced all tables and figures, and wrote the initial draft of the text; SEB searched the literature and the internet for relevant previous work and suitable data sets, in particular identifying the sub-cellular localisation data as suitable for verifying our methods, and worked extensively on all parts of the text. All authors read and approved the final manuscript.

Acknowledgements

The swine flu test results were provided by Shiv Erigadoo and Khoa Tran of the Department of Respiratory Medicine, Logan Hospital, Queensland, Australia. John Attia and Chris Oldmeadow of the School of Medicine and Public Health at The University of Newcastle provided the age-related macular degeneration data set, which was collected as part of the Blue Mountain Eye Study. David Albrecht from the Faculty of IT at Monash University provided helpful comments on the manuscript. This work was partially funded by ARC Discovery Grants DP0879308 and DP1095849.