Biomedical Proteomics Research Group, Department of Structural Biology and Bioinformatics, Medical University Centre, Geneva, Switzerland

Swiss Institute of Bioinformatics, Medical University Centre, Geneva, Switzerland

Abstract

Background

Receiver operating characteristic (ROC) curves are useful tools to evaluate classifiers in biomedical and bioinformatics applications. However, conclusions are often reached through inconsistent use or insufficient statistical analysis. To support researchers in their ROC curve analyses, we developed pROC, a package for R and S+.

Results

With data previously imported into the R or S+ environment, the pROC package builds ROC curves and computes the full or partial AUC. It provides functions for confidence intervals, statistical tests to compare ROC curves, and several smoothing methods, together with plotting facilities to display the curves and their confidence intervals.

Conclusions

pROC is a package for R and S+ dedicated to ROC analysis. It is available from CRAN, is distributed under the GNU GPL, and can be used by academic and non-academic users without restriction.

Background

A ROC plot displays the performance of a binary classification method with continuous or discrete ordinal output. It shows the sensitivity (the proportion of correctly classified positive observations) and specificity (the proportion of correctly classified negative observations) as the output threshold is moved over the range of all possible values. ROC curves do not depend on class probabilities, facilitating their interpretation and comparison across different data sets. Originally invented for the detection of radar signals, they were soon applied to psychology and then to medical fields, and are now widely used in biomedical and bioinformatics applications.
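As an illustration of this construction, here is a minimal base-R sketch (not pROC's API; the function name and toy data are ours) that computes one sensitivity/specificity pair per candidate threshold:

```r
# Sketch: build ROC points by sweeping a threshold over the marker values.
# Observations strictly above the threshold are called positive.
roc_points <- function(controls, cases) {
  thresholds <- sort(unique(c(-Inf, controls, cases, Inf)))
  t(sapply(thresholds, function(th) {
    c(threshold   = th,
      sensitivity = mean(cases > th),     # correctly classified positives
      specificity = mean(controls <= th)) # correctly classified negatives
  }))
}

# toy marker values: higher values indicate the positive class
pts <- roc_points(controls = c(1, 2, 3), cases = c(2, 3, 4))
```

Plotting `sensitivity` against `1 - specificity` over these rows draws the empirical ROC curve.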

In the ROC context, the area under the curve (AUC) measures the performance of a classifier and is frequently applied for method comparison. A higher AUC means a better classification. However, comparisons between AUCs are often performed without a proper statistical analysis, partly due to the lack of relevant, accessible and easy-to-use tools providing such tests. Small differences in AUCs can be significant if ROC curves are strongly correlated, and without statistical testing two AUCs can be incorrectly labelled as similar. In contrast, a larger difference can be non-significant in small samples, as shown by Hanczar et al.

Software for ROC analysis already exists, and a previous review compared several of the available tools. However, as the comparison below shows, no existing R package covered all of these features.

The R statistical environment offers several packages for ROC analysis:

1) ROCR provides a flexible framework to compute and plot many performance measures, including ROC curves.

2) The verification package, aimed at weather forecast verification, includes ROC plotting functions and a one-sample test of the AUC.

3) Bioconductor includes the ROC package.

4) Pcvsuite features advanced functions such as covariate adjustment and ROC regression.

Table 1. Features of the R packages for ROC analysis

| Feature | ROCR | Verification | ROC (Bioconductor) | pcvsuite | pROC |
| --- | --- | --- | --- | --- | --- |
| Smoothing | No | Yes | No | Yes | Yes |
| Partial AUC | Only SP^1 | No | Only SP^1 | Only SP | SP and SE |
| Confidence intervals | Partial^2 | Partial^3 | No | Partial^4 | Yes |
| Plotting confidence intervals | Yes | Yes | No | Yes | Yes |
| Statistical tests | No | AUC (one sample) | No | AUC, pAUC, SP | AUC, pAUC, SP, SE, ROC |
| Available on CRAN | Yes | Yes | No (Bioconductor) | No | Yes |

^1 Partial AUC only between 100% and a specified cutoff of specificity

^2 Bootstrapped ROC curves must be computed by the user

^3 Only threshold averaging

^4 Only at a given specificity or inverse ROC

The pROC package, for R and S+, was designed to provide these missing features in a single, consistent toolset.

Implementation

AUC and pAUC

In pROC, any partial region of the curve (t_0, t_1) can be analyzed, in both specificity and sensitivity, and not only portions anchored at 100% specificity or 100% sensitivity. Optionally, the pAUC can be standardized with the formula by McClish:

pAUC_s = 1/2 × (1 + (pAUC − min) / (max − min))

where min is the pAUC over the same region of the diagonal (non-discriminant) ROC curve and max is the pAUC over the same region of the perfect ROC curve. The standardized pAUC is therefore always 1 for a perfect curve and 0.5 for a non-discriminant curve, whatever the partial region defined.
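As a sketch of this standardization (the function name is ours, not pROC's; `min` and `max` are the diagonal and perfect pAUCs over the region):

```r
# McClish standardization of a partial AUC.
# min: pAUC of the diagonal (non-discriminant) curve over the same region;
# max: pAUC of the perfect curve over that region.
standardize_pauc <- function(pauc, min, max) {
  0.5 * (1 + (pauc - min) / (max - min))
}

# Example over the 90-100% specificity range (as fractions, not percent):
# perfect curve: max = 0.1 ; diagonal curve: min = 0.005
standardize_pauc(0.005, min = 0.005, max = 0.1)  # 0.5, non-discriminant
standardize_pauc(0.100, min = 0.005, max = 0.1)  # 1.0, perfect
```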

Comparison

Two ROC curves are "paired" (or sometimes termed "correlated" in the literature) if they derive from multiple measurements on the same sample. Several tests exist to compare paired or unpaired ROC curves.

The bootstrap test to compare AUC or pAUC in pROC implements the method originally described by Hanley and McNeil. It defines:

Z = (θ_1 − θ_2) / sd(θ_1 − θ_2)

where θ_1 and θ_2 are the two (partial) AUCs. Unlike Hanley and McNeil, we compute sd(θ_1 − θ_2) with N (defaults to 2000) bootstrap replicates. In each replicate r, the resampled (partial) AUCs θ_1,r and θ_2,r and their difference D_r = θ_1,r − θ_2,r are computed. Finally, we compute sd(θ_1 − θ_2) = sd(D). As Z approximately follows a normal distribution, one- or two-tailed p-values are calculated accordingly.

Bootstrap is stratified by default; in this case the same number of case and control observations as in the original sample will be selected in each bootstrap replicate. Stratification can be disabled, so that observations are resampled regardless of their class labels. Repeats for the bootstrap and progress bars are handled by the plyr package.
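A simplified base-R sketch of this procedure (illustrative only; the helper names and toy data are ours, and pROC's implementation handles many more cases):

```r
# AUC as the probability that a random case exceeds a random control
# (ties count 1/2); equivalent to the trapezoidal AUC.
auc_wilcoxon <- function(controls, cases) {
  mean(outer(cases, controls, ">") + 0.5 * outer(cases, controls, "=="))
}

# Paired, stratified bootstrap test: resample case and control indices
# separately, reuse them for both markers, then derive Z from sd(D).
boot_roc_test <- function(ctrl1, case1, ctrl2, case2, n_boot = 2000) {
  d <- replicate(n_boot, {
    ci <- sample(length(ctrl1), replace = TRUE)  # resampled control indices
    ki <- sample(length(case1), replace = TRUE)  # resampled case indices
    auc_wilcoxon(ctrl1[ci], case1[ki]) - auc_wilcoxon(ctrl2[ci], case2[ki])
  })
  z <- (auc_wilcoxon(ctrl1, case1) - auc_wilcoxon(ctrl2, case2)) / sd(d)
  2 * pnorm(-abs(z))  # two-tailed p-value
}
```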

The second method to compare AUCs implemented in pROC is the nonparametric approach of DeLong, based on U-statistics theory; both paired and unpaired versions are available.

Venkatraman and Begg proposed a permutation test that compares two paired ROC curves over their entire range, rather than comparing only their AUCs.

Finally, a bootstrap-based test is implemented to compare the two ROC curves at a given level of specificity or sensitivity, as proposed by Pepe et al.

Confidence intervals

CIs are computed with DeLong's method for AUCs, and with bootstrap for pAUCs and for sensitivities or specificities at given thresholds.
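As a sketch of the bootstrap variant (stratified resampling with percentile intervals; base R, helper names and toy data are ours):

```r
# AUC via the Wilcoxon statistic (ties count 1/2).
auc_wilcoxon <- function(controls, cases) {
  mean(outer(cases, controls, ">") + 0.5 * outer(cases, controls, "=="))
}

# Stratified percentile-bootstrap CI of the AUC: cases and controls are
# resampled separately, then empirical quantiles of the AUCs are taken.
boot_ci_auc <- function(controls, cases, n_boot = 2000, conf = 0.95) {
  aucs <- replicate(n_boot, {
    auc_wilcoxon(sample(controls, replace = TRUE),
                 sample(cases, replace = TRUE))
  })
  quantile(aucs, c((1 - conf) / 2, (1 + conf) / 2))
}

set.seed(42)
boot_ci_auc(controls = c(1, 2, 2, 3, 4), cases = c(2, 3, 4, 5, 6))
```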

Smoothing

Several methods to smooth a ROC curve are also implemented. Binormal smoothing relies on the assumption that there exists a monotone transformation to make both case and control values normally distributed.

This is different from the method described by Metz et al.
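A rough base-R sketch of binormal smoothing via linear regression on probit axes (a simplification for illustration; pROC's actual estimator may differ in detail, and the function name is ours):

```r
# Binormal smoothing: assume qnorm(SE) is linear in qnorm(1 - SP), fit the
# line over the interior points, and map it back through pnorm().
binormal_smooth <- function(sp, se, n_out = 100) {
  keep <- sp > 0 & sp < 1 & se > 0 & se < 1  # qnorm is infinite at 0 and 1
  x <- qnorm(1 - sp[keep])
  y <- qnorm(se[keep])
  ab <- coef(lm(y ~ x))  # intercept a and slope b of the probit-scale line
  sp_grid <- seq(0.001, 0.999, length.out = n_out)
  data.frame(specificity = sp_grid,
             sensitivity = pnorm(ab[1] + ab[2] * qnorm(1 - sp_grid)))
}

# toy curve generated from a true binormal model with a = 1, b = 1
sp <- seq(0.05, 0.95, by = 0.05)
sm <- binormal_smooth(sp, se = pnorm(1 + qnorm(1 - sp)))
```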

Results and Discussion

We first evaluate the accuracy of the ROC comparison tests. Results in Additional File 1 assess the uniformity of the p-values under the null hypothesis and the correlation between the different tests.

**Assessment of the ROC comparison tests**. We evaluate the uniformity of the tests under the null hypothesis (ROC curves are not different), and the correlation between the different tests.


**Histograms of the frequency of 600 test p-values under the null hypothesis (ROC curves are not different)**. A: DeLong's paired test, B: DeLong's unpaired test, C: bootstrap paired test (with 10000 replicates), D: bootstrap unpaired test (with 10000 replicates) and E: Venkatraman's test (with 10000 permutations).


**Correlations between DeLong and bootstrap paired tests**. X-axis: DeLong's test; Y-axis: bootstrap test with increasing numbers of bootstrap replicates. A: 10, B: 100, C: 1000 and D: 10000.


**Correlation between DeLong and Venkatraman's test**. X-axis: DeLong's test; Y-axis: Venkatraman's test with 10000 permutations.


We now present how to perform a typical ROC analysis with pROC.

Case study on clinical aSAH data

The purpose of the case presented here is to identify patients at risk of poor post-aSAH outcome, as they require specific healthcare management; therefore the clinical test must be highly specific. Detailed results of the study are reported elsewhere.

ROC curves were generated with pROC for the WFNS score and the S100β biomarker.

AUC and pAUC

Since we are interested in a clinical test with a high specificity, we focused on partial AUC between 90% and 100% specificity.

The best pAUC is obtained by WFNS, with 3.1%, closely followed by S100β with 3.0% (Figure 1).

Figure 1. ROC curves of WFNS and S100β

**ROC curves of WFNS and S100β**. ROC curves of WFNS (blue) and S100β (green). The black bars are the confidence intervals of WFNS for the threshold 4.5 and the light green area is the confidence interval shape of S100β. The vertical light grey shape corresponds to the pAUC region. The pAUC of both empirical curves is printed in the middle of the plot, with the p-value of the difference computed by a bootstrap test on the right.

In the rest of this paper, we report only non-standardized pAUCs.

CI

Given the pAUC of WFNS, it makes sense to compute a 95% CI of the pAUC to assess the variability of the measure. In this case, we performed 10000 bootstrap replicates and obtained the 1.6-5.0% interval. In our experience, 10000 replicates give a fair estimate of the second significant digit. A lower number of replicates (for example 2000, the default) gives a good estimate of the first significant digit only. Other confidence intervals can be computed. The threshold with the point farthest from the diagonal line in the specified region was determined with pROC to be 4.5 with the coords function.

The confidence intervals of a threshold or of a predefined level of sensitivity or specificity answer different questions. For instance, it would be wrong to compute the CI of the threshold 4.5 and report only the CI bound of sensitivity without reporting the CI bound of specificity as well. Similarly, determining the sensitivity and specificity of the cut-off 4.5 and then computing both CIs separately would also be inaccurate.
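With pROC itself, these quantities might be obtained as follows (a sketch based on the pROC documentation: the aSAH example dataset and its `outcome` and `wfns` columns ship with the package, and the exact option names are assumptions from that documentation):

```r
library(pROC)  # assumed installed from CRAN
data(aSAH)     # example dataset bundled with pROC

roc_wfns <- roc(aSAH$outcome, aSAH$wfns, percent = TRUE)

# threshold of the point farthest from the diagonal (Youden index)
coords(roc_wfns, "best", best.method = "youden")

# bootstrap CI of sensitivity and specificity at the threshold 4.5
# (the case study used boot.n = 10000)
ci.thresholds(roc_wfns, thresholds = 4.5, boot.n = 2000)
```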

Statistical comparison

The second best pAUC is that of S100β with 3.0%. The difference from WFNS is very small, and the bootstrap test of pROC shows that it is not significant (Figure 1).


The bootstrap test can be performed with the following code in R:
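The exact calls below are assumptions based on the pROC documentation (the aSAH dataset with its `outcome`, `wfns` and `s100b` columns ships with the package):

```r
library(pROC)  # assumed installed from CRAN
data(aSAH)     # example dataset bundled with pROC

# paired ROC curves with a partial AUC restricted to 90-100% specificity
roc_wfns  <- roc(aSAH$outcome, aSAH$wfns,  percent = TRUE,
                 partial.auc = c(100, 90), partial.auc.focus = "specificity")
roc_s100b <- roc(aSAH$outcome, aSAH$s100b, percent = TRUE,
                 partial.auc = c(100, 90), partial.auc.focus = "specificity")

# stratified bootstrap test of the pAUC difference
roc.test(roc_wfns, roc_s100b, method = "bootstrap", boot.n = 10000)
```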

Smoothing

Whether or not to smooth a ROC curve is a difficult choice. It can be useful for ROC curves with only few points, for which the trapezoidal rule consistently underestimates the true AUC. This is the case for the WFNS score, which takes only a small number of discrete values.

(i) The normal fitting (red) gives a significantly lower AUC estimate (Δ = -5.1, p = 0.0006, Bootstrap test). This difference is due to the non-normality of WFNS. Distribution fitting can be very powerful when there is a clear knowledge of the underlying distributions, but should be avoided in other contexts.

(ii) The density (green) smoothing also produces a lower (Δ = -1.5, p = 6*10^{-7}) AUC. It is interesting to note that even with a smaller difference in AUCs, the p-value can be more significant due to a higher covariance.

(iii) The binormal smoothing (blue) gives a slightly but not significantly higher AUC than the empirical ROC curve (Δ = +2.4, p = 0.3). It is probably the best of the 3 smoothing estimates in this case (as mentioned earlier, we were expecting a higher AUC because the empirical AUC of WFNS was underestimated). For comparison, Additional File 5 shows the binormal smoothing of the same curve computed with pcvsuite and pROC.

Figure 2. ROC curve of WFNS and smoothing

**ROC curve of WFNS and smoothing**. Empirical ROC curve of WFNS is shown in grey with three smoothing methods: binormal (blue), density (green) and normal distribution fit (red).

**Binormal smoothing**. Binormal smoothing with pcvsuite (green, solid) and pROC (black, dashed).


Figure 3 shows the smoothing of the WFNS ROC curve performed with the GUI in S+.

Figure 3. Screenshot of pROC in S+

**Screenshot of pROC in S+ for smoothing WFNS ROC curve**. Top left: the General tab, where data is entered. Top right: the details about smoothing. Bottom left: the details for the plot. Checking the box "Add to existing plot" allows drawing several curves on a plot. Bottom right: the result in the standard S+ plot device.

Conclusion

In this case study we showed how to perform a typical ROC analysis with pROC.

Installation and usage

R

Loading the package and getting help:
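In a typical R session (assuming pROC has been installed from CRAN):

```r
# install.packages("pROC")  # installation, once, from CRAN
library(pROC)               # loading the package
?pROC                       # getting help
```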

S+

In addition to the command line functions, a GUI is then available in the statistics menu of S+.

Functions and methods

A summary of the functions available to the user in the command line version of pROC is shown in the tables below, together with the methods provided for standard functions.

Table 2. Functions provided in pROC

| Function | Description |
| --- | --- |
| are.paired | Determines if two ROC curves are possibly paired |
| auc | Computes the area under the ROC curve |
| ci | Computes the confidence interval of a ROC curve |
| ci.auc | Computes the confidence interval of the AUC |
| ci.se | Computes the confidence interval of sensitivities at given specificities |
| ci.sp | Computes the confidence interval of specificities at given sensitivities |
| ci.thresholds | Computes the confidence interval of thresholds |
| coords | Returns the coordinates (sensitivities, specificities, thresholds) of a ROC curve |
| roc | Builds a ROC curve |
| roc.test | Compares the AUC of two correlated ROC curves |
| smooth | Smooths a ROC curve |

Table 3. Methods provided by pROC for standard functions

| Method | Applies to |
| --- | --- |
| lines | ROC curves (roc) and smoothed ROC curves (smooth.roc) |
| plot | ROC curves (roc), smoothed ROC curves (smooth.roc) and confidence intervals (ci.se, ci.sp, ci.thresholds) |
| print | All pROC objects (auc, ci.auc, ci.se, ci.sp, ci.thresholds, roc, smooth.roc) |

Conclusions

The pROC package is a set of tools dedicated to ROC analysis in R and S+. It is freely available from CRAN and CSAN under the GNU GPL.

Availability and requirements

• Project name: pROC

• Project home page:

• Operating system(s): Platform independent

• Programming language: R and S+

• Other requirements: R ≥ 2.10.0 or S+ ≥ 8.1.1

• License: GNU GPL

• Any restrictions to use by non-academics: none

List of abbreviations

aSAH: aneurysmal subarachnoid haemorrhage; AUC: area under the curve; CI: confidence interval; CRAN: comprehensive R archive network; CSAN: comprehensive S-PLUS archive network; pAUC: partial area under the curve; ROC: receiver operating characteristic.

Authors' contributions

XR carried out the programming and software design and drafted the manuscript. NTu, AH, NTi provided data and biological knowledge, tested and critically reviewed the software and the manuscript. FL helped to draft and to critically improve the manuscript. JCS conceived the biomarker study, participated in its design and coordination, and helped to draft the manuscript. MM participated in the design and coordination of the bioinformatics part of the study, participated in the programming and software design and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank E. S. Venkatraman and Colin B. Begg for their support in the implementation of their test.

This work was supported by Proteome Science Plc.