Centre for Reviews and Dissemination, University of York, York, YO10 5DD, UK

Department of Health Sciences, University of York, York, UK

Abstract

Background

Misplaced or poorly calibrated confidence in healthcare professionals’ judgments compromises the quality of health care. Using higher fidelity clinical simulations to elicit clinicians’ confidence 'calibration' (i.e. overconfidence or underconfidence) in more realistic settings is a promising but underutilized tactic. In this study we examine nurses’ calibration of confidence with judgment accuracy for critical event risk assessment judgments in a high fidelity simulated clinical environment. The study also explores the effects of clinical experience, task difficulty and time pressure on the relationship between confidence and accuracy.

Methods

63 student and 34 experienced nurses made dichotomous risk assessments on 25 scenarios simulated in a high fidelity clinical environment. Each nurse also assigned a score (0–100) reflecting the level of confidence in their judgments. Scenarios were derived from real patient cases and classified as easy or difficult judgment tasks. Nurses made half of their judgments under time pressure. Confidence calibration statistics were calculated and calibration curves generated.

Results

Student nurses were underconfident (mean over/underconfidence score −1.05) and experienced nurses overconfident (mean over/underconfidence score 6.56), P = 0.01. No significant differences in calibration or resolution were found between the two groups (P = 0.80 and P = 0.51, respectively). Time pressure alone had no significant effect on confidence or accuracy, but there was a significant interaction between time pressure and task difficulty on confidence (P = 0.008): time pressure increased confidence in easy cases and reduced it in difficult cases. Judgment task difficulty had a significant impact on nurses’ judgmental accuracy and confidence. A 'hard-easy' effect was observed: nurses were overconfident in difficult judgments and underconfident in easy judgments.

Conclusion

Nurses were poorly calibrated when making risk assessment judgments in a high fidelity simulated setting. Nurses with more experience tended toward overconfidence. Whilst time pressure had little effect on calibration, nurses’ over/underconfidence varied significantly with the degree of task difficulty. More research is required to identify strategies to minimize such cognitive biases.

Background

The ability of nurses (like all clinicians) to be appropriately confident in their clinical judgments is an important part of safe and effective healthcare. Overconfidence in judgments such as “critical event” (e.g. cardiopulmonary arrest) risk assessment may result in delayed or inappropriate interventions.

Overconfidence is a reasoning bias that is not always mediated by clinical experience. Hausman et al.

Confidence in a judgment appears to be linked to the perceived difficulty of that judgment.

Time is a crucial factor when seeking to understand the relationship between confidence and performance.

Aside from clinical experience, task difficulty and time constraints, there are significant methodological challenges in seeking to examine the relationship between confidence and judgment performance. One of the most pertinent is the means by which judgments are elicited. Many calibration studies use paper-based scenarios to elicit clinicians’ judgment and confidence ratings. Paper-based simulation is easy to administer but is limited by its lack of face validity, a particularly important limitation for generalising to clinical environments in which clinical information is often perceptual. Using higher fidelity clinical simulation to elicit judgments in more realistic, perceptually rich settings is a promising alternative.

Methods

Calibration statistics

A series of calibration statistics were calculated: a calibration score, measures of over/underconfidence and resolution. The calibration score is a weighted squared deviation between the mean proportion of judgments that are correct and the mean confidence rating associated with each confidence category (see equation 1)

Equation 1: calibration statistic

\[ \text{Calibration} = \frac{1}{n}\sum_{j=1}^{J} n_j \, (f_j - c_j)^2 \]

where \(n\) is the total number of responses, \(n_j\) the number of responses in confidence category \(j\), \(f_j\) the mean confidence rating in category \(j\), and \(c_j\) the mean proportion correct in category \(j\).

The calibration score ranges from 0 (perfect calibration) to 1 (worst calibration). “Perfect” calibration is achieved when the percentage correct always matches the confidence level assigned to each category. In contrast, the worst calibration score of 1.0 would result from a participant always assigning a confidence score of 100 when their judgments are incorrect, and zero confidence when their judgments are correct.
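As a minimal illustration of Equation 1 (a sketch, not the study's analysis code), the calibration score can be computed by grouping confidence ratings into categories; the decile binning and 0–1 confidence scale here are assumptions of the sketch, chosen to match the category examples given in the text:

```python
import numpy as np

def calibration_score(confidence, correct):
    """Equation 1: weighted squared deviation between mean confidence
    and proportion correct within each confidence category.

    confidence: ratings on a 0-1 scale; correct: 0/1 outcomes.
    Decile categories (0.50-0.59, 0.60-0.69, ...) are an assumption
    of this sketch.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidence)
    bins = np.floor(confidence * 10).clip(0, 9).astype(int)
    total = 0.0
    for j in np.unique(bins):
        mask = bins == j
        n_j = mask.sum()                   # responses in category j
        f_j = confidence[mask].mean()      # mean confidence in category j
        c_j = correct[mask].mean()         # proportion correct in category j
        total += n_j * (f_j - c_j) ** 2
    return total / n

# Worst case from the text: full confidence on wrong answers,
# zero confidence on correct ones.
print(calibration_score([1.0, 1.0, 0.0, 0.0], [0, 0, 1, 1]))  # 1.0
```

When confidence always equals the proportion correct in its category, the score is 0, the "perfect calibration" case described above.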

The over/underconfidence score is an index of the relationship between confidence and accuracy. This score quantifies the deviation between confidence and proportion correct on the basis of the formula \( \text{Over/underconfidence} = \bar{f} - \bar{c} \), where \(\bar{f}\) is the mean confidence rating across all judgments and \(\bar{c}\) the overall proportion correct. Positive scores indicate overconfidence and negative scores underconfidence.
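The over/underconfidence index can be sketched as mean confidence minus overall percentage correct, here on the study's 0–100 confidence scale (an illustrative helper, not the authors' code):

```python
import numpy as np

def over_underconfidence(confidence, correct):
    """Mean confidence minus overall percentage correct.
    Positive values indicate overconfidence, negative values
    underconfidence. Confidence is on a 0-100 scale."""
    return float(np.mean(confidence) - 100.0 * np.mean(correct))

# Mean confidence of 90 but only 50% correct: overconfident by 40 points.
print(over_underconfidence([90, 95, 85, 90], [1, 1, 0, 0]))  # 40.0
```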

The resolution score measures a person’s discrimination ability by evaluating how well judges use their confidence ratings to differentiate correct from incorrect responses. Resolution is a weighted squared deviation between the mean proportion correct \(c_j\) for each confidence category (e.g. 0.50–0.59, 0.60–0.69 and so on) and the overall proportion correct \(\bar{c}\) (see Equation 2).

Equation 2: resolution

\[ \text{Resolution} = \frac{1}{n}\sum_{j=1}^{J} n_j \, (c_j - \bar{c})^2 \]

The resolution score ranges from zero to the knowledge index, \(\bar{c}(1-\bar{c})\). Dividing resolution by this maximum yields the normalised resolution score (see Equation 3).

Equation 3: normalised resolution

\[ \text{Normalised resolution} = \frac{\text{Resolution}}{\bar{c}(1-\bar{c})} \]

The normalised resolution score, which is independent of mean proportion correct, provides a more robust measure when comparing discrimination skills. Normalised resolution scores range from 0 to 1. A higher score is indicative of greater ability to differentiate correct from incorrect responses. The resolution statistic helps further decompose a participant’s judgmental achievement. For instance, if a participant always has a confidence score of 100 on wrong judgments and has zero confidence on correct judgments, the calibration score would be terrible but the resolution score would be perfect. In this paper we report only the results for normalised resolution scores.
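A sketch of Equations 2 and 3, reusing the same decile-category assumption as above (illustrative code, not the study's implementation), also reproduces the extreme case just described, worst calibration with perfect resolution:

```python
import numpy as np

def resolution(confidence, correct):
    """Equation 2: weighted squared deviation between the proportion
    correct in each confidence category and the overall proportion
    correct. Decile categories are an assumption of this sketch."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    c_bar = correct.mean()
    bins = np.floor(confidence * 10).clip(0, 9).astype(int)
    total = sum((bins == j).sum() * (correct[bins == j].mean() - c_bar) ** 2
                for j in np.unique(bins))
    return total / n

def normalised_resolution(confidence, correct):
    """Equation 3: resolution divided by its maximum, the knowledge
    index c_bar * (1 - c_bar)."""
    c_bar = float(np.mean(correct))
    return resolution(confidence, correct) / (c_bar * (1.0 - c_bar))

# Extreme case: confidence 1.0 on every wrong answer, 0.0 on every
# correct one -> calibration is worst possible, yet the participant
# perfectly separates correct from incorrect responses.
print(normalised_resolution([1.0, 1.0, 0.0, 0.0], [0, 0, 1, 1]))  # 1.0
```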

Calibration curve analysis

We constructed calibration curves as another means of examining the relationship between probability judgments and confidence ratings. Each curve is derived by plotting the proportion correct on the y axis against the confidence rating on the x axis.
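The curve construction can be sketched as follows; the bin edges are an assumption of this sketch (the sub-50 exclusion mirrors the filtering described later in the Data analysis section):

```python
import numpy as np

def calibration_curve_points(confidence, correct,
                             edges=(50, 60, 70, 80, 90, 101)):
    """Points for a calibration curve: mean confidence (x) against
    proportion correct (y) within each confidence bin. Confidence is
    on a 0-100 scale; ratings below 50 are omitted, and the decile
    bin edges are assumptions of this sketch."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    xs, ys = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence >= lo) & (confidence < hi)
        if mask.any():
            xs.append(float(confidence[mask].mean()))
            ys.append(float(correct[mask].mean()))
    return xs, ys

# Points below the identity line (y < x/100) indicate overconfidence;
# points above it indicate underconfidence.
print(calibration_curve_points([55, 65, 65], [1, 0, 1]))
```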

The participant sample

We sampled 34 experienced nurses from the population of ward and critical care nurses in North Yorkshire and 63 2nd and 3rd year nursing students from the undergraduate population of the University of York, UK. Given the lower recruitment costs associated with student nurses, a 2:1 ratio of students to experienced nurses was used for recruitment. Using modestly unequal independent samples, such as a 2:1 ratio, can yield substantial cost savings with little loss of statistical power.

The clinical scenarios and judgment criteria

Twenty-five scenarios were simulated using a high fidelity mock-up of an emergency admission hospital room. Scenarios were generated by randomly sampling patient cases from a dataset of emergency admissions in one NHS District General Hospital during March 2000.

Scenarios were used to simulate the five information cues important for critical event risk assessment — systolic blood pressure, heart rate, respiratory rate, temperature and level of consciousness — using a SimMan™ patient simulator (Stavanger, Norway).

The 25 clinical scenarios were divided into two blocks: scenarios (1–12) and scenarios (13–25). Nurses were placed under time pressure in the first block by allowing only 20 seconds per scenario for a judgment. No time pressure was placed on judgments made in the second block.

Judgment task difficulty

To investigate whether nurses’ confidence calibration was affected by the ease or difficulty of the clinical judgment task, we explored the uncertainty associated with each task. The judgment rule associated with each task was, “if information values are above a clinically significant threshold then classify as ‘at risk’ of a critical event”. We used the Modified Early Warning Scoring system (MEWS) to classify each scenario as an easy or difficult judgment task.
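The judgment rule above can be sketched as a simple threshold check over the five cues. The numeric thresholds below are purely illustrative assumptions; they are not the MEWS cut-offs used in the study:

```python
# Illustrative thresholds only -- NOT the MEWS cut-offs used in the study.
THRESHOLDS = {
    "systolic_bp": lambda v: v < 90,          # hypotension
    "heart_rate": lambda v: v > 110,          # tachycardia
    "respiratory_rate": lambda v: v > 21,     # tachypnoea
    "temperature": lambda v: v > 38.5,        # pyrexia
    "conscious_level": lambda v: v != "alert",
}

def at_risk(patient):
    """Judgment rule from the text: classify as 'at risk' of a critical
    event if any cue crosses a clinically significant threshold."""
    return any(test(patient[cue]) for cue, test in THRESHOLDS.items())

# A hypotensive patient triggers the rule even with all other cues normal.
print(at_risk({"systolic_bp": 85, "heart_rate": 100, "respiratory_rate": 18,
               "temperature": 37.0, "conscious_level": "alert"}))  # True
```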

Data collection

After being exposed to each scenario in the high fidelity simulation setting, participants were asked to make a dichotomous judgment (yes/no: at risk of a critical event) and to record their confidence in that judgment (0–100) on a data collection sheet (see Additional file).

**Appendix.** High fidelity clinical simulation scenario.


Ethical approval

Ethical approval for the study was granted by the Health Sciences Research Governance Committee at the University of York, UK. A written informed consent for participation in this study was obtained from each participant.

Data analysis

Confidence calibration statistics were calculated for each participant. For data appropriate for parametric statistical tests, independent two-sample t tests were used to test the significance of differences in mean calibration statistics between the two groups. For non-parametric data, the Wilcoxon rank-sum test was used to test the null hypothesis that the difference in medians for the calibration indices between the two groups was zero. Analysis of variance (ANOVA) was used to test mean differences between groups when more than two independent variables were involved, and interactions between independent variables were examined in ANOVA. We used P < 0.05 as the cut-off for statistical significance. All analyses were performed using Stata version 9 (StataCorp, College Station, TX, USA).
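A sketch of the two group comparisons described above, using SciPy's standard tests on hypothetical, randomly generated scores (the data and seed are illustrative only; the study used Stata, not Python):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-participant over/underconfidence scores, drawn to
# resemble the reported group means and SDs -- illustrative data only.
students = rng.normal(-1.05, 13.41, size=63)
experienced = rng.normal(6.56, 15.68, size=34)

# Independent two-sample t test (indices suitable for parametric tests).
t_stat, p_t = stats.ttest_ind(students, experienced)

# Wilcoxon rank-sum (Mann-Whitney) test (non-parametric indices).
u_stat, p_u = stats.mannwhitneyu(students, experienced)

print(f"t test P = {p_t:.3f}; rank-sum P = {p_u:.3f}")
```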

Confidence curves were plotted. To reduce bias in the curves, we did not plot data for which confidence was less than 50 (7.51% of confidence ratings): so few data points fell in each confidence category below 50 that deriving a curve from them would have biased the calibration curve.

Results

Participants

The table below presents the demographic characteristics of the two groups.

**Demographic characteristics**

| | Experienced nurses (n = 34) | Student nurses (n = 63) |
|---|---|---|
| Age (years), mean (SD) | 36.55 (9.96) | 27.75 (8.22) |
| Clinical experience (years), mean (SD) | 12.15 (9.90) | n/a |
| Gender, male, n (%) | 5 (15%) | 7 (11.1%) |
| Gender, female, n (%) | 28 (85%) | 56 (88.9%) |

n/a: not applicable.

Proportion correct

In the high fidelity clinical simulation setting, no significant difference in proportion correct was found between the student (mean 73.7%; SD 6.88%) and experienced nurse groups (mean 73.5%; SD 9.08%), t(95) = 0.11, P = 0.91. The ease or difficulty of the scenario exerted a statistically significant effect (F(1, 289) = 247.76, P < 0.001) on the proportion of correct judgments. Time pressure did not significantly alter the number of correct judgments (F(1, 289) = 0.00, P = 0.97). Similarly, the proportion correct did not vary significantly between participants (F(96, 289) = 0.74, P = 0.96). No significant interaction was observed between time pressure and scenario difficulty on proportion correct (P = 0.59).

Confidence ratings

Experienced nurses (mean 80.09; SD 10.47) were significantly more confident than students (mean 72.66; SD 10.74), t(95) = −3.28, P = 0.001. Participants’ confidence ratings varied significantly with the difficulty of the cases (F(1, 289) = 133.94, P < 0.001) and between participants (F(96, 289) = 7.70, P < 0.001).

Whilst time pressure had no significant main effect on confidence (F(1, 289) = 1.78, P = 0.18), there was a significant interaction (P = 0.008) between time pressure and task difficulty on confidence. Time pressure thus had different effects on confidence for easy and difficult scenarios: it increased participants’ confidence for easy judgments but decreased their confidence for difficult ones.

Under/overconfidence

Students were underconfident (mean over/underconfidence score −1.05; SD 13.41) and experienced nurses were overconfident (mean over/underconfidence score 6.56; SD 15.68), t (95) = −2.51, P = 0.01.

Calibration & resolution

Experienced nurses were no better calibrated (median 0.048) than students (median 0.048), z = −0.25, P = 0.80. Similarly, students (median 0.198) and experienced nurses (median 0.192) did not differ significantly in their ability to discriminate between their correct and incorrect judgments, z = 0.67, P = 0.51.

Calibration curve analysis

**Figure. Calibration curves for student and experienced nurses.**

**Figure. Calibration curves under no time pressure and time pressure for easy/difficult cases.**

**Figure. Calibration curves of easy and difficult judgments under no time pressure/time pressure.**

Discussion

In capturing and reporting the less than optimal confidence calibration of nurses and students, this study offers both information and a methodology for those developing high fidelity clinical simulations (particularly for the assessment of critical care events). In the high fidelity clinical simulation environment, we observed a clear relationship between nurses’ subjective confidence ratings and the accuracy of their risk assessments: experienced nurses were generally overconfident, while student nurses tended toward underconfidence, and the difference between the two groups was statistically significant (P = 0.01). The findings showed that the subjective probability judgments of experienced nurses and students were subject to systematic bias: they either overestimated or underestimated their own judgmental ability. Our findings replicate the more general psychological picture that people (including decision makers with more experience) are often systematically overconfident with regard to judgment accuracy.

An appropriate level of confidence, given someone’s clinical experience, is one marker of a nurse’s competency, and clinical experience is a significant factor in building confidence in nurses’ judgment.

Given that nurses experience significant amounts of audio and visual information (which is a mix of important signals and ‘noise’) in daily clinical activities, it is reasonable to hypothesise that experienced nurses are more likely to have better calibration performance than student nurses in high fidelity simulated conditions. However, our findings showed no significant difference in calibration and resolution between experienced and student nurses. Our study does not support the hypothesis that confidence calibration performance is a linear function of clinical experience, even in the less than perfect environment of the high fidelity clinical simulation.

Task difficulty and calibration

Our findings further reveal that nurses’ calibration differs with the difficulty of the judgment task they face. Nurses’ calibration and resolution were generally worse on the more difficult and uncertain tasks. By varying task difficulty, a hard-easy effect was seen: nurses were overconfident in hard judgments and underconfident in easy ones.

Similar findings have been documented in psychological studies since the 1970s.

As with confidence miscalibration, nurses’ ability to resolve information altered as a result of task difficulty: their ability to differentiate correct from incorrect judgments decreased as task difficulty increased. This is also consistent with the psychological literature.

Time pressure and calibration

Time pressure had no significant impact on nurses’ confidence, the percentage of correct judgments or their overall calibration. This finding runs counter to some previous studies.

Time pressure increased nurses’ confidence in easy cases and reduced it in difficult ones. This significant interaction revealed that time pressure affected confidence differently for easy and difficult judgments. The phenomenon could be partially explained by the “need for closure” effect.

Recent evidence has shown a significant interaction between the need for closure, judgmental performance and changing judgment task demands (for example, altered task difficulty).

Limitations

A non-random sampling method to enrol nurse participants was a limiting feature of the study. Whilst deliberately sampling experienced and inexperienced nurses allowed us to investigate the mediating effect of clinical experience on confidence calibration, we may have increased the risk of non-representativeness within the subgroup samples. Furthermore, the focus on the judgment task of risk assessment in acute care means that generalisation of the findings to other settings is limited. Further research is required to establish whether the patterns of confidence miscalibration observed in this study are replicated in different clinical contexts.

Conclusions

Nurses were miscalibrated when matching judgment confidence to judgment performance in a high fidelity simulated environment. Simply being clinically experienced did not help: it only increased the probability of being overconfident. The study revealed that time pressure is an important influence on nurses’ judgments: under time pressure, nurses’ confidence rose for easier cases but fell for difficult ones. While time pressure had little effect on nurses’ overall calibration, the observed ‘hard-easy effect’ suggests that nurses’ confidence miscalibration is contingent on task difficulty. These findings highlight the need for nurses to recognise the uncertainty inherent in their judgment tasks and to calibrate their confidence accordingly.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HY and CT were responsible for the study conception and design. HY performed the data collection. HY and MB performed the data analysis. HY was responsible for drafting the manuscript. HY, CT and MB made critical revisions to the paper for important intellectual content. HY and MB provided statistical expertise. All authors read and approved the final manuscript.

Acknowledgment

We thank Dr C. P. Subbe for consenting to use his dataset in constructing our high fidelity case scenarios. This study was funded by the Department of Health Sciences of the University of York, United Kingdom.

Pre-publication history

The pre-publication history for this paper can be accessed here: