Are two readers more reliable than one? A study of upper neck ligament scoring on magnetic resonance images
1 Department of Radiology, Haukeland University Hospital, Jonas Liesvei 65, 5021, Bergen, Norway
2 Department of Surgical Sciences, University of Bergen, Bergen, Norway
BMC Medical Imaging 2013, 13:4 doi:10.1186/1471-2342-13-4
Published: 17 January 2013
Magnetic resonance imaging (MRI) studies typically employ either a single expert or multiple readers in collaboration to evaluate (read) the images. However, no study has examined whether evaluations from multiple readers provide more reliable results than those from a single reader. We examined whether the consistency of image interpretation by a single expert might equal the consistency of combined readings, defined as independent interpretations by two readers in which cases of disagreement were reconciled by consensus.
One expert neuroradiologist and one trained radiology resident independently evaluated 102 MRIs of the upper neck. The signal intensities of the alar and transverse ligaments were scored 0, 1, 2, or 3. Disagreements between the two readers were resolved by consensus. Both readers repeated the grading process after 3–8 months (second evaluation). We used kappa statistics and intraclass correlation coefficients (ICCs) to assess agreement between the initial and second evaluations for each radiologist and for combined determinations. Differences in score prevalence between the initial and second evaluations were evaluated with McNemar’s test.
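As an illustration of the agreement statistic named above, the following is a minimal sketch of Cohen's kappa for two sets of ordinal scores (for example, one reader's initial and second evaluations). It is not the authors' analysis code. The function name `weighted_kappa` is hypothetical, and the linear weighting scheme is an assumption, since the text does not state which weights were used.

```python
def weighted_kappa(first, second, categories=(0, 1, 2, 3), weights="linear"):
    """Cohen's kappa for two ratings of the same cases.

    weights: "linear" (assumed default), "quadratic", or "unweighted".
    The weighting scheme is an assumption; the source does not specify it.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(first)

    # Observed joint proportions: rows = first rating, columns = second rating.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rating occasion.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        # Penalty for rating category i on one occasion and j on the other.
        if weights == "unweighted":
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    obs = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    # Kappa = 1 - (observed weighted disagreement / chance-expected disagreement).
    return 1.0 - obs / exp
```

Passing `weights="unweighted"` with `categories=(0, 1)` gives the ordinary kappa for a dichotomized score, such as the presence or absence of grade 2–3 signal intensity.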
Higher consistency between the initial and second evaluations was obtained with the combined readings than with individual readings for signal intensity scores of ligaments on both the right and left sides of the spine. The weighted kappa ranges were 0.65–0.71 vs. 0.48–0.62 for combined vs. individual scoring, respectively. The combined scores also showed better agreement between evaluations than individual scores for the presence of grade 2–3 signal intensities on any side in a given subject (unweighted kappa 0.69–0.74 vs. 0.52–0.63, respectively). Disagreement between the initial and second evaluations on the prevalence of grades 2–3 was less marked for combined scores than for individual scores (P ≥ 0.039 vs. P ≤ 0.004, respectively). ICCs indicated a more reliable sum score per patient for combined scores (0.74) and both readers’ average scores (0.78) than for individual scores (0.55–0.69).
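The comparison of grade 2–3 prevalence between the two evaluations relies on McNemar's test for paired binary outcomes. The sketch below shows one way to compute it and is not the authors' code. The function name `mcnemar_exact` is hypothetical, and the exact binomial form is an assumption, since the text does not say whether an exact or a chi-square version was used.

```python
from math import comb


def mcnemar_exact(first_positive, second_positive):
    """Two-sided exact McNemar P value for paired yes/no findings.

    first_positive / second_positive: one boolean per subject, indicating
    whether a grade 2-3 signal was recorded at each evaluation.
    Assumption: exact binomial form; the source does not specify the variant.
    """
    if len(first_positive) != len(second_positive):
        raise ValueError("both evaluations must cover the same subjects")

    # Only discordant pairs carry information about a shift in prevalence.
    b = sum(1 for x, y in zip(first_positive, second_positive) if x and not y)
    c = sum(1 for x, y in zip(first_positive, second_positive) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a shift

    # Under the null hypothesis each discordant pair is equally likely to go
    # either way, so min(b, c) follows a Binomial(n, 0.5) distribution.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small P value indicates that positive findings appeared systematically more often at one evaluation than at the other, which is the kind of disagreement on prevalence reported above.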
This study was the first to provide empirical support for the principle that an additional reader can improve the reproducibility of MRI interpretations compared to one expert alone. Furthermore, even a moderately experienced second reader improved the reliability compared to a single expert reader. The implications of this for clinical work require further study.