Diagnostic reasoning is a key competence of physicians. We explored the effects of knowledge, practice and additional clinical information on strategy, redundancy and accuracy of diagnosing a peripheral neurological defect in the hand based on sensory examination.
Using an interactive computer simulation that includes 21 unique cases with seven sensory loss patterns and either concordant, neutral or discordant textual information, 21 3rd year medical students, 21 6th year medical students and 21 senior neurology residents each examined 15 cases over the course of one session. An additional 23 psychology students examined 24 cases over two sessions, 12 cases per session. Subjects also took a seven-item MCQ exam on the seven classical patterns presented visually.
Knowledge of sensory patterns and diagnostic accuracy are highly correlated within groups (R² = 0.64). The total amount of information gathered for incorrect diagnoses is no lower than that for correct diagnoses. Residents require significantly fewer tests than either psychology or 6th year students, who in turn require fewer than the 3rd year students (p < 0.001). The diagnostic accuracy of subjects is affected both by level of training (p < 0.001) and concordance of clinical information (p < 0.001). For discordant cases, refutation testing occurs significantly in 6th year students (p < 0.001) and residents (p < 0.01), but not in psychology or 3rd year students. Conversely, there is a stable 55% excess of confirmatory testing, independent of training or concordance.
Knowledge and practice are both important for diagnostic success. For complex diagnostic situations, reasoning components employing redundancy seem more essential than those using strategy.
Keywords: diagnostic reasoning; clinical decision making; medical education; cognitive psychology; entropy; experimental studies
A major part of the undergraduate medical curriculum is dedicated to teaching the art and science of diagnosing illness and disease. Furthermore, when assessing the clinical competence of medical students, examiners must infer knowledge and reasoning skills from the behavior and the responses of the candidates.
It stands to reason then that medical teachers should possess a thorough understanding of diagnostic reasoning as a "basic science" of medical education. In reality, however, our comprehension of the diagnostic reasoning process is hazy at best.
The present study attempts to explore diagnostic reasoning by analyzing detailed recorded data-gathering behavior of experimental subjects with different levels of expertise in a computer simulation of patients with neurological lesions of the peripheral nervous supply to the hand.
Serious reasoning research started in psychology during the 1950s. It took another 20 years for diagnostic reasoning to become an area of empirical research in medicine [2,3]. At a time when pragmatic medical educators believed in the existence of generic problem-solving skills, diagnostic reasoning research reestablished the primacy of content-specific knowledge.
Initially research evolved along two intertwined threads, which alternately supported and confused each other: the reasoning by (medical) experts and the reasoning by computers. By now, these two fields of research have largely gone their separate ways.
Three factors (Table 1) have determined the type of experimental studies of diagnostic reasoning: firstly the subjects studied, secondly the clinical information provided to subjects both by content and method, and thirdly the products of reasoning subjected to analysis.
Table 1. Factors in empirical research on diagnostic reasoning.
This type of research is very labor intensive and, consequently, expensive. Thus it is difficult to collect sufficient data to reach adequate statistical power based on diagnostic success and process items alone. As a consequence, diagnostic reasoning research leans heavily on recall, introspection and reflection data. It comes, therefore, as no surprise that the theories derived from this research tend towards models of semantic, analytical reasoning [18,19]. The literature is replete with a panoply of cognitive structures, mainly semantic in nature, that are supposed to underlie diagnostic reasoning. The situation may be obscured further by the effect of social desirability bias, which may restrain experimental subjects from admitting to employing less than superlative reasoning strategies.
There is ample evidence [21,22] that analytical, semantic models alone do not fully explain diagnostic reasoning. Research based primarily on semantic recall, introspection and reflection contains blind spots when it comes to unconscious and implicit reasoning processes that are not based on semantic information. Methods focusing on such processes are thus required to look beyond semantic networks.
For further discussion, we define inference or inferential reasoning as: logical, algorithmic, mainly semantic, sequential, propositional, forward and/or backward directed, purposeful, open to reflection and introspection. In contrast, pattern recognition is: holographic, heuristic, mainly perceptual, parallel, redundant, unconscious, probabilistic and intuitive. Inferential reasoning is characterized by strategy, pattern recognition by redundancy.
By "strategy" we mean a purposeful sequence of tests, where the specifics of the next test are selected on the basis of previous tests such as to return maximum new information. "Redundancy" on the other hand expresses the number of tests that fail to provide any new information for inference.
A suitable experimental model should, therefore, involve a sufficient number of perceptual cues to allow for good statistical power. One such candidate is eye-movement scanning in the interpretation of histological slides or x-rays. Unfortunately, the fact that the ocular axis is directed at a certain location on the image does not indicate what is actually seen by the central visual field, or that visual information is indeed being recorded and processed.
We have selected a simple deterministic computer simulation involving the (sensory) neurological examination of the peripheral nervous system in the hand. The collected sequence of responses and coordinates of each sensory stimulus allow statistical inference on the reasoning strategies, be they inferential or based on pattern recognition.
For this experiment we asked ourselves 5 questions:
1. How do subjects pick the specific locations on the hand to be tested (strategy)?
2. How many additional points, in excess of what is required for strict inference, do they test before reaching a diagnosis (redundancy)?
3. How often is the selected diagnosis correct (accuracy)?
4. How are strategy, redundancy and accuracy related to knowledge and practice?
5. How are strategy, redundancy and accuracy affected if subjects receive additional clinical information (symptoms and history) that is concordant, neutral or discordant with respect to the sensory pattern?
The key to answering these questions is the ability to quantify the information revealed by each successive sensory test. The accepted measure of information content is entropy, as introduced by Claude Shannon in 1948 (Appendix A, see 1). Specifically, it indicates the potentially available information not yet revealed by the test sequence. An entropy value of 1.0 indicates that none of the available diagnostic information has yet been revealed and that all diagnostic possibilities are still equally likely. Conversely, an entropy value of 0.0 indicates that all relevant diagnostic information has been revealed and that only one diagnosis remains possible.
Entropy does not attempt to estimate or model the current state of a typical diagnostician's knowledge regarding the case. It simply indicates how much information has been revealed to an ideal inference engine. This allows us to demonstrate the gap that exists between the information content revealed and the information actually used by the diagnostician.
If an individual sensory test ("pin prick") does not change entropy, the test adds no new information – it is redundant. Thus redundancy is defined as the total number of sensory tests in a sequence that did not alter entropy.
The faster a subject accumulates sufficient information to arrive at the correct diagnosis, the more efficient is his diagnostic strategy. Quantitatively, this is indicated by a smaller area under the entropy/number-of-test curve (Figure 6).
Figure 6. Example of one illustrative sequence. For all diagnoses except D4 the plausibilities successively disappear. Correspondingly, the entropy falls from 1.0 to 0.0. After seven tests total certainty exists.
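The area under the entropy/number-of-tests curve can be approximated numerically. The text does not state which integration rule was used; the sketch below assumes the trapezoidal rule with one unit per test, and the two entropy sequences are invented for illustration.

```python
def area_under_entropy(trace):
    """Trapezoidal area under entropy values recorded after 0, 1, 2, ... tests.

    Assumption: consecutive tests are one unit apart on the horizontal axis.
    """
    return sum((a + b) / 2.0 for a, b in zip(trace, trace[1:]))


# Hypothetical entropy sequences: both reach certainty, one faster than the other.
fast = [1.0, 0.5, 0.0]
slow = [1.0, 1.0, 0.5, 0.5, 0.0]

print(area_under_entropy(fast))
print(area_under_entropy(slow))
```

The sequence that reaches zero entropy in fewer tests yields the smaller area, which is the sense in which a smaller area indicates a more efficient strategy.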
A subject's strategy can be strictly inferential (i.e. no redundant tests), in which case it is automatically optimal. Alternatively, subjects could systematically attempt to refute the apparently likeliest hypothesis (Popperian strategy) or, as often happens in reality, they may try to confirm those abnormal findings that support their currently favorite hypothesis.
To determine which information gathering strategy was used, three measures were calculated: (i) how quickly relevant information was collected as expressed by the area under Shannon's entropy as a function of the number of tests; (ii) the specific number of refutations of discordant cues (Refutation matrix, Appendix B, see 2); and (iii) the excess of confirmatory testing (Confirmation matrix, Appendix C, see 3).
Seven familiar neurological patterns of sensory loss in the hand were simulated: C6, C7 and C8 nerve root injury, radial, median and ulnar nerve lesion and poly-neuropathy. Photographs of the dorsal and volar aspects of either the left or right hand were displayed on the screen. With mouse clicks subjects could "test" individual points on the hand. Depending on the location tested and the underlying predetermined diagnosis, one of three verbal responses was returned deterministically in a small pop-up window at the point tested:
• "it feels normal",
• "it feels different", or
• "I can hardly feel it".
The simulation ran as a Java applet within a regular Web page. Subjects were not provided with feedback regarding individual diagnoses during the actual experiment; they did, however, receive detailed feedback after they had completed all the cases.
Each pattern was presented in the context of additional clinical information (symptoms, history and a functional photograph of the hand) concordant, neutral or discordant relative to the sensory pattern. The additional clinical information was relatively bland, providing only subtle suggestions as to the actual diagnosis whether concordant or discordant, although the concordant information was more specific. The neutral items contained no clues. For example, the discordant cases of radial and median nerve deficits had a history vaguely suggestive of a mild cervical injury. Sensory patterns and additional clinical information were repeatedly checked by experienced neurologists for realism.
The experimental subjects consisted of a convenience sample of 23 psychology students, 21 3rd year medical students, 21 6th year medical students (Switzerland has a six year medical curriculum; during the first two years students concentrate on basic sciences) and 21 senior neurology residents. The junior medical students had studied neuroanatomy, but were unfamiliar with the detailed sensory patterns and clinical pictures. They had never practiced sensory examination. Senior medical students had studied sensory patterns, had limited knowledge of clinical pictures and had been introduced to sensory testing. Neurology residents acted as substitute experts, since we were not able to recruit sufficient certified neurologists. The psychology students served as a control group with roughly matching intelligence but no medical education. Psychology students knew neither neuroanatomy nor clinical pictures. Neither had they been taught sensory examination. They were exposed to visually presented maps of the sensory patterns as part of the experimental protocol.
Psychology students participated in two sessions one week apart, the rest in one session each. In their first session psychology students were shown the seven patterns as visual maps together with diagnostic labels for 15 minutes. Otherwise all sessions followed the same sequence: (i) an MCQ test of the seven patterns presented visually as sensory maps; (ii) a single practice case that was not recorded; (iii) a series of 12 cases each for the psychology students and 15 cases each for the rest in a balanced block design. As a result of an oversight, the blocks were not perfectly balanced across the 21 possible combinations (6 × 7 / 2). There were only seven unique blocks, each with three different sequences of cases. Altogether, all 21 cases occurred with equal frequency for each group. We do not, therefore, believe that this error introduced any significant bias.
After each test subjects had the option of picking a diagnosis from a menu and proceeding to the next case. As part of a further study to be reported separately, the test sequence was interrupted automatically at 5, 10, 20 and 40 tests. Subjects were then asked to indicate their current best estimates for the likelihood of each diagnostic hypothesis.
Test coordinates and time since the previous test were stored test by test for the whole case sequence in the client-side Java applet and sent to the Web server as part of the active server page request upon the selection of a specific diagnosis. Data were automatically stored in a relational database (Microsoft Access) keyed to case and subject. After completion of the experimental phase of the study, data were preprocessed by means of a Microsoft Visual Basic program to determine the expected findings at each point tested for the actual diagnosis as well as for the alternative hypotheses. These results were again stored in a relational table. Based on these findings, plausibility, entropy, redundancy, and refutation and confirmation counts were calculated with a second MS-VB program (method described in Appendix A, B and C, see 1, 2 and 3). SPSS was used for the statistical analysis of these derived dependent variables.
For each subject the knowledge of sensory patterns was calculated as the ratio of correctly identified patterns over seven, the total number of patterns in the multiple choice exam. Diagnostic accuracy was calculated for each subject as the ratio of correct diagnoses over total cases processed.
Psychology students participated in two sessions with 12 cases each, thus diagnosing a total of 24 cases. Since only 21 unique cases existed, each of these subjects encountered three of the 21 cases twice. Using a random number generator, either the first or second of these double cases was dropped from further analysis.
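The removal of repeated cases can be sketched as follows. This is an illustration only: the function name, the (session, case) representation and the particular overlap between sessions are assumptions, not the procedure actually implemented in the study software.

```python
import random


def drop_repeated_cases(encounters, rng):
    """Keep exactly one encounter per case, chosen at random where a case occurs twice.

    encounters: list of (session, case_id) tuples for one subject (hypothetical format).
    """
    by_case = {}
    for encounter in encounters:
        by_case.setdefault(encounter[1], []).append(encounter)
    # Choosing one encounter to keep is equivalent to dropping the other at random.
    return [rng.choice(group) for group in by_case.values()]


# Hypothetical subject: cases 1-12 in session 1, cases 10-21 in session 2,
# so cases 10, 11 and 12 are seen twice.
encounters = [(1, c) for c in range(1, 13)] + [(2, c) for c in range(10, 22)]
kept = drop_repeated_cases(encounters, random.Random(0))
print(len(encounters), len(kept))
```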
A total of 1,428 sequences with 27,524 test points were analyzed. In 17 sequences subjects guessed the diagnosis without performing any tests. Residents guessed 12 concordant cases correctly. Students guessed three of the remaining five sequences incorrectly. The two correct guesses were in concordant cases, one by a 3rd and one by a 6th year student.
Group means of MCQ scores and diagnostic accuracy correlate well (Fig. 1). For psychology students the mean values for the initial and follow-up session one week later were calculated separately. Diagnostic accuracy of residents is higher than one would expect from the knowledge of patterns alone.
Figure 1. MCQ score and diagnostic accuracy for the experimental groups. The values for psychology students in their initial session and in the second session, one week later, are plotted separately.
Diagnostic accuracy is examined by ANOVA (Table 2). It is affected significantly both by the level of training and the degree of concordance.
Table 2. Diagnostic Accuracy – Results of ANOVA: Tests of between sequence effects
The diagnostic accuracy of psychology students is not affected by the degree of concordance, while the accuracy of the residents is significantly eroded by discordant information (Fig. 2).
Figure 2. Estimated marginal means of diagnostic accuracy as function of the level of training and the degree of concordance.
Discordant cases form a homogeneous subset against neutral and concordant cases at α = 0.05. The level of training does not separate into homogeneous subsets.
For all four groups of experimental subjects the residual entropy does not differ significantly for cases diagnosed correctly and incorrectly (Fig. 3). The 6th year students, in fact, show borderline increased entropy (lower certainty) for correct diagnoses. Diagnostic errors, therefore, do not appear to be due to insufficient information gathering but rather to flawed reasoning.
Figure 3. Results of unpaired t-tests of residual entropy for correct and incorrect diagnoses.
Area under the entropy curve and redundancy were examined by MANOVA (Table 3).
Table 3. Results of MANOVA: Tests of Between-Sequence Effects
Both level of training and degree of concordance have a significant effect on redundancy. But the area under the entropy curve depends only on the level of training, not the degree of concordance. Post hoc analysis (Scheffé test) shows the area under the entropy curve to split up into two homogeneous subsets: 3rd year medical students versus the rest. Redundancy splits into three homogeneous subsets: 3rd year students, residents, and 6th year medical together with psychology students as a middle group.
With regard to the degree of concordance, redundancy splits into two homogeneous subsets: concordant versus neutral and discordant.
The redundancy of psychology students is not affected by the additional clinical information – they do not recognize its implication (Fig. 4). For the residents, on the other hand, it affects redundancy by almost a factor of three.
Figure 4. Estimated marginal means of redundancy as function of level of training and degree of concordance.
For the area under the entropy curve (strategy), the effect of either the level of training or the degree of concordance is less clear cut (Fig. 5).
Figure 5. Estimated marginal means of area under the entropy curve as function of the level of training and the degree of concordance.
It is obvious, though, that 3rd year medical students show the least evidence of strategy, independent of additional clinical information.
To test whether subjects specifically attempt to refute the diagnostic hypotheses suggested by the additional clinical information in the discordant cases, we employ Popperian analysis (Table 4) as described in Appendix B (see 2).
Table 4. Contingency table analysis of Popperian refutation counts.
Psychology and 3rd year medical students are not affected by the additional discordant clinical data. They don't know enough about the clinical syndromes.
Residents and 6th year students show a significant though small attempt at refuting the clinically suggested diagnoses. The excess of specific refutations is about 11% for residents and 17% for 6th year students respectively (Common Odds Ratio).
There is a significant difference in the increase of redundancy between residents and 6th year students: χ² = 17.24; p < 0.001; C.O.R. = 1.24. In other words, in the presence of discordant information 6th year students seem to use more strategy while residents rely more on redundancy. This could be explained by the "intermediacy effect".
Finally, we looked for evidence of confirmatory testing (Appendix C, see 3). In fact, confirmatory testing seems to be a stable feature, independent of level of training or degree of concordance (Table 5).
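The "Common Odds Ratio" (C.O.R.) quoted here is not defined in the text. One widely used estimator with that name is the Mantel-Haenszel common odds ratio, which pools 2 × 2 tables across strata. The sketch below assumes this estimator; the layout of each table and all counts are hypothetical and are not data from this study.

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio over a list of 2 x 2 tables.

    Each table is (a, b, c, d) with rows and columns laid out as
        a  b
        c  d
    Hypothetical reading for this study: rows = tested cell is favored / not favored
    by the discordant information, columns = discordant case / other case.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Made-up counts for two strata, for illustration only.
strata = [(30, 20, 25, 25), (40, 35, 30, 45)]
print(round(mantel_haenszel_odds_ratio(strata), 2))
```

A value above 1.0 would indicate that favored cells are tested more often in discordant cases than expected, pooled over the strata.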
Table 5. Chi-square, significance and estimated ratio of actual over expected confirmatory tests.
The tendency to selectively confirm expected hypotheses rather than testing randomly or refuting alternative hypotheses appears inherent in this diagnostic reasoning experiment.
Discussion and Conclusions
Diagnostic accuracy, strategy and redundancy depend primarily on the knowledge of sensory patterns and associated syndromes. The effect of knowledge on accuracy and redundancy appears to be stronger than on strategy. In fact, effective data-gathering strategies seem to play a minor role. Even where appropriate, little refutation of alternative hypotheses occurs. Just the opposite: confirmatory testing seems to be dominant.
In addition, both accuracy and redundancy, but not strategy, appear to depend on practice independently of knowledge.
These results appear somewhat counterintuitive. Experts should have vastly better problem-solving strategies than novices. True, in the real world, experts also have an edge on knowledge. The knowledge spread in our experiment was insufficient to demonstrate that aspect.
There might be another explanation, however. In our experiment, reaching a diagnosis by inference requires not only that the seven diagnostic hypotheses be present in short-term memory, but also that the roughly seven tests in strategically placed locations and their combinations be available at all times. In other words, for purely inferential diagnostic reasoning one needs to operate on approximately 49 items or 5.6 bits of information. As George A. Miller has shown, the capacity of short-term memory is only about seven items or 2.8 bits of information. The scope of short-term memory, therefore, would appear insufficient to support pure inference. Short of using memory substitutes, such as paper and pencil, the only alternative is to resort to what Miller refers to as "recoding", an implicit reasoning strategy. This is a hypothesis that requires further testing.
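The bit values in this argument follow from taking the base-2 logarithm of the number of items, as this short check illustrates (variable names are chosen for illustration):

```python
import math

hypotheses = 7                        # diagnostic hypotheses to hold in mind
tests = 7                             # roughly seven strategically placed tests
items = hypotheses * tests            # 49 hypothesis/test combinations

bits_for_inference = math.log2(items)    # about 5.6 bits
bits_short_term = math.log2(7)           # about 2.8 bits for seven items

print(round(bits_for_inference, 1), round(bits_short_term, 1))
```

Because log2(49) = 2 × log2(7), the inferential task requires exactly twice the number of bits that seven items occupy.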
It remains surprising, however, that the psychology students were able to set up an efficient recoding scheme after only 15 minutes' training that allows easy shifting from overt to latent pattern recognition.
The reported findings may also have implications for teaching and assessment. If the rate-limiting factor for inference is the number of items that have to be kept in short-term memory, teachers can assist learners by constructing diagnostic trees that involve only two or three branches at each decision point, rather than long lists of differential diagnoses. Such cognitive structures correspond to Bordage's key features or Mandin's schemes.
In the assessment of diagnostic reasoning, redundancy of requested information appears as a second independent, sensitive measure of competence besides the accuracy of the diagnosis.
RB conceived the experiment, wrote the initial Java applet, analyzed the results and wrote the paper. DH and SF refined the software, designed and supervised the experimental details. MH and SF prepared the cases, recruited subjects and collected the data.
Table 6. Popperian refutation matrix for the seven discordant cases. The '+' indicates the cells favored by the discordant information, whereas the '-' designates non-favored cells.
This study has been made possible by grant #1153-055603 of the Swiss National Science Foundation (SNF). We wish to thank R. Hofer for statistical advice and P. Tobler for assisting in the pilot study. We are grateful to Ch. Hess, H.P. Mattle and M. Mumenthaler for critically reviewing cases and sensory patterns.
Annual Rev Psychol 1972, 23:105-130.
J Med Educ 1972, 47:85-92.
Arch Neurol 1972, 26:273-277.
Clin Invest Med 1982, 5(1):49-55.
Med Educ 1990, 24(5):413-425.
Invest Radiol 1993, 28(3):214-217.
Acad Med 1994, 69(5):428-429.
Acad Med 1994, 69(10 Suppl):S34-S36.
Am J Dis Child 1989, 143(5):575-579.
J Exp Psychol Learn Mem Cogn 1989, 5:1166-1174.
Med Educ 1986, 20(1):3-9.
Acad Med 1990, 65(10):611-621.
Acad Med 1991, 66(9 Suppl):S70-S72.
Acad Med 1990, 65(10):611-621.
Acad Med 1996, 71:555-561.
Acad Med 1999, 74(7):791-794.
Cognitive Science 1989, 13:1-49.
Psychol Rev 1956, 63:81-97.
Acad Med 1994, 69(11):883-885.
Acad Med 1997, 72(3):173-179.