Patient records contain valuable information regarding explanation of diagnosis, progression of disease, prescription and/or effectiveness of treatment, and more. Automatic recognition of clinically important concepts and the identification of relationships between those concepts in patient records are preliminary steps for many important applications in medical informatics, ranging from quality of care to hypothesis generation.
In this work we describe an approach that facilitates the automatic recognition of eight relationships defined between medical problems, treatments and tests. Unlike the traditional bag-of-words representation, in this work, we represent a relationship with a scheme of five distinct context-blocks determined by the position of concepts in the text. As a preliminary step to relationship recognition, and in order to provide an end-to-end system, we also addressed the automatic extraction of medical problems, treatments and tests. Our approach combined the outcome of a statistical model for concept recognition and simple natural language processing features in a conditional random fields model. A set of 826 patient records from the 4th i2b2 challenge was used for training and evaluating the system.
Results show that our concept recognition system achieved an F-measure of 0.870 for exact span concept detection. Moreover, the context-block representation of relationships was more successful (F-measure = 0.775) at identifying relationships than bag-of-words (F-measure = 0.402). Most importantly, the performance of the end-to-end system of relationship extraction using automatically extracted concepts (F-measure = 0.704) was comparable to that obtained using manually annotated concepts (F-measure = 0.711), and their difference was not statistically significant.
We extracted important clinical relationships from text in an automated manner, starting with concept recognition, and ending with relationship identification. The advantage of the context-blocks representation scheme was the correct management of word position information, which may be critical in identifying certain relationships. Our results may serve as a benchmark for comparison to other systems developed on i2b2 challenge data. Finally, our system may serve as a preliminary step for other discovery tasks in medical informatics.
The era of Electronic Health Records (EHRs) brings the need for automatic recognition of clinical concepts and of the relationships that tie them together. Patient records contain comprehensive accounts of the patients’ visits to the hospital. Such information can be invaluable for pharmaco-vigilance, detection of adverse effects, comparative effectiveness studies, etc. Patient records contain a wealth of information about what the patient’s discomfort is, what medical measures are taken and what procedures are performed with what results. These documents differ from general biomedical text in two respects. First, clinical documents are of a sensitive nature. Automatically removing personal information from these documents is a research problem in itself [1,2]. For this reason, obtaining the amounts of de-identified clinical data required to build a robust machine learning model is difficult. Second, patient records and doctor notes are often not well-structured documents. The syntactic structure, presentation style and terminology used in EHRs differ significantly from those used in a published research paper. The language is often similar to daily discourse, and it contains many (mostly non-standard) abbreviations. For this reason, obtaining real clinical data instead of possible synthetic variations is important in order to build reliable systems.
The i2b2 challenges are one of the most important recent community efforts to develop scalable informatics frameworks that will enable scientists to use existing clinical data for discovery research. The first step in the automatic processing of a clinical document is the recognition of text phrases which refer to clinically relevant concepts such as medical problems, treatments and tests. Medical problems are observations about the patient’s clinical health. Treatments include procedures, or medications administered to patients. Tests include lab procedures or measurements prescribed to patients. The second step is the identification of relationships among the recognized concepts. A relationship between two clinical concepts identifies how problems relate to treatments, tests and other medical problems in text. In this study, we focus on the identification of relationships as defined in the 4th i2b2 challenge, and we build an end-to-end system that first performs concept recognition and then identifies relationships. Figure 1 shows a diagram of the possible relationships found in clinical documents. Table 1 also lists these relationships with mappings to i2b2 notation, and example sentences from annotated patient records.
Figure 1. The diagram of clinical relationships. Concepts appear in blue boxes, while the relationships between them appear in coloured diamonds.
Table 1. Examples of relationships between medical concepts in patient records.
Extraction of relevant clinical concepts is an ongoing problem in the clinical domain, and several tools have been developed for this purpose. For example, MedLEE maps clinical text to Unified Medical Language System© (UMLS©) concepts, whereas MedEX focuses on extracting medication information from discharge summaries. Machine learning methods can also be used successfully for concept extraction from biomedical text. For instance, Hidden Markov Models (HMM) and Conditional Random Fields (CRF) were compared for the extraction of proteins and cell types. Machine learning techniques have also been shown to be effective in mining patient smoking and medication status [7,8] from unstructured patient records.
Extraction of relationships between biomedical concepts has also produced a significant body of literature in the biomedical domain [9-14]. However, this research mostly addresses the extraction of relationships between biological entities (e.g. protein-protein interaction). In contrast, fewer studies are found on the extraction of relationships between diseases, symptoms, and medication in patient records. The proposed methods are typically based on co-occurrence statistics, semantic interpretation, and machine learning. For example, Chen et al. proposed an automatic method for extracting disease-drug pairs which applied MedLEE for identifying associations. More recently, similar methods were developed for the identification of association between symptoms and diseases  and the detection of adverse drug effects [17,18].
Relationships between clinical entities can be identified with co-occurrence based methods. However, such methods are unable to further characterize specific relations. A significant research effort addressing the extraction of relationships between biomedical entities has resulted in the development of the semantic representation program SemRep . SemRep exploits linguistic analysis of biomedical text and domain knowledge in the UMLS. This tool achieved competitive performance in [20,21] for extracting drug-disease treatment relationships from biomedical text. However, the set of relationships that can be extracted by SemRep does not match those of our dataset. Similarly, the set of relationships defined in  and  do not match those in our dataset, so that a direct comparison is not possible.
In this study we address the relationship identification task, expanding from , in an end-to-end system, starting with the recognition of concept phrases and then predicting possible relationships between two concepts found in the same sentence. This problem is very similar to the focus of the 4th i2b2 challenge, and our work is developed on the same dataset and annotations. The system we describe in this work was inspired by our own participation in this challenge and should be directly comparable to other work built on the same data.
We developed a context-based scheme of representing relationships between two different kinds of entities. In this study, the relationship between two concepts is defined as a structure of five context-blocks (see Fig. 2). We characterized each of these five different blocks, and built a machine learning model that makes an informed decision based on the present characteristics. In addition, in order to make this scheme easily applicable for real document processing, we implemented the automatic extraction of concepts using a CRF based model. We built a highly accurate concept recognizer which we used to predict concepts intended for relationship extraction in the test set of 477 clinical documents. The performance of the relationship extraction system using automatically extracted concepts was comparable to that obtained using the manually annotated concepts, and the difference was not statistically significant.
Figure 2. Relationship representation between two concepts as five context-blocks. Five context-blocks: introductory block (the set of words from the beginning of the sentence to the occurrence of the first concept), 1st concept block (the set of words that comprise the first concept, not necessarily first in the sentence), connective block (the set of words that tie the two concepts in the relationship), 2nd concept block (the set of words that comprise the second concept), and conclusive block (the set of words from the 2nd concept to the end of the sentence).
In order for a relationship to be identified between two co-occurring concepts, those two concepts need to be identified first. A reliable concept recognizer is a prerequisite for the relationship identification to take place. Therefore, we start the methods description first with the description of the data and the concept identification procedure. Next, we discuss the features characterizing both concepts and relationships between them, the relationship representation model and relationship identification procedure, as well as the evaluation measures for comparison.
Our participation in the 4th i2b2 challenge  allowed us to have access to a corpus of fully de-identified medical records manually annotated for concept, assertion, and relationship information. The training data contained discharge summaries from these different hospitals: Partners Healthcare (97 documents), Beth Israel Deaconess Medical Center (73 documents) and University of Pittsburgh Medical Center (98 documents). In addition, a set of progress notes from the University of Pittsburgh Medical Center (81 documents) was also included. The test data contained 477 records, also coming from the same sources. Our particular interest involved the classification of relations between medical problems, tests and treatments.
The prerequisite step of relationship identification is the correct identification of the concepts participating in the relationship. One particular challenge of our relationship representation scheme is that the boundaries of the related concepts need to be precisely specified (exact span). Here we describe our concept identification method.
Concept identification features
The difference between any two implementations of a named entity recognition task lies in the set of features that are used to represent the entity phrase of interest. These features are traditionally divided into the following groups:
• Word features
• Context features
• Semantic features derived from other sources.
In accordance with those groups we represented each token of a given sentence with the following features: For the word features group we used the current token, its part of speech tag as identified by MEDPOST , and a surface feature that identified whether the current token was a number, a stop-word, or a punctuation symbol. For the context features group, we listed the two tokens before and the two tokens after the reference token, when available, as well as their respective part of speech tags, and their surface features. From the semantic features group, we used the priority model prediction class of each token.
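The token representation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the feature keys, the tiny stop-word list, and the rule used to detect numbers are all assumptions, and the part-of-speech tags and priority model classes are taken as precomputed inputs rather than produced by MEDPOST or the priority model.

```python
import string

# Illustrative stop-word list only; the list actually used is not given in the text.
STOPWORDS = {"the", "a", "of", "for", "no", "with", "and"}


def surface_class(token):
    """Coarse surface feature: number, stop-word, punctuation, or other."""
    if token.replace(".", "", 1).isdigit():
        return "NUM"
    if token.lower() in STOPWORDS:
        return "STOP"
    if all(ch in string.punctuation for ch in token):
        return "PUNCT"
    return "OTHER"


def token_features(tokens, pos_tags, pm_classes, i, window=2):
    """Features for token i: word group, context group (+/- window), semantic group."""
    feats = {
        "w0": tokens[i],              # word feature: the current token
        "pos0": pos_tags[i],          # its part-of-speech tag (assumed precomputed)
        "surf0": surface_class(tokens[i]),
        "pm0": pm_classes[i],         # priority model class (assumed precomputed)
    }
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or not 0 <= j < len(tokens):
            continue                  # skip the token itself and out-of-sentence positions
        feats[f"w{off:+d}"] = tokens[j]
        feats[f"pos{off:+d}"] = pos_tags[j]
        feats[f"surf{off:+d}"] = surface_class(tokens[j])
    return feats
```

Context features are simply omitted when a neighbouring position falls outside the sentence, which mirrors the "when available" condition in the description.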
The priority model is a statistical method which has been successfully used to identify gene and disease names in biomedical text strings [26,29]. This method, given a phrase representing a named entity, assumes that a word to the right is more likely to be the head word of the phrase (the word most likely to determine the nature of the entity) than a word to the left. The model trains several variable order Markov Models, given a set of strings as training data, for sequences of tokens which represent concept phrases, versus others. While the priority model performance is high on the classification task for all three types of concepts of interest in patient records (problems, tests and treatments), a significant issue with using this approach in a real production setting is the identification of the boundaries of concept phrases. That is why we decided to incorporate its output as a feature in our concept identification system.
Concept identification method
Due to characteristics of the clinical text we decided to use Conditional Random Fields (CRF)  trained with the data that we had available. CRFs have been shown to provide state-of-the-art performance in the natural language processing community for named entity recognition. We used the MALLET toolkit  to implement concept recognition models for each of our concept classes: medical problem, treatment and test. A common representation for an HMM or CRF based entity recognition model uses three tags (BIO) to label the tokens of a sequence—B to represent the first token of a concept phrase, I to represent any following token part of the concept phrase, and O to represent any other token, outside of the phrase of interest. Due to the modest amount of training data available, we simplified this representation to two tags (IO): I representing any token part of the concept phrase, and O representing any token outside of the concept phrase. We used the above features to train our model. Next, for any unknown piece of text, our system first extracted the features of each token (word, context and semantic), and then it decided whether it was part of the entity phrase of interest (I) or not (O).
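To make the tagging scheme concrete, the sketch below converts BIO labels to the simplified IO labels and then recovers concept spans from an IO sequence. The function names are hypothetical, and the decoding rule (every maximal run of I tokens is one concept) is an assumption about how spans would be read back from IO output.

```python
def bio_to_io(tags):
    """Collapse B and I into a single inside tag; O stays O."""
    return ["I" if t in ("B", "I") else "O" for t in tags]


def io_to_spans(tags):
    """Return (start, end) token spans, end exclusive, for maximal runs of 'I'."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "I" and start is None:
            start = i                      # a new concept phrase begins
        elif tag != "I" and start is not None:
            spans.append((start, i))       # the phrase ended at the previous token
            start = None
    if start is not None:                  # phrase runs to the end of the sentence
        spans.append((start, len(tags)))
    return spans
```

One consequence of the two-tag scheme is visible here: two concepts of the same class that are directly adjacent, which BIO would separate with a second B tag, are decoded as a single span under IO.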
The goal of relationship identification is to determine whether two concepts are related, and the type of their relationship. Table 2 summarizes all annotations for the specified relations. These relationships (from our gold standard data) connected concepts appearing in the same sentence. In order to build our machine learning model, first we built the negative examples.
Table 2. Data description.
To build the negative dataset, we employed the following procedure: We scanned the text sentence by sentence. For any given sentence, for each identified pair of concepts, we counted all the possible relationships that could exist between them. The number of relationship candidates that could be generated in this manner is limited. First, there were a limited number of concepts that could be found in a sentence. Second, for any pair of concepts of type (problem, problem), there was only one relationship candidate: relates. For any pair of concepts of type (problem, test), there were two relationship candidates: conducted and reveals. For any pair of concepts of type (problem, treatment), there were five relationship candidates: improves, worsens, causes, given and not given. After extracting all pairs of concepts in this manner, we found that 14% of the relationship candidates were true (found in the gold standard annotations).
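The candidate generation procedure can be sketched as below. The relationship names per concept-type pair come from the description above; the data layout (a list of `(phrase, concept_type)` tuples per sentence) and the function name are assumptions made for illustration.

```python
from itertools import combinations

# Relationship candidates per unordered pair of concept types, as listed in the text.
CANDIDATES = {
    ("problem", "problem"): ["relates"],
    ("problem", "test"): ["conducted", "reveals"],
    ("problem", "treatment"): ["improves", "worsens", "causes", "given", "not given"],
}


def relation_candidates(concepts):
    """concepts: list of (phrase, concept_type) found in one sentence.

    Returns one (phrase_a, phrase_b, relation) triple per candidate.
    """
    out = []
    for a, b in combinations(concepts, 2):
        key = tuple(sorted((a[1], b[1])))      # type pair, order-independent
        for rel in CANDIDATES.get(key, []):    # pairs without a problem yield nothing
            out.append((a[0], b[0], rel))
    return out
```

Candidates that match a gold standard annotation become positive examples for that relationship's model, and all remaining candidates serve as negatives.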
Next, we describe the features that we selected to represent these relationships. These features captured the lexical information, information about the type of concept of each medical entity, and the sentence-context information about the pair of medical concepts.
Relationship identification features
We considered all unique token features extracted from our corpus. We experimented both with word stemming and stop-word elimination . Also, for each concept phrase, we used MetaMap  to identify all matching UMLS Concept Unique Identifiers (CUI features) and their corresponding Semantic Type categories (SemTyp features) similarly to our previous work . Finally, for each concept, we also used the provided assertion category as annotated in the data: absent, conditional, present, hypothetical, possible, and associated-with-someone-else. All features had binary values, 1 if present and 0 if absent.
Relationship representation scheme
We represent a relationship between two concepts as a schema of five, not necessarily consecutive, context-blocks, as shown in Figure 2. This structure—Introductory Block, 1st Concept Block, Connective Block, 2nd Concept Block and Conclusive Block—is naturally marked by the location of the two concepts in the sentence. As an operational decision, the introductory and conclusive blocks contained a maximum of five words. We extracted features to represent each context-block, which all-combined, represented the relationship. This was contrasted with the Naïve Bag-of-Features approach, which used all the available features without taking into account the context-block that they were identified from.
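A minimal sketch of the five-block split follows. It assumes token-offset spans with an exclusive end, non-overlapping concepts, and that the five-word limit keeps the words nearest to the concepts; the text states the limit but not which words are retained, so that choice is an assumption, as are the function and key names.

```python
def context_blocks(tokens, span1, span2, max_outer=5):
    """Split a tokenized sentence into five context-blocks.

    span1 / span2: (start, end) token offsets of the 1st and 2nd concept of the
    relationship, end exclusive. The 1st concept need not come first in the sentence.
    """
    left, right = sorted([span1, span2])       # order the spans by sentence position
    return {
        # at most max_outer words immediately before the leftmost concept
        "introductory": tokens[max(0, left[0] - max_outer):left[0]],
        "concept1": tokens[span1[0]:span1[1]],
        # every word between the two concepts
        "connective": tokens[left[1]:right[0]],
        "concept2": tokens[span2[0]:span2[1]],
        # at most max_outer words immediately after the rightmost concept
        "conclusive": tokens[right[1]:right[1] + max_outer],
    }
```

For example, with the sentence "The patient was given aspirin for severe chest pain yesterday ." and the problem as the 1st concept, the connective block is just the word "for".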
Relationship identification method
For each relationship, we built a machine learning model that distinguished the true relationships (gold standard) from the rest of the candidates (negative examples). The classification algorithm of choice was a linear SVM. We employed a five-fold cross-validation setting with balanced positive and negative instances for each fold. Our approach was to train our SVM learner repeatedly and eliminate a fixed number of lowest-weight features after each step. Then a new model was learned on the remaining features. We reduced the number of features 500 at a time, until the system’s performance no longer improved. Finally, given a test sentence annotated for concept and assertion, all relevant relationship models were tested for each pair of concepts. Next, each score result was converted to a probability value. The two concepts were predicted to have the relationship whose score provided the highest probability. If none of the relationship models provided a probability value higher than 0.5, then the two concepts were not predicted to be related.
Each relationship model was implemented as a context-blocks model, where all available features were organized according to the specific context blocks they appeared in. Each feature had a binary value 0/1 depending on being absent/present in the context block of interest. This representation was contrasted with the naive bag-of-features approach, which used all the available features without distinguishing their position in the sentence. This served as our baseline.
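The iterative feature elimination loop can be outlined as follows. This is a sketch under stated assumptions rather than the authors' code: `train_and_score` is a hypothetical callback standing in for training a linear SVM with cross-validation and returning its F-measure together with the learned feature weights, "lowest-weight" is interpreted as lowest absolute weight, and the loop stops at the first step that fails to improve the F-measure.

```python
def eliminate_features(features, train_and_score, step=500):
    """Repeatedly drop the `step` lowest-weight features while the F-measure improves.

    features: list of feature names.
    train_and_score: hypothetical callback; given a feature list it returns
        (f_measure, {feature: weight}) for a model trained on those features.
    Returns (selected_features, best_f_measure).
    """
    best_feats = list(features)
    best_f, weights = train_and_score(best_feats)
    current = best_feats
    while len(current) > step:
        # rank by absolute weight (an assumption) and drop the `step` smallest
        ranked = sorted(current, key=lambda name: abs(weights[name]))
        candidate = ranked[step:]
        f, new_weights = train_and_score(candidate)
        if f <= best_f:
            break                              # no improvement: keep the previous model
        best_f, best_feats = f, candidate
        current, weights = candidate, new_weights
    return best_feats, best_f
```

The callback keeps the sketch independent of any particular SVM library; in practice it would wrap whichever linear SVM implementation is available.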
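The difference between the two representations can be illustrated with a short sketch. The feature naming (`block=token`), the lowercasing, and the sparse dictionary encoding (absent features are implicitly 0) are assumptions made for illustration.

```python
def block_features(blocks):
    """Context-blocks encoding: a binary feature per (block, token) pair."""
    return {f"{name}={tok.lower()}": 1
            for name, toks in blocks.items() for tok in toks}


def bag_features(blocks):
    """Naive bag-of-features encoding: the same tokens, block identity discarded."""
    return {tok.lower(): 1 for toks in blocks.values() for tok in toks}


# Two candidates that use the same words in different blocks.
a = {"introductory": ["without"], "concept1": ["fever"],
     "connective": ["on"], "concept2": ["exam"], "conclusive": []}
b = {"introductory": ["on"], "concept1": ["fever"],
     "connective": ["without"], "concept2": ["exam"], "conclusive": []}
```

Here `bag_features(a)` and `bag_features(b)` are identical, while `block_features(a)` and `block_features(b)` differ, so only the context-blocks encoding lets a linear model assign the same word different weights depending on where it occurs.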
We used precision, recall, and F-measure to measure and evaluate the performance of our systems. Precision measures the percentage of correct answers in the result set relative to its complete size, and recall measures the percentage of correct answers relative to all true results (gold standard). F-measure is a metric that reflects the overall quality of recall and precision as a weighted harmonic mean on the complete result set,
\[ F_\beta = \frac{(1 + \beta^2)\, p\, r}{\beta^2\, p + r} \]
where p is precision, r is recall, and β measures the trade-off between precision and recall. In this study, we chose β = 1, as it is commonly chosen for a balanced F-measure.
For all evaluations on the training data, these values were averaged over the five folds of cross validation. For a system balanced both in precision and in recall, we used the F-measure results to select the best models. When the same F-measure was obtained, we broke ties by choosing the model with the smaller number of features. We performed per-relationship and per-record evaluation of our system on the training data. The first measured the system performance on the relationship candidates, regardless of the patient records they were collected from. The latter measured the system performance on each relationship type, first, for each patient’s record, and next, averaged the results over all patient records. In this case, for each relationship type, only the records that had at least one annotated positive example of that relationship were considered.
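The two evaluation modes can be sketched as follows, using the standard F-measure with β = 1 by default. The input layout (true positive, false positive and false negative counts per record for one relationship type) and the function names are assumptions; a record is treated as having an annotated positive example when its true positives plus false negatives exceed zero.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positive, false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def per_relationship_f(counts):
    """Pool all candidates of one relationship type, ignoring record boundaries.

    counts: {record_id: (tp, fp, fn)}.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_measure(tp, fp, fn)


def per_record_f(counts):
    """Average the per-record F-measure over records with a gold positive example."""
    scores = [f_measure(*c) for c in counts.values() if c[0] + c[2] > 0]
    return sum(scores) / len(scores) if scores else 0.0
```

The two numbers can differ noticeably: pooling lets records with many candidates dominate, whereas the per-record average gives each qualifying record equal weight.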
For all evaluations on the test data, we measured precision, recall and F-measure, using the 4th i2b2 evaluation package in order to have fair comparison between other systems tested on the same dataset.
We conducted a wide range of experiments to identify the medical concepts in patient records and the relationships between them. Here we present a summary of our data analysis, concept extraction and relationship identification models.
Our training dataset contained 349 fully de-identified medical records from four different hospitals. This corpus was manually annotated for concept, assertion type and relationship information at the sentence level. Clinically relevant concepts are medical problems, treatments and tests. Our training dataset contained more than 27,000 annotated concept instances, comprising 7,073 unique medical problem phrases, 4,844 unique treatment phrases and 4,608 unique test phrases. Table 3 shows sample annotated sentences from the corpus. In addition, each medical problem was annotated with one of the following assertion categories: absent, conditional, present, hypothetical, possible, and associated-with-someone-else. Examples of assertion categories are given in Table 4. Lastly, there were eight different relationship categories between medical problems, treatments and tests. These clinical relationships are illustrated in Figure 1 and detailed with examples in Table 1.
Table 3. Examples of medical concepts in patient records.
Table 4. Examples of assertion categories of problem concepts in patient records.
Table 2 shows the number of examples found in training data for each relationship type. These examples correspond to the corpus annotations. We created the negative examples for our machine learning model using all the pairs of annotated concepts for all the sentences in the corpus. Each pair of (problem, problem) concepts contributed one candidate to the relates relationship, each pair of (problem, test) concepts contributed one candidate for each of the conducted and reveals relationships, and each pair of (problem, treatment) concepts contributed one candidate for each of improves, worsens, causes, given and not given relationships.
The set of documents used to test the systems at the end of the 4th i2b2 challenge consisted of an additional set of 477 discharge summaries, which were accordingly provided labelled and annotated. The number of annotated relationships in the test data is also given in Table 2. These numbers are used to compute the overall weighted average of our system’s performance.
Our results of concept identification for the three medical concept classes are listed in Table 5. These results are expressed in terms of precision, recall and F-measure and measurements are produced for exact matching of the phrase to the annotated phrase, and partial or inexact matching to the annotated concept phrase. They are computed using the i2b2 evaluation tool for compatible comparison with other systems. Table 5 also contrasts our results for concept extraction with the best reported result at the 4th i2b2 challenge (Ozlem Uzuner, personal communication).
Table 5. Concept identification
We approached the relationship identification task as a classification task. Like other participants of the i2b2 challenge, we utilized support vector machines, and we modeled each relationship separately. Unlike other approaches, however, we conceptualized relationships as a composite structure of consecutive context blocks.
Context-blocks relationship model performs best
Table 6 shows detailed evaluation results for the relates relationship identification using the string matching model, the naïve bag-of-features model and the context-blocks relationship representation model. In addition, we experimented with other features for each of the Concept blocks, such as assertion category, and mappings to UMLS Concept Identifiers (CUI) and UMLS Semantic Types (SemTyp). Our exploration of feature space in building a relationship identification model is also shown in Table 6.
Table 6. Performance evaluation for the relates relationship, using string matching and SVM models.
Assertions, Concept identifiers and Semantic Types are important for different relationships
Table 7 shows performance evaluation for all eight clinical relationships. For each relationship, we used the context-blocks representation to identify the best model combining the word features with the assertion, CUI and SemTyp features. We selected the best model based on the F-measure values. These results illustrate that different relationships benefited from different additional concept features. In addition, we used the same features in the non-context-blocks setting, or naive bag-of-features, with the same SVM classifier, and those results are also listed in Table 7.
Table 7. Performance evaluation for the best models of all relationships
Feature selection refined relationship identification
We applied the SVM iterative feature selection to each context-blocks relationship model selected in Table 7. After feature selection we identified 1000 features for each relationship. Table 8 presents the F-measures obtained, both before and after feature selection, for each relationship using five-fold cross validation. Metrics are computed using both per-relationship and per-record evaluation.
Table 8. Per-relationship and per-record F-measures computed prior to and after feature selection.
Context-blocks model is important for relationship identification
We studied the feature composition of the selected models for each relationship category. We found that specific words were selected in specific context blocks. Consider, for example, the conducted and reveals relationships, as illustrated in Figure 3. The word “revealed” was weighted positively in the Connective block of the reveals relationship, but it was weighted negatively in the Connective block of the conducted relationship. Stop-words were also highly weighted features, both positively and negatively, in all relationship models.
Figure 3. Comparison between features that represent relationship conducted (Test is Conducted for Medical Problem) and reveals (Test Reveals Medical Problem). The sentence blocks are shown sequentially for conducted on the right, and for reveals on the left. Each sentence block is named, and the positively selected features are highlighted in the green block, while the negatively weighted features are highlighted in the red block. From this diagram, we can see that some features which are weighted highly positive for one relationship are weighted, in fact, negatively for the other.
Relationship identification model robust after concept extraction
One of the main goals of this study was to demonstrate that a realistic application setting is possible. In a realistic application test, one would start with the concept extraction, and precisely identify the concept boundaries, so that relationship identification may be performed. In order to address this, we ran our concept extraction model on the i2b2 test dataset, and marked the predicted concept phrases and their type. Next, for each test sentence in the test records, all relevant relationship models were tested for each pair of concepts. In the end, each score result was converted to a probability value. Two concepts were predicted to have a relationship, if the probability was higher than the threshold (0.5). The relationship type, however, was assigned to the one with highest probability value amongst competing relationships.
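The decision rule for a pair of concepts can be written compactly. In this sketch the per-relationship probabilities are taken as given; how SVM scores were converted to probabilities is not specified in the text, and the function name is hypothetical.

```python
def predict_relation(probabilities, threshold=0.5):
    """Pick the relationship with the highest probability, if it exceeds the threshold.

    probabilities: {relation_name: probability} from the competing relationship
    models for one concept pair. Returns None when no relationship is predicted.
    """
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] > threshold else None
```

For a (problem, test) pair with probabilities 0.62 for conducted and 0.81 for reveals, the rule returns reveals; if both fall at or below 0.5, the pair is left unrelated.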
Table 9 lists these results. The first two columns show the results of the relationship identification models using the annotated concepts of the test data, and the last two columns show the results using the automatically extracted concepts. The first and third columns show results when the original model is applied, and the second and fourth columns show results when the feature-selection refined model is applied. The results of columns three and four are statistically different (T-test, p=0.005), but the results presented in column four are not statistically different from those presented in the first two columns. Overall averages are computed by weighting each F-measure by the number of examples of that particular relationship type (shown in Table 2).
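The weighted overall average amounts to the following computation. The function name and the numbers in the usage note are made up for illustration and are not taken from Table 2 or Table 9.

```python
def weighted_average_f(f_by_relation, n_by_relation):
    """Average per-relationship F-measures, weighted by the number of examples."""
    total = sum(n_by_relation[rel] for rel in f_by_relation)
    weighted = sum(f_by_relation[rel] * n_by_relation[rel] for rel in f_by_relation)
    return weighted / total
```

With illustrative values of F = 0.8 over 300 examples and F = 0.5 over 100 examples, the weighted average is (0.8 × 300 + 0.5 × 100) / 400 = 0.725, so frequent relationship types dominate the overall score.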
Table 9. F-measures computed prior to and after feature selection for the test dataset relationship prediction. Results are computed using the annotated concepts (columns 1 and 2), and using the predicted concepts as identified in the concept recognition step (columns 3 and 4).
In this study, we defined the relationship between two concepts as a structure of five distinct context-blocks: the introductory block, the first concept block, the connective block, the second concept block, and the conclusive block. Such a representation was successful in identifying eight relationships between medical problems, treatments and tests in patient records. The performance degraded considerably when the context-blocks structure was removed for the same relationships, with the same set of features and the same classification algorithm (the naïve bag-of-features model). The context-blocks representation captured the individual word positions, and treated them accordingly. For example, for the conducted relationship the word “without” was a highly weighted negative feature in the introductory block and a highly weighted positive feature in the connective block.
Also, stop-words were very valuable in this study as also reported in . For example, the word “no” was a highly weighted negative feature in the introductory block of the given relationship, while being a highly weighted positive feature in the same block of the not given relationship. Similarly, the words “for”, “but”, “because”, and other stop-words, were observed to fulfill analogous roles.
In this study, we also addressed the natural prerequisite problem: in order for a relationship to be identified between two co-occurring concepts, those two concepts need to be identified first. We built a reliable concept recognizer that exhibited high accuracy at identifying concept boundaries, which is critical for the context-blocks relationship model. While feature selection did not have a significant impact on relationship extraction based on manually annotated concepts, it significantly improved the performance of relationship extraction based on automatically extracted concepts (T-test, p=0.005). Overall, as can be seen from Table 9, feature selection was a key step that allowed us to obtain relationship extraction performance for automatically extracted concepts similar to that for manually annotated concepts.
Naturally, a careful study of the clinical texts may define other types of relationships between medical concepts. In that case, the context-blocks model could be easily adapted. Finally, this model only considered the text within a sentence. Such a simplification, by definition, limits the sensitivity of the produced results. Future work should include natural language techniques in order to obtain a better understanding of the text, as well as to resolve pronouns and perform inference.
In this work, we present a successful end-to-end method for relationship extraction from clinical documents. Automatic recognition of medical concepts in clinical records is a challenging first step towards semantically relating the concepts and towards more advanced reasoning applications of text mining in patient records. To address this, we built a reliable concept recognizer that exhibited high accuracy (F-measure = 0.870) at identifying concept boundaries, which is critical for the context-blocks relationship model. We defined a relationship identification scheme between two concepts in text. In this scheme, the relationship is represented as a structure of five context-blocks: the introductory, first concept, connective, second concept, and conclusive block. This scheme automatically captured word position information, which is critical in certain relationships. We found that assertion information was useful in detecting clinical tests conducted to investigate medical problems, and treatments which cause medical problems to get worse. Semantic types were useful in identifying treatments that improved a medical problem, and UMLS concept identifiers were relevant in identifying two medical problems that were related to each other. Our system benefited from inclusion of stop-words, especially when found in the introductory and connective blocks of the relationship representation. Our results may serve as a benchmark for comparison to other systems developed on i2b2 challenge data. Finally, our system may serve as a preliminary step for other discovery tasks in medical informatics.
The authors declare that they have no competing interests.
RID designed the study, developed the concept and relationship models, performed the evaluation and wrote the draft of the manuscript. AN and ZL contributed to the study design, data preparation and evaluation. All authors have read and approved the final manuscript.
Funding: This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 3, 2011: Machine Learning for Biomedical Literature Analysis and Text Retrieval. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S3.