Development of three classification trees (CT) based on the CART (Classification and Regression Trees), CHAID (Chi-Square Automatic Interaction Detection) and C4.5 methodologies for the calculation of probability of hospital mortality; the comparison of the results with the APACHE II, SAPS II and MPM II-24 scores, and with a model based on multiple logistic regression (LR).
Retrospective study of 2864 patients. Random partition (70:30) into a Development Set (DS) n = 1808 and Validation Set (VS) n = 808. Their properties of discrimination are compared with the ROC curve (AUC CI 95%), Percent of correct classification (PCC CI 95%); and the calibration with the Calibration Curve and the Standardized Mortality Ratio (SMR CI 95%).
CTs are produced with a different selection of variables and decision rules: CART (5 variables and 8 decision rules), CHAID (7 variables and 15 rules) and C4.5 (6 variables and 10 rules). The common variables were: inotropic therapy, Glasgow, age, (A-a)O2 gradient and antecedent of chronic illness. In VS: all the models achieved acceptable discrimination with AUC above 0.7. CT: CART (0.75(0.71-0.81)), CHAID (0.76(0.72-0.79)) and C4.5 (0.76(0.73-0.80)). PCC: CART (72(69-75)), CHAID (72(69-75)) and C4.5 (76(73-79)). Calibration (SMR) better in the CT: CART (1.04(0.95-1.31)), CHAID (1.06(0.97-1.15) and C4.5 (1.08(0.98-1.16)).
With different methodologies of CTs, trees are generated with different selection of variables and decision rules. The CTs are easy to interpret, and they stratify the risk of hospital mortality. The CTs should be taken into account for the classification of the prognosis of critically ill patients.
Stratifying the patients into risk groups, according to their severity, is essential for the comparison of treatments and the establishment of differences between different units or hospital centres. As a result, working in an intensive care unit (ICU) necessitates making prognoses for patients within the first 24 hours of their admission. Establishing a prognosis consists of assigning a probability of death by using variables commonly used for the diagnosis and treatment of critically ill patients .
Severity scores are classic tools used in establishing this probability. The most commonly used scores are the APACHE II (Acute Physiology and Chronic Health Evaluation II), the SAPS II (Simplified Acute Physiology Score II) and the MPM II-24 (Mortality Probability Models II-24) scores [2-4].
Other systems of severity classification based on different mathematical strategies have also been used .
In the last decade, classification trees (CT), which were developed more than 20 years ago, have acquired greater importance in the immediate interpretation of the decision rules that they generate, and they are readily accepted by professionals in clinical practice .
A CT is a graphic representation of a series of decision rules. Beginning with a root node that includes all cases, the tree branches are divided into different child nodes that contain a subgroup of cases. The criterion for branching (or partitioning) is selected after examining all possible values of all available predictive variables. In the terminal nodes (the "leaves" of the tree), a grouping of cases is obtained, such that the cases are as homogeneous as possible with respect to the value of the dependent variable .
The different CT types are distinguished by the manner of node partitioning. In the specific case of CARTs (Classification And Regression Trees), possibly the most widely used CT in medicine, an impurity function (the so-called Gini index) is calculated, and for each division of the tree, the variable and its cut-off value are defined such that the decrease in the impurity function is the greatest . There are many types of CTs (or improved versions) such as CHAID (Chi-square Automatic Interaction Detection) and C4.5 (developed from the so-called Concept Learning Systems). Table 1 illustrates, in a schematic fashion, the particularities of these CTs. A CT has a growth phase, a pruning phase (removal of branches that do not provide general information to the system) and a selection of the optimal tree .
Table 1. Characteristics of the classification tree methods
The aim of the present study was to develop (with a population of critically ill patients) three classification trees (based on CART, CHAID and C4.5 methodologies) to calculate the probability of hospital mortality and to compare these trees with each other, with the classic scores (APACHE II, SAPS II and MPM II-24) and with a model based on multiple logistic regression.
This is a retrospective study carried out using the database of a mixed ICU (with medical and surgical services) of 14 beds located at the University Hospital Arnau de Vilanova of Lleida. The ethical committee of the hospital was informed that the study was being carried out, and informed consent was not deemed necessary, since all the variables were collected for the diagnosis and treatment of the patients and their anonymity was assured at all times.
Data collected over ten years (from January 1997 to December 2006) were used. In this study, all patients were over the age of 16 years and remained in the ICU for more than 24 hours. Patient records with incomplete data were not used.
A random partition, in a 70:30 ratio, was made to establish the development and the validation sets, respectively.
Data concerning age, sex, length of stay in the ICU and procedures specific to the ICU were used. The outcome variable of interest was the probability of hospital mortality. The patients were divided according to their diagnostic groups following the Knaus classification . Six diagnostic groups were established according to the case mix and level of severity of the ICU, including two trauma categories of TBI (traumatic brain injury) and Multiple trauma (multiple trauma without brain injury), Respiratory (chronic respiratory problems with decompensation), Neurological (ischemic or hemorrhagic strokes), Surgery (surgical problems not included in other categories) and Medicine (medical pathology not included in other categories).
Each patient's medical records and laboratory database files were used to obtain information pertaining to baseline (at ICU admission) demographics, pre-existing comorbidities and scores (APACHE II, SAPS II and MPM II-24). The data were then compiled (manual recording) into single data using a relational database management system (Microsoft Access©).
The presence of acute renal failure was defined (according to the model MPM II-24) by levels of serum creatinin above 2 mg/dL . The antecedents of chronic organ insufficiency (defined according to the APACHE II model) were included in the variable COI .
Logistic models and classification trees
Models were created with the development set and were subsequently checked in the validation set.
Working with the development set, first, a univariate analysis was performed for all the variables included in the three scores to select those that predicted survival. Those that were statistically significant predictors were included in the development of the multivariable models.
We used a model of multiple logistic regression (LR) with forward stepwise selection of variables .
The computer programs used for creating the CTs are presented in Table 1. The program WEKA (a project of Waikato University) is freely accessible and includes a CT module, named J48, that includes CART and C4.5 .
Answer-Tree©, a module of SPSS (Statistical Package for the Social Sciences), includes options for CART and CHAID, and the program DTREG© (version 3.5) is based on a CART-type methodology.
To create the three types of CTs, a cross-validation system with ten partitions was used, and the only common restriction for terminating the growth of the tree was the minimum number of subjects in the terminal nodes (which was fixed at 50 patients).
The variables are presented as the mean (standard deviation), the median (interquartile interval) or as a percentage. For a comparison of the variables, the chi-squared (χ2) test was used for categorical variables, and the ANOVA test or non-parametric Mann-Whitney test was used for continuous variables, depending on the characteristics of the distribution.
To compare the different models, we measured their precision (discrimination and calibration) with the Brier score. The discrimination was measured by calculating the percentage of correctly classified patients (PCC) with a cut-off point with a probability of 0.5 and by the area below the ROC curve (AUC) . For calibration, the Hosmer-Lemeshow C test (HL-C) was used  by constructing the calibration curve and calculating the standardized mortality ratio (SMR) . These calculations were made both in the development set and in the validation set. We used a correlation matrix (Spearman correlation coefficients) and the Bland-Altman test to analyse the individual probabilities generated by the CT models .
The statistical analysis was carried out with the program SPSS (version 14.0).
Among 2823 patients, 139 were excluded due to incomplete or erroneous data (4.9%), leaving 2684 eligible patients. The development group consisted of 1880 patients (70%) and the validation group consisted of 804 (30%).
The demographic characteristics of the patients are shown in Table 2; there were no major differences between the development and the validation groups. Some characteristics are particular to the ICU, such as the low proportion of scheduled patients (6.5%), the prolonged length of stay (median of 7 days) and the high mortality rate (31.4%).
Table 2. Demographic characteristics of patients
Table 3 shows the evolution (during the 10 years observed) of hospital mortality, the severity scores and the participation percentage in the development set. There are no significant differences (only the evidence that the number of admissions has kept on increasing).
Table 3. Outcome trend over the observation period
Variable selection: univariate analysis
A total of 24 variables showed significant differences between the survivors and non-survivors (Table 4). The table also shows the scores for which the different variables were included. No significant differences were found for respiratory frequency (APACHE II), serum potassium (APACHE II and SAPS II), hematocrit (APACHE II), leuckocyte count (APACHE II y SAPS II), bilirubin (SAPS II), PaO2 (MPM II-24) or antecedents of cirrhosis and neoplasia (MPM II-24).
Table 4. Univariate analyses of characteristics of patients at discharge, by survival status.
Only the COI variable reflected the chronic illnesses of the patient. For variables related to diagnoses, the surgery group was associated with a greater possibility of hospital mortality, while the trauma group was associated with a lower likelihood of mortality.
Multiple Logistic Regression Model
Table 5 shows the LR model including 9 variables (Continuous: Age, HR, Glasgow and (A-a)O2 gradient. Discrete: Inotropic therapy, MV, Acute renal failure, COI and Trauma) selected from the 24 variables.
Table 5. Results of multiple logistic regression
Classification Tree Models
The variables common to the three CTs and the LR model are inotropic therapy (INOT), Glasgow value, (A-a)O2 gradient ((A-a)O2), age and COI.
Figure 1 shows the CT based on the CART methodology (the three programs gave the same result). It used only five variables and began with INOT. It generated eight decision rules with an assignment rank of probability ranging from 5.9% to a maximum of 71.3%.
Figure 1. Classification tree by CART algorithm. The gray squares denote terminal prognostic subgroups. INOT: Inotropic therapy; (A-a)O2 gradient: Alveolar-arterial oxygen gradient (mmHg); MV: Mechanical ventilation; COI: Chronic organ insufficiency.
It is noted that a CT can use the same variables in various decision rules and that, for continuous variables, different cut-off points can be selected.
Figure 2 illustrates the CT based on the CHAID methodology. It used seven variables, and it also began with the variable INOT. It generated fifteen decision rules with an assignment rank of probability ranging from 0.7% to a maximum of 86.4%. In this type of CT, the Glasgow value, age and (A-a)O2 variables were divided into intervals with more than two possibilities.
Figure 2. Classification tree by CHAID algorithm. The gray squares denote terminal prognostic subgroups. INOT: Inotropic therapy; (A-a)O2 gradient: Alveolar-arterial oxygen gradient (mmHg); MV: Mechanical ventilation; COI: Chronic organ insufficiency.
Figure 3 depicts the C4.5 model, which used six variables (the five common variables and the MAP, which is not included in the LR model) and generated ten decision rules. The probabilities ranged between 7.6% and 76.2%. In contrast to the other CTs, in this CT, the first variable was the point value on the Glasgow scale.
Figure 3. Classification tree by C4.5 algorithm. The gray squares denote terminal prognostic subgroups. INOT: Inotropic therapy; (A-a)O2 gradient: Alveolar-arterial oxygen gradient (mmHg); MV: Mechanical ventilation; COI: Chronic organ insufficiency; MAP: Mean arterial pressure.
Comparison of model properties
The three CT models and the LR model were also compared with those generated using the APACHE II, SAPS II and MPM II-24 scores.
The severity scores were applied without making recalibration in all the population (development and validation sets).
Table 6 shows the values for the properties evaluated. It can be seen that all models achieved an acceptable discrimination (an AUC greater than 0.70) both in the development and the validation set.
Table 6. Performance of the classification models: development and validation sets
Figure 4 presents the calibration curves of the models. It is notable that some curves were displaced to the observed mortality; this coincided with an SMR greater than 1 (with a CI of 95% that does not include 1) (Table 6). The models based on the CTs were better calibrated (this was observed both in the calibration curves and in the obtained SMR (see Table 6)).
Figure 4. Calibration curves for the classification models. Validation set.
All the models correctly classified approximately 75% of the cases evaluated.
Comparison of individual probabilities generated by the CT models
Table 7 shows the correlations between the probabilities calculated with the 3 CTs and the LR model (all of them statistically significant).
Table 7. Correlation matrix of the probabilities (CTs and LR models)
Figure 5 shows the Bland-Altman results obtained in the validation set by comparing the probabilities determined by the CART CT with those of the LR, CHAID and C4.5 CTs.
Figure 5. Bland-Altman plot analysis. (a) CART vs Logistic Regression. (b) CART vs CHAID. (c) CART vs C4.5. The dotted lines are the limits of agreement (mean ± 2 SD). Validation set.
We observed that there were patients for whom the difference in the probabilities exceeded the acceptable limit of the test. There were 116 patients included in the comparison of the CART and CHAID CTs, and 245 in the comparison of the CART and C4.5 CTs. The differences can be partly attributed to the behaviour of the Glasgow variable (different cut-off points or partitions) and to the influence of the COI variable in the different divisions of the tree branches.
The different models generate, in some patients, different allocation of death provability. When performing a validation with records not used in the phase of development, the different allocation of probability determines in our case a conservation of a similar discrimination but that the calibration is different (being better for the AC).
The results illustrate that the ICU had particular demographic characteristics due to its case mix, with a low percentage of scheduled patients, a long length of stay and high mortality. These data are important when it comes to appraising and generalising the results obtained with our database .
The results yielded mortality rates that were higher than expected (according to the classic APACHE II, SAPS II and MPM II-24 scores), which can be partly attributed to these individual characteristics . However, this finding also necessitates a recalibration of these models in order to achieve a correct stratification of the patients' risk of hospital mortality .
Previously, CTs have been used with critical patients, e.g., for the calculation of the probability of death from coronary pathology , intracerebral haemorrhages  or traumatic brain injuries , for the prediction of persistent vegetative states  or (as in our study) for stratifying the probability of death in a general population of ICU patients [23,24].
The common variables selected by the three CT types (and also by the model based on LR) were: the necessity of inotropic therapy, the point value on the Glasgow coma scale, the alveolar-arterial gradient in oxygen, age and the presence of antecedents of important chronic diseases. This group of variables included information concerning chronic health and age (which were variables specific to the patient), the point value on the Glasgow coma scale and the (A-a)O2 gradient as deviations from the normal state as well as a variable specific to the intensive treatment (INOT). The selection of some of these variables has also been reported in other studies of mortality in other groups of critical patients .
These five variables are capable of stratifying the examined population of critical patients (for example, as in the CART CT), using eight simple decision rules, with acceptable properties of discrimination and calibration.
We also observed that the three CT types exhibited differences. Even when incorporating the five common variables mentioned earlier, these CTs differed in the first variable to be selected, in the details of "branching", in the cut-off points (and subgroups), in the order of variable selection and in the incorporation of other variables.
We have already seen that the CART CT includes the five general variables. The C4.5 CT adds the MAP (Mean Arterial Pressure) variable and the CHAID CT includes MV (Mechanical Ventilation) variables and the fact of belonging to the trauma group. The LR model uses those of the CHAID CT model, also including the Acute Renal Failure and HR (Heart Rate) variables.
The CT software allows to adjust the levels and the number of partitions for each branch in order to get more complex models . In our case, our only restriction (in the 3 CT models) was that the minimum number of subjects in the terminal nodes should be of 50 patients.
We cannot state which CT was optimal (since they had similar general properties). The CART and CHAID CTs were similar in their order of partitioning, although the CHAID CT (due to its inherent characteristics) separated the continuous variables into more than two possibilities and generated more decision rules. The CART CT was simpler, while the CHAID CT showed greater complexity (and also selected more variables). Different CTs can select different first variables, and in the C4.5 CT, the first variable, the Glasgow point value, was different from that of the other CTs; the C4.5 CT also incorporated different variables. The analysis of the individual probabilities generated by the different CTs (in spite of a good correlation) assisted in the identification of possible "problem" variables, e.g. the Glasgow point value and the COI variable, in their order of appearance in the decision rules generated.
When there is a classification problem, there is no model that can be chosen a priori to be the best . Even with the same information, different CTs develop models with different interpretations . Based on our data, the CTs do not compete with the classic scores in their function of calculating individual probabilities. In the case of a large database, the CTs generated would be too complex to interpret and use with regularity (many branches and decision rules). The immediate interpretative advantage of CTs is only obtained with simple trees [31,32].
Our study had several limitations. In the first place, it was carried out in only one ICU and within a ten-year span database (although no variation was observed during the period of study). It would also have been possible to employ more methodologies for comparison or to improve those that were used, by incorporating relations and/or ranks of a priori variables, as do the classic scores.
As exposed by one of the reviewers, we found a great difference between the observed and expected mortality in the validation group in the LR model. The LR-based model could have been carried out using the variables as categorical, thus minimizing the possible effect that outlier values (using the variables as continuous) have on the predicted outcome. One of the advantages of CT-based models is that they automatically change the continuous variables into categorical ones and that their cut-off points could also be used to create a LR model with discreet variables
We must mention the effort at Waikato University (New Zealand) regarding the free-access program WEKA, which strives to collect (in a single tool) the majority of the methodologies that are used to classify, select and group variables .
The principal advantage of CTs is that they are easy to interpret. However, this advantage could turn into an obstacle, since we tend to choose the optimal CT as that which more closely approaches the clinical reasoning that coincides with that of the program user . An understanding of the clinical problem is necessary in order to adequately interpret CTs.
One contribution of our effort was the demonstration that the CT methodology is not unique and that different CTs could be generated according to various methodologies. The CTs assisted in both selecting variables of greater importance in the problem of classification and determining the best cut-off points for the continual variables.
We believe that CTs (e.g., the model based on CART) are mainly useful in obtaining homogenous groups for the assignation of the probability of hospital mortality. These groups with different characteristics (defined by rules of classification that can be interpreted) can serve, for example, as a basis for the creation of new scores.
We intend to do further research including a multi-centre study, with the incorporation of more methodologies and the possible use of hybrid models. In order to generalise our results, external validation will be required .
The main benefits to CT analysis are to identify a relatively small number of groups that are reasonably homogeneous with regard to the outcome. The CTs can be used in intensive care medicine for assisting in diagnosis and prognosis [37,38]. Those less familiar with CTs should realise that this us a class of methods including many different approaches, and that these different approaches may result in considerable differences in classifications.
The authors declare that they have no competing interests.
JT and JM provided the CT procedure, implemented the algorithms, performed statistical and drafted the manuscript. MB, LS and ARP analyzed the data, interpreted the results and wrote the final paper. All authors read and approved the final manuscript.
Hosmer DW, Lemeshow S: Applied logistic regression. 2nd edition. John Wiley & Sons New York; 2000. Publisher Full Text
Lancet 1986, 1:307-10. PubMed Abstract
Zhu BP, Lemeshow S, Hosmer DW, Klar J, Avrunin JS, Teres D: Factors affecting the performance of the models in the Mortality Probability Model II system and strategies of customization: A simulation study.
Mann JJ, Ellis SP, Waternaux CM, Liu X, Oquendo MA, Malone KM, Brodsky BS, Haas GL, Currier D: Classification trees distinguish suicide attempters in major psychiatric disorders: a model of clinical decision making.
Köhne CH, Cunningham D, Di CF, Glimelius B, Blijham G, Aranda E, Scheithauer W, Rougier P, Palmer M, Wils J, Baron B, Pignatti F, Schöffski P, Micheel S, Hecker H: Clinical determinants of survival in patients with 5-fluorouracil-based treatment for metastatic colorectal cancer: results of a multivariate analysis of 3825 patients.
The pre-publication history for this paper can be accessed here: