Open Access Research article

Polychotomization of continuous variables in regression models based on the overall C index

Harukazu Tsuruta1* and Leon Bax2

Author Affiliations

1 Department of Medical Informatics, School of Allied Health Sciences, Kitasato University, Sagamihara, Kanagawa, 228-8555, Japan

2 Department of Medical Informatics, Graduate School of Medical Sciences, Kitasato University, Sagamihara, Kanagawa, 228-8555, Japan

BMC Medical Informatics and Decision Making 2006, 6:41  doi:10.1186/1472-6947-6-41


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1472-6947/6/41


Received: 23 May 2006
Accepted: 14 December 2006
Published: 14 December 2006

© 2006 Tsuruta and Bax; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

When developing multivariable regression models for diagnosis or prognosis, continuous independent variables can be categorized to make a prediction table instead of a prediction formula. Although many methods have been proposed to dichotomize prognostic variables, to date there has been no integrated method for polychotomization. The latter is necessary when dichotomization results in too much loss of information or when central values refer to normal states and more dispersed values refer to less preferable states, a situation that is not unusual in medical settings (e.g. body temperature, blood pressure). The goal of our study was to develop a theoretical and practical method for polychotomization.

Methods

We used the overall discrimination index C, introduced by Harrell, as a measure of the predictive ability of an independent regressor variable and derived a method for polychotomization mathematically. Since the naïve application of our method, like some existing methods, gives rise to positive bias, we developed a parametric method that minimizes this bias and assessed its performance by Monte Carlo simulation.

Results

The overall C is closely related to the area under the ROC curve and the produced di(poly)chotomized variable's predictive performance is comparable to the original continuous variable. The simulation shows that the parametric method is essentially unbiased for both the estimates of performance and the cutoff points. Application of our method to the predictor variables of a previous study on rhabdomyolysis shows that it can be used to make probability profile tables that are applicable to the diagnosis or prognosis of individual patient status.

Conclusion

We propose a polychotomization (including dichotomization) method for independent continuous variables in regression models based on the overall discrimination index C and clarified its meaning mathematically. To avoid positive bias in application, we have proposed and evaluated a parametric method. The proposed method for polychotomizing continuous regressor variables performed well and can be used to create probability profile tables.

Background

In modern diagnostic and descriptive prognostic research, regression models are often used to model an illness-related outcome based on a number of independent regressor variables, also referred to as diagnostic indicators or prognostic predictors [1]. Such regressor variables can be categorical or numerical. From the vantage point of applicability in a clinical setting, categorization (often dichotomization) of continuous independent variables can be useful. Obtaining a prediction at the bedside without a computer is easier with a prediction table based on categorized variables than with a prediction formula. Even if calculation is not problematic, table presentation of the risks has the practical advantages that (1) repeated use of the table will give physicians an intuitive feel for the disease risk, and (2) even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to the most extreme cases in the table.

Depending on the setting, several different approaches have been proposed for dichotomization. One popular method is to find a cutoff point to discriminate whether a patient belongs to a normal group or a disease group based on the observed value of a predictive factor. This type of discriminant function analysis was first developed by R.A. Fisher [2] in the 1930s. The Mahalanobis distance [3] can be used to find the optimal cutoff point if the variable is normally distributed.

Another solution, sometimes used in clinical chemistry, is to find a cutoff point that maximizes the sum of sensitivity (SE) and specificity (SP) [4,5]. There are different versions of this approach where one can maximize the weighted sum of SE and SP, or maximize the SE while fixing SP to an acceptable value [6,7]. Cantor claimed that these methods have been used in many published articles without giving a theoretical foundation or scientific justification [8].

Yet another straightforward and popular method is to select a classification that maximizes a measure of difference between the two groups, such as the p-value of a chi square statistic [9,10]. This method, sometimes called the minimum p-value approach, has been described and used for the prognosis of cancers [11,12]. Several authors have pointed out that the naïve selection used in this method overestimates the significance of the predictor or indicator's relationship to the dependent variable because of multiple testing, and several adjustment methods of the observed p-values have been proposed [9-18].

Besides using the data at hand to come to a dichotomization of continuous variables, it is also possible to use profit (benefit) or loss (cost) information. In that case, the optimal cutoff point is defined so as to maximize the expected utility. Metz showed that the optimal point is the spot on the ROC curve at which the slope is (L/B)(1-p)/p, where B is the net benefit of treating diseased individuals, L the net loss of treating non-diseased individuals, and p the prevalence of the disease under study [19]. Nevertheless, Cantor et al., in a review of studies in the medical literature that referred to "ROC" and "cutoff", found that only a few articles included an L/B ratio in the analysis for determining an optimal cutoff point [8].
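As a small worked illustration of the slope expression attributed to Metz above, the following sketch evaluates (L/B)(1-p)/p for one set of inputs. The function name and the numerical values are hypothetical and chosen only for the arithmetic; they do not come from the article.

```python
def optimal_roc_slope(net_loss, net_benefit, prevalence):
    """Slope of the ROC curve at the utility-maximizing operating point,
    using the expression (L/B)(1-p)/p quoted in the text."""
    return (net_loss / net_benefit) * (1.0 - prevalence) / prevalence

# Hypothetical inputs: treating a diseased patient is worth four times the
# loss of treating a non-diseased one, and the prevalence is 20%.
slope = optimal_roc_slope(net_loss=1.0, net_benefit=4.0, prevalence=0.2)
print(slope)
```

With these illustrative inputs the ratio L/B = 0.25 is multiplied by the odds against disease, 0.8/0.2 = 4, so a rarer disease pushes the operating point toward a steeper part of the ROC curve.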

The above methods all concern dichotomization. However, when central values refer to normal states and dispersed values to diseased states, two (or more) cutoff points are necessary to discriminate these states. Consequently, one is inevitably faced with the challenge of polychotomization. Unfortunately, methods for polychotomization are less developed. Although Kristjansson et al. [20] described a method for choosing optimal cutoff points in a screening test with a continuous score to divide people into a number of disease categories, their method is not applicable to polychotomization of regressor variables in regression models; their criterion loses its meaning in this setting.

The major goal of our study is to develop a theoretical and practical method for polychotomization. We propose a novel approach for independent continuous variables in regression models based on the overall discrimination index C introduced by Harrell et al. [21,22]. We will show that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. However, the naïve search for the maximum C index gives rise to positive bias, not unlike the minimum p-value approach [9-18] or the method of maximizing the sum of the sensitivity and specificity [4,5]. We therefore propose a parametric version in which the estimates of the predictive performance and cutoff points are both essentially unbiased. We evaluate this method and present means and standard deviations of predictive performance and cutoff point estimates for typical cases via Monte Carlo simulation. Finally, we provide a simple application example with a predictive regression model for rhabdomyolysis and show how our method can be used to create a probability profile table.

Methods

The categorization criterion

We assume there is an existing predictive model based on patients that belong to either a normal group or a diseased group and that the distribution of the relevant independent continuous variable X is known or that we have observations on it. Our goal is to find a method of optimal polychotomization for this continuous variable with a minimum loss of predictive ability. This involves making the number of possible patient profiles finite, and replacing the regression formula with a table of the risk probabilities for all patient profiles. Unlike most previously developed approaches, we have no a priori intention to categorize the variable into two classes, and we assume that it might be necessary to compare categorizations into three or more classes.

For this discussion we need a measure to evaluate the predictive power of a predictive variable. Our choice for a measure of predictive power is the overall discrimination index C [21-24], or the 'pair consistency probability', as we like to call it. This measure refers to the probability that the relative position of single normal-disease pair values is consistent with the relative position of their values of central tendency.

Without loss of generality, we assume that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases. Next we randomly take a sample xi[h] from the healthy group and another sample xi[d] from the diseased group. The pair (xi[h], xi[d]) is considered consistent if xi[h] < xi[d], tied if xi[h] = xi[d], and inconsistent if xi[h] > xi[d]. The pair consistency probability C is defined as:

$$C = p_{con} + \frac{p_{tied}}{2} \qquad (1)$$

where pcon and ptied denote the probabilities that the pair is consistent and tied respectively.
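The definition can be made concrete with a sample estimate. The sketch below assumes that equation (1) has the form C = pcon + ptied/2, which is the form that reproduces equation (3) later in the text, and simply counts consistent and tied pairs over all healthy-diseased combinations. The function name and the toy data are illustrative only.

```python
def pair_consistency(healthy, diseased):
    """Sample estimate of C = p_con + p_tied / 2 over all healthy-diseased pairs."""
    consistent = 0
    tied = 0
    for x_h in healthy:
        for x_d in diseased:
            if x_h < x_d:
                consistent += 1   # healthy value below diseased value
            elif x_h == x_d:
                tied += 1         # equal values count as half a consistent pair
    n_pairs = len(healthy) * len(diseased)
    return (consistent + 0.5 * tied) / n_pairs

# Toy data (hypothetical): 6 pairs, of which 4 are consistent and 1 is tied.
healthy = [1.0, 2.0, 3.0]
diseased = [2.0, 4.0]
c = pair_consistency(healthy, diseased)
print(c)
```

A value of 0.5 corresponds to no discrimination and a value of 1 to perfect separation of the two groups.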

Next, if we let fh represent the probability density function (PDF) of X in the healthy group and fd represent the PDF of X in the diseased group, and let z represent a cutoff point for dichotomization, then the true positive fraction Tp and false positive fraction Fp are defined by

$$T_p(z) = \int_z^{\infty} f_d(x)\,dx, \qquad F_p(z) = \int_z^{\infty} f_h(x)\,dx.$$

In the case that the variable is continuous, as z increases, Tp and Fp both decrease continuously. The ROC curve [19,25] can be depicted as the trace of points (Fp , Tp ). Green and Swets [25] demonstrated that

$$C = \int_0^1 T_p \, dF_p.$$

This means that the pair consistency probability is equivalent to the area under the ROC curve for continuous variables. We will demonstrate that this relation also holds for polychotomized variables, and that the pair consistency probability C is a good measure to compare the predictive ability with the original continuous variable.

Optimal cutoff point for dichotomization

First, we discuss our method for dichotomization in which a continuous independent variable in a predictive model is categorized to one of two classes by a cutoff point. If we denote the value of the cutoff point z and assume that X is continuous in both the healthy and the diseased groups, that is, P(x[h] = z) = 0 and P(x[d] = z) = 0, the results of random pair sampling are classified into the following four cases:

x[h] < z and x[d] < z,     tied

x[h] < z and x[d] > z,     consistent

x[h] > z and x[d] < z,     inconsistent

x[h] > z and x[d] > z,     tied.

Let α denote the probability that x[h] is greater than z, and β denote the probability that x[d] is less than z. Assuming that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases, we have

$$\alpha = \int_z^{\infty} f_h(x)\,dx, \qquad \beta = \int_{-\infty}^{z} f_d(x)\,dx. \qquad (2)$$

Then the probability of a consistent pair becomes

pcon = (1 - α)(1 - β),

and the probability of a tied pair becomes

ptied = (1 - α) β + α (1 - β).

Substituting these probabilities into (1), we have

C = 1 - (α + β)/2.     (3)

It follows that the highest pair consistency probability is achieved when the sum of the two types of errors, α + β, is minimized. Since sensitivity is (1 - β) and specificity is (1 - α), we have

C = (sensitivity + specificity)/2.     (4)

Therefore the highest pair consistency probability is achieved when the sum of sensitivity and specificity is maximized.
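The relation between C and the two error probabilities can be checked numerically. The sketch below assumes normal distributions for both groups (the parameters mirror the N(0, 1²) and N(1.5, 2²) case used later in the simulations) and scans a grid of cutoffs for the one that maximizes C = 1 - (α + β)/2. The helper name, the grid range and the step are assumptions made for illustration.

```python
from statistics import NormalDist

def dichotomized_c(z, f_h, f_d):
    """C for a single cutoff z, with the healthy group centred below the diseased group."""
    alpha = 1.0 - f_h.cdf(z)   # P(healthy value > z): 1 - specificity
    beta = f_d.cdf(z)          # P(diseased value < z): 1 - sensitivity
    return 1.0 - (alpha + beta) / 2.0

f_h = NormalDist(0.0, 1.0)     # healthy group, SD 1
f_d = NormalDist(1.5, 2.0)     # diseased group, SD 2

# Grid search over candidate cutoffs from -3 to 5 in steps of 0.001.
grid = [i / 1000 for i in range(-3000, 5001)]
best_z = max(grid, key=lambda z: dichotomized_c(z, f_h, f_d))
best_c = dichotomized_c(best_z, f_h, f_d)

sensitivity = 1.0 - f_d.cdf(best_z)
specificity = f_h.cdf(best_z)
print(best_z, best_c, (sensitivity + specificity) / 2.0)
```

The last two printed values agree because equations (3) and (4) are the same quantity written in two ways.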

Figure 1 illustrates the changes of C when fh and fd are normal. Let z be the cutoff point where fh and fd cross between the two peaks. If the cutoff point is shifted to the right from z, then α will decrease and β will increase. Since fd is greater than fh in this interval, the increase of β is greater than the decrease of α. If the cutoff point is shifted to the left, the opposite is true. Therefore, the sum of the two types of errors, α + β, attains a local minimum at the point where fh and fd intersect between the peaks. If fh and fd are unimodal and cross at only one point, α + β attains its global minimum at that crossing point.

Figure 1. Sample illustration of the change of pair consistency probability C. Lower curves: sample illustration of the probability density functions in the healthy group (fh) and in the diseased group (fd); Upper curve: pair consistency probability C (= 1 - (α + β)/2) as a function of cutoff point z. The sum of the two types of errors, α + β, takes a local minimum at the point where fh and fd intersect.

Generation and meaning of the ROC straight line graph for a dichotomous variable

As we have described earlier, when the independent variable is continuous, Tp and Fp both decrease continuously and the ROC curve can be depicted as the trace of points (Fp, Tp). But what happens to the ROC curve when the variable is dichotomous? Let z0 represent the cutoff point and Fp0 and Tp0 denote the false positive and true positive fractions for z0, respectively. Unlike the continuous case, only three points (1, 1), (Fp0, Tp0) and (0, 0) are depicted in Fp - Tp coordinates and we cannot obtain a true curve (see Figure 2). We joined these points with straight lines, and labelled this graph the ROC straight line graph. Then area A under the ROC straight line graph becomes:

Figure 2. The ROC curve and ROC straight line graph for the sample distributions in Figure 1. The ROC curve was derived from the distributions in Figure 1 and a ROC straight line graph for the cutoff point z0, which gives the maximum C, was also plotted. Filled part A shows the area under the ROC straight line graph.

A = Fp0Tp0/2 + (1 - Fp0)Tp0 + (1 - Fp0)(1 - Tp0)/2

= 1 - (α + β)/2 = C.     (5)

This means that the area under the ROC straight line graph for a dichotomous variable is, analogous to the case of a continuous variable, equivalent to the pair consistency probability C. Therefore, finding a cutoff point that maximizes C is equivalent to the problem of finding the point (Fp0, Tp0) on the original ROC curve that maximizes the area A under the ROC straight line graph.

Optimal cutoff points for polychotomization

Next, consider the polychotomous case. Again, let x[h] be a sample from the continuous random variable X in the healthy group and x[d] a sample from the same variable in the disease group, both taken randomly. Let z0 = -∞, zn = ∞ and z1, z2,..., zn-1 be cutoff points where z1 < z2 < ... < zn-1. We define

$$p_k^{[h]} = P(z_{k-1} < x^{[h]} \le z_k) = \int_{z_{k-1}}^{z_k} f_h(x)\,dx \qquad (k = 1, \ldots, n),$$

$$p_k^{[d]} = P(z_{k-1} < x^{[d]} \le z_k) = \int_{z_{k-1}}^{z_k} f_d(x)\,dx \qquad (k = 1, \ldots, n).$$

Then the probabilities for tied and consistent pairs become

$$p_{tied} = \sum_{k=1}^{n} p_k^{[h]}\, p_k^{[d]}, \qquad p_{con} = \sum_{k=1}^{n-1} p_k^{[h]} \sum_{j=k+1}^{n} p_j^{[d]},$$

and the pair consistency probability C can be calculated from equation (1).

We also define

Tpk = P (x[d] > zk) and Fpk = P (x[h] > zk)     (k = 0,..., n).

The points (Fpk, Tpk) lie on the original ROC curve, and the set of points (Fpk, Tpk) joined by straight lines yields the ROC straight line graph. Let A represent the area under the ROC straight line graph and Ak represent the area under the line whose ends are (Fpk-1, Tpk-1) and (Fpk, Tpk). As illustrated in Figure 3, the area Ak is

Figure 3. Area Ak under the ROC straight line graph. The filled part shows the area Ak under the ROC straight line graph with end points (Fpk-1, Tpk-1) and (Fpk, Tpk).

$$A_k = \frac{(F_{p,k-1} - F_{p,k})(T_{p,k-1} + T_{p,k})}{2}.$$

Therefore,

$$A = \sum_{k=1}^{n} A_k = \sum_{k=1}^{n} \frac{(F_{p,k-1} - F_{p,k})(T_{p,k-1} + T_{p,k})}{2}.$$

Then we have

$$A = \sum_{k=1}^{n} P(z_{k-1} < x^{[h]} \le z_k)\left[P(x^{[d]} > z_k) + \tfrac{1}{2}\,P(z_{k-1} < x^{[d]} \le z_k)\right] = p_{con} + \frac{p_{tied}}{2} = C.$$

Again, the pair consistency probability C for the polychotomized variable is equivalent to the area under its ROC straight line graph, and the problem of finding the optimal cutoff points that maximize C is mathematically equivalent to finding the set of edge points of the ROC straight line graph that maximizes the area A under that graph.
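The equivalence between C and the area under the ROC straight line graph can be verified numerically for any set of cutoffs. The following sketch assumes normal distributions and takes the category probabilities to be the probability mass between consecutive cutoffs; the function names and the example cutoffs are hypothetical.

```python
from statistics import NormalDist

def category_probs(dist, cutoffs):
    """P(z_{k-1} < X <= z_k) for k = 1..n, with z_0 = -inf and z_n = +inf."""
    cdf = [0.0] + [dist.cdf(z) for z in cutoffs] + [1.0]
    return [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

def c_from_pairs(p_h, p_d):
    """C = p_con + p_tied / 2 computed from category probabilities."""
    p_tied = sum(h * d for h, d in zip(p_h, p_d))
    # Consistent: healthy value in category k, diseased value in a higher category.
    p_con = sum(p_h[k] * sum(p_d[k + 1:]) for k in range(len(p_h)))
    return p_con + p_tied / 2.0

def roc_line_area(f_h, f_d, cutoffs):
    """Trapezoid area under the straight lines joining the points (Fp_k, Tp_k)."""
    fp = [1.0] + [1.0 - f_h.cdf(z) for z in cutoffs] + [0.0]
    tp = [1.0] + [1.0 - f_d.cdf(z) for z in cutoffs] + [0.0]
    return sum((fp[k - 1] - fp[k]) * (tp[k - 1] + tp[k]) / 2.0
               for k in range(1, len(fp)))

f_h = NormalDist(0.0, 1.0)
f_d = NormalDist(1.5, 2.0)
cutoffs = [0.0, 1.2, 2.5]          # hypothetical cutoffs giving four categories

c_pairs = c_from_pairs(category_probs(f_h, cutoffs), category_probs(f_d, cutoffs))
c_area = roc_line_area(f_h, f_d, cutoffs)
print(c_pairs, c_area)
```

With a single cutoff the same functions reduce to the dichotomous result in equation (5).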

Optimal cutoff points for variables for which normal and diseased cases have a common central tendency

There are many predictive variables whose central values refer to a normal state and whose more dispersed values refer to less preferable states. In the example of rhabdomyolysis prognosis that will follow later, body temperature, pulse rate, plasma sodium, and plasma pH are such variables. For these predictors, we need to find at least two cutoff points to discriminate normal and abnormal states. If we denote the values of the cutoff points z1 and z2 (z1 <z2), and regard the value between these two cutoff points as normal, then type I error α and type II error β become:

$$\alpha = \int_{-\infty}^{z_1} f_h(x)\,dx + \int_{z_2}^{\infty} f_h(x)\,dx$$

and

$$\beta = \int_{z_1}^{z_2} f_d(x)\,dx.$$

The pair consistency probability C can now be calculated with equation (3) and the combination of cutoff points (z1, z2) which maximizes (3) becomes the solution. In case of categorization of the variable into more than three states, we can define the optimal combination of cutoff points as follows: Let zn = -∞, wn = ∞ and z1, z2,..., zn-1, w1, w2,..., wn-1 be cutoff points where zn-1 <...<z2 <z1 <w1 <w2 <...<wn-1, and

$$p_1^{[h]} = P(z_1 < x^{[h]} < w_1), \qquad p_1^{[d]} = P(z_1 < x^{[d]} < w_1),$$

$$p_k^{[h]} = P(z_k < x^{[h]} \le z_{k-1}) + P(w_{k-1} \le x^{[h]} < w_k) \qquad (k = 2, \ldots, n),$$

$$p_k^{[d]} = P(z_k < x^{[d]} \le z_{k-1}) + P(w_{k-1} \le x^{[d]} < w_k) \qquad (k = 2, \ldots, n).$$

Then the probabilities for tied and consistent pairs become

$$p_{tied} = \sum_{k=1}^{n} p_k^{[h]}\, p_k^{[d]}, \qquad p_{con} = \sum_{k=1}^{n-1} p_k^{[h]} \sum_{j=k+1}^{n} p_j^{[d]},$$

and the pair consistency probability C can be calculated from equation (1). The combination of cutoff points that maximizes C becomes the solution.
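For the simplest case of two cutoff points around a common centre, the search can be sketched as follows. The example assumes normal distributions with equal means and different standard deviations, takes α as the healthy mass outside (z1, z2) and β as the diseased mass inside it, and uses an exhaustive grid; the names, grid range and step are illustrative assumptions.

```python
from statistics import NormalDist

def central_c(z1, z2, f_h, f_d):
    """C when values between z1 and z2 are regarded as normal."""
    alpha = f_h.cdf(z1) + (1.0 - f_h.cdf(z2))   # healthy value outside (z1, z2)
    beta = f_d.cdf(z2) - f_d.cdf(z1)            # diseased value inside (z1, z2)
    return 1.0 - (alpha + beta) / 2.0

f_h = NormalDist(0.0, 1.0)   # healthy: narrow distribution
f_d = NormalDist(0.0, 2.0)   # diseased: same centre, wider spread

# Exhaustive search over all ordered pairs on a grid from -3 to 3 (step 0.02).
grid = [i * 0.02 for i in range(-150, 151)]
pairs = [(z1, z2) for z1 in grid for z2 in grid if z1 < z2]
best_z1, best_z2 = max(pairs, key=lambda p: central_c(p[0], p[1], f_h, f_d))
best_c = central_c(best_z1, best_z2, f_h, f_d)
print(best_z1, best_z2, best_c)
```

In this symmetric setting the two cutoffs come out mirror images of each other, close to the two points where the densities cross.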

Parametric method for estimating cutoff points and predictive performance

The polychotomization methods proposed in the previous sections have been developed under conditions where the exact distribution of a prognostic or diagnostic factor in a population is known. However, in research practice we work with samples and we need to discuss whether our methods can be applied in situations involving parameter uncertainty. Although some methods were developed for correct estimation of the pair consistency probability C in these situations, including non-parametric ones [22-24], none of them addressed the estimation of cutoff points and they can therefore not be applied to our setting.

The challenge we are faced with is that if we repeat the evaluation of the pair consistency probability to find optimal cutoff points, for instance by increasing the possible value of the cutoff point with a certain step, it gives rise to estimation error just like the minimum p-value approach [9-18] and would mistakenly lead to an optimistic conclusion on the predictive performance of the model in future observations.

It is clear that we need a practical method that does not suffer from this over-estimation bias. In this paper we show that if fh and fd can be transformed to normal distributions, a parametric method provides essentially unbiased estimators of predictive performance and cutoff points.

Our method is based on the following:

a) the assumption that the probability density functions of an independent variable on the healthy and disease groups, fh and fd, are both normally distributed or can be transformed to a normal distribution,

b) the estimation of the means and standard deviations of fh and fd, mh, sh, md, and sd from sample data,

c) the localization of the optimal cutoff points based on the estimated distributions $\hat{f}_h$ and $\hat{f}_d$, and

d) the calculation of the predictive performance based on the estimated cutoff points.
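Steps (a) to (d) can be sketched for the dichotomous case as below. This is a minimal illustration rather than the authors' program: the function name, the optional transform arguments, the search range of four standard deviations and the number of grid steps are all assumptions, and the toy data are hypothetical.

```python
import math
from statistics import NormalDist

def parametric_dichotomization(healthy, diseased, transform=None, inverse=None, steps=2000):
    """Estimate a cutoff and C from fitted normal distributions (healthy centred lower)."""
    # (a) optionally transform the data towards normality
    if transform is not None:
        healthy = [transform(x) for x in healthy]
        diseased = [transform(x) for x in diseased]
    # (b) estimate means and standard deviations from the samples
    f_h = NormalDist.from_samples(healthy)
    f_d = NormalDist.from_samples(diseased)

    def c_at(z):
        alpha = 1.0 - f_h.cdf(z)
        beta = f_d.cdf(z)
        return 1.0 - (alpha + beta) / 2.0

    # (c) locate the cutoff on the estimated distributions, not on the raw data
    lo = min(f_h.mean - 4 * f_h.stdev, f_d.mean - 4 * f_d.stdev)
    hi = max(f_h.mean + 4 * f_h.stdev, f_d.mean + 4 * f_d.stdev)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    z_best = max(grid, key=c_at)
    # (d) predictive performance at the estimated cutoff
    c_best = c_at(z_best)
    cutoff = inverse(z_best) if inverse is not None else z_best
    return cutoff, c_best

# Hypothetical right-skewed measurements, analysed on the log scale.
healthy = [math.exp(v) for v in (-1.0, 0.0, 1.0)]
diseased = [math.exp(v) for v in (2.0, 3.0, 4.0)]
cutoff, c_hat = parametric_dichotomization(healthy, diseased, math.log, math.exp)
print(cutoff, c_hat)
```

The cutoff is reported on the original measurement scale, while the search itself runs on the transformed scale where the normality assumption is applied.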

Distributions of the estimators for the cutoff point and the pair consistency probability

If fh and fd are both normal and sh = sd, then the two curves intersect at x = (mh + md)/2. The pair consistency probability C takes the maximum value at this point as mentioned earlier. In the case that sh is not equal to sd, the two curves intersect at the following two points:

$$x = \frac{m_h s_d^2 - m_d s_h^2 \pm s_h s_d \sqrt{(m_h - m_d)^2 + 2\,(s_d^2 - s_h^2)\ln(s_d / s_h)}}{s_d^2 - s_h^2}$$

and the point that is located between mh and md can be used to calculate the true maximum value of the pair consistency probability C with equations (2) and (3). As it is difficult to evaluate the statistical properties of the above formulae analytically, even for the simplest dichotomization case, we performed a Monte Carlo simulation to assess the estimation of the cutoff points and the corresponding C. For these purposes, a custom simulation program was written in the programming language Pascal with the following characteristics:

a) the assumption that fh and fd are both normal,

b) generation of samples of healthy and disease groups, each with a given number of measurements, by randomly generating the value of the prognostic variable,

c) estimation of the optimal cutoff points and pair consistency probability C by naïve stepwise repeated search, in which the cutoff point is changed with a certain small step Δz and the corresponding C is evaluated based on the sample data to find a point which gives the maximum C. In case of polychotomization, this step is iterated for every combination of possible cutoff values,

d) estimation of the parameters of fh and fd and calculation of the optimal cutoff points based on the estimated distributions (including the corresponding predictive ability C), in which cutoff points are searched numerically in the same manner as the above stepwise repeated search based not on the sample data but on the estimated PDFs,

e) repetition of the above sample generation and estimation steps 10,000 or 100,000 times for each of various combinations of population parameters.
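The simulation design in (a) to (e) can be sketched in a few lines. The original program was written in Pascal and is not reproduced in the article, so the Python version below is only an approximation under stated assumptions: dichotomization only, a fixed candidate grid with step 0.05, 200 replications instead of 10,000 or more, and function names that are hypothetical.

```python
import random
from statistics import NormalDist

GRID = [i * 0.05 for i in range(-80, 161)]   # candidate cutoffs from -4 to 8

def naive_search(healthy, diseased):
    """Step (c): maximize the sample value of (sensitivity + specificity) / 2."""
    best_z, best_c = GRID[0], 0.0
    for z in GRID:
        spec = sum(x <= z for x in healthy) / len(healthy)
        sens = sum(x > z for x in diseased) / len(diseased)
        c = (sens + spec) / 2.0
        if c > best_c:
            best_z, best_c = z, c
    return best_z, best_c

def parametric_search(healthy, diseased):
    """Step (d): the same search, evaluated on the estimated normal distributions."""
    f_h = NormalDist.from_samples(healthy)
    f_d = NormalDist.from_samples(diseased)

    def c_at(z):
        return 1.0 - ((1.0 - f_h.cdf(z)) + f_d.cdf(z)) / 2.0

    best_z = max(GRID, key=c_at)
    return best_z, c_at(best_z)

def simulate(n=30, reps=200, seed=1):
    """Steps (b) and (e): repeated sampling from N(0, 1^2) and N(1.5, 2^2)."""
    rng = random.Random(seed)
    naive, parametric = [], []
    for _ in range(reps):
        healthy = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diseased = [rng.gauss(1.5, 2.0) for _ in range(n)]
        naive.append(naive_search(healthy, diseased)[1])
        parametric.append(parametric_search(healthy, diseased)[1])
    return sum(naive) / reps, sum(parametric) / reps

mean_naive, mean_parametric = simulate()
print(mean_naive, mean_parametric)
```

Because the naïve search picks the best cutoff for each particular sample, its mean estimate of C is expected to sit above the mean of the parametric estimate, which is the bias the simulation is designed to expose.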

Extension for multiple associated independent variables

Thus far, we have discussed a method for selecting cutoff points that maximizes the predictive ability of each prognostic variable individually. When a regression model has more than one explanatory variable, the version of our method presented in this article can only be applied if the variables are not associated (no correlation and no interaction). Since associations between prognostic variables are common, our method requires a multivariable extension in which cutoff points are found while taking such associations into account.

Our maximum C index approach can be applied to a multivariate scenario if the distributions of a number of prognostic variables for healthy and diseased groups can be described by multivariate normal distributions and if the calculation times are acceptable [26]. However, because we are still in the process of assessing the performance of multivariable extensions and comparisons with other approaches, we will only give a short summary below:

a) determine the regression model that best fits the observations,

b) estimate the multivariate normal distribution parameters from the observed data,

c) for a set of categorized variables defined by a combination of cutoff points, calculate the regression equation and evaluate its overall C index (based not on the observed data but on the estimated distributions),

d) iterate (c) systematically for every combination of cutoff points and select the combination of cutoff points which gives the maximum overall C index for the regression equation.

Results

Evaluation of the parametric method by Monte Carlo simulation

In this section, we present an evaluation of our parametric method, together with the naïve application of a stepwise repeated search based on multiple evaluations. In the absence of a standard method for polychotomization, the latter is currently probably the first choice for researchers, mainly due to its simplicity.

Figures 4, 5, and 6 illustrate the frequency distributions of the estimates of predictive performance C for the repeated search method and the parametric method for dichotomization (Figure 4), trichotomization (Figure 5), and polychotomization into four categories (Figure 6), when fh and fd are both normally distributed and nh = nd = 30. The true values of C were 0.722, 0.748 and 0.755 for dichotomization, trichotomization and polychotomization into four categories, respectively. Against these values, the parametric method provides essentially unbiased, normally distributed estimators (means and SDs: 0.725 ± 0.043, 0.751 ± 0.048, and 0.758 ± 0.050), whereas the repeated search method has relatively large positive biases (0.752 ± 0.048, 0.786 ± 0.051, and 0.795 ± 0.053).

Figure 4. Distributions of estimated pair consistency probability C in 100,000 simulations of dichotomization. The frequency distributions of the estimate of the pair consistency probability C by the repeated search method (dotted line) and the parametric method (solid line) in 100,000 simulations of dichotomization, with fh ~ N(0, 1²), fd ~ N(1.5, 2²) and nh = nd = 30. The class width for the graph is 0.0167.

Figure 5. Distributions of estimated pair consistency probability C in 100,000 simulations of trichotomization. The frequency distributions of the estimate of the pair consistency probability C by the repeated search method (dotted line) and the parametric method (solid line) in 100,000 simulations of trichotomization, with fh ~ N(0, 1²), fd ~ N(1.5, 2²) and nh = nd = 30. The class width for the graph is 0.0167.

Figure 6. Distributions of estimated pair consistency probability C in 100,000 simulations of polychotomization to four categories. The frequency distributions of the estimate of the pair consistency probability C by the repeated search method (dotted line) and the parametric method (solid line) in 100,000 simulations of polychotomization to four categories, with fh ~ N(0, 1²), fd ~ N(1.5, 2²) and nh = nd = 30. The class width for the graph is 0.0167.

Figure 7 shows the frequencies of the optimal cutoff point in dichotomization estimated by each of the two methods. Whereas the true cutoff point is 1.150, the estimated values and their standard deviations are 1.175 ± 0.209 with the parametric approach and 1.071 ± 0.433 with the repeated search method, which means the former provides a more accurate estimator for the cutoff point with higher precision.

Figure 7. The frequency distributions of the estimated optimal cutoff point. The frequency distributions of the optimal cutoff points estimated by the repeated search method (dotted line) and the parametric method (solid line) in 100,000 simulations of dichotomization for the same case as in Figure 4, with fh ~ N(0, 1²), fd ~ N(1.5, 2²) and nh = nd = 30. The class width for the graph is 0.02.

We repeated the above simulations for various nh and nd (nh = nd) and Figure 8 and Figure 9 summarize the results. The graphs show that the estimation by the parametric method is almost unbiased even if the sample size is relatively small, both for dichotomization (Figure 8) and trichotomization for variables whose realizations in healthy and diseased groups have a similar central tendency (Figure 9), whereas the naïve repeated search method shows non-negligible bias even when the sample size is large (n = 300).

Figure 8. Changes of estimated pair consistency probability C in dichotomization as a function of sample size. Results from Monte Carlo simulation of the changes of the mean value of the estimated pair consistency probability C by the repeated search method (red line with squares) and the parametric method (blue line with circles) for various sample sizes, each of which is calculated by 10,000 simulations of dichotomization with fh ~ N(0, 1²) and fd ~ N(1.5, 2²).

Figure 9. Changes of estimated pair consistency probability C in trichotomization as a function of sample size. The changes of the mean value of the estimated C for various sample sizes, each of which is calculated by 10,000 simulations of trichotomization for a variable with a common central tendency with fh ~ N(0, 1²) and fd ~ N(0, 2²).

Distributions of estimators from the parametric method

Table 1 shows how the pair consistency probability C increases when the number of cutoff points changes from one to three for the case that x[h] ~ N(0, 1²) and x[d] ~ N(μd, 1.5²). For instance, when the pair consistency probability for the original continuous variable is 0.8 (μd = 1.517), the pair consistency probabilities for the dichotomized, trichotomized and quatrochotomized variables are 0.738, 0.775 and 0.787, respectively.
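The worked case quoted above can be approximated numerically. The sketch below assumes that C of a categorized variable is the area under the ROC straight-line graph through the points (1 - SP, SE) of the chosen cutoffs, and that C of the continuous variable follows the standard result Φ(μd / √(1² + 1.5²)) for two normal distributions. The coarse grid search and all function names are illustrative choices, not the authors' procedure.

```python
import math
from itertools import combinations


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def polyline_c(cutoffs, mu_d, sd_d):
    """Area under the ROC straight-line graph for ascending cutoffs."""
    # One ROC point (1 - SP, SE) per cutoff, with healthy ~ N(0, 1).
    pts = [(1.0 - phi(c), 1.0 - phi((c - mu_d) / sd_d)) for c in cutoffs]
    # Higher cutoffs give smaller 1 - SP, so reverse to sort along the x axis.
    pts = [(0.0, 0.0)] + pts[::-1] + [(1.0, 1.0)]
    return sum(0.5 * (x1 - x0) * (y0 + y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def best_c(n_cutoffs, mu_d, sd_d, grid):
    """Largest C over all ascending cutoff tuples taken from the grid."""
    return max(polyline_c(cs, mu_d, sd_d)
               for cs in combinations(grid, n_cutoffs))


MU_D, SD_D = 1.517, 1.5                               # values quoted in the text
GRID = [round(-3.0 + 0.1 * i, 1) for i in range(81)]  # cutoffs -3.0 .. 5.0

c_continuous = phi(MU_D / math.sqrt(1.0 + SD_D ** 2))
c_by_cutoffs = {k: best_c(k, MU_D, SD_D, GRID) for k in (1, 2, 3)}

if __name__ == "__main__":
    print(f"continuous: {c_continuous:.3f}")
    for k, c in c_by_cutoffs.items():
        print(f"{k} cutoff(s): {c:.3f}")
```

With a grid step of 0.1 the maxima are slightly underestimated; a finer grid or a numerical optimizer would tighten the agreement with Table 1.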

Table 1. Changes of the pair consistency probability C by the number of cutoff points

Table 2 summarizes the means and standard deviations of the pair consistency probability C estimated by the parametric method for dichotomization when the sample sizes of the two groups are equal (n = 10, 25, 50, 100, 200, and 500) and σd = 1.5σh. Table 3 gives the results for trichotomization when the continuous variable in healthy and diseased cases has a common central tendency and the sample sizes of the two groups are equal. These tables can be used to evaluate the accuracy and precision of the estimated predictive ability of C for various sample sizes.

Table 2. Means and standard deviations of the estimates of C for a dichotomized variable

Table 3. Means and standard deviations of the estimates of C for a trichotomized variable

Example: Polychotomization of the prognostic factors of rhabdomyolysis

Rhabdomyolysis is a potentially lethal complication, often observed in patients who have attempted suicide with large doses of psychotropic drugs. Though it is important to make the diagnosis and begin proper treatment at an early stage, the diagnosis of rhabdomyolysis is difficult unless specific enzymes and myoglobin in skeletal muscle are detected by laboratory tests.

To find prognostic variables of rhabdomyolysis that can be used at an outpatient clinic where laboratory data are not available, we previously evaluated 131 cases of acute drug toxicosis [27-29] (rhabdomyolysis group: n = 34; non-rhabdomyolysis group: n = 97) and found twelve variables that contributed significantly to the diagnosis of rhabdomyolysis. For this example, we selected three non-laboratory variables to predict the risk at the outpatient clinic: (1) qtc: ECG QTc (non-dimensional); (2) t: time from taking the drug to hospitalization (hours); and (3) bt: body temperature (degrees Celsius).

Applying the maximum pair consistency probability criterion, the three continuous variables are categorized, assuming that qtc is a normal variable, t a log-normal variable and bt a variable with a common central tendency. Table 4 shows the selected cutoff points and the changes of the pair consistency probability. Comparing the pair consistency probabilities of the categorized variable, we can observe how predictive ability changes with polychotomization and the pair consistency probability C can be used as a measure to evaluate the loss of predictive ability by categorization.

Table 4. Optimal cutoff points for the prognostic factors of rhabdomyolysis

Considering the predictive performance of each of the categorized variables and convenience in the clinical setting, we finally chose the cutoff values 0.45 for qtc, 5.0 and 12.0 for t, and 34.0 and 37.2 for bt. We then converted the continuous variables to categorical variables. Next, we applied the cross-split-half method [30] to validate the effectiveness of prediction by these variables with logistic regression [31] and to evaluate how much a single data set overestimates prediction performance. The estimated optimism for the overall C index was 0.018, which is sufficiently small.

Example: Risk table for prognosis of rhabdomyolysis

Based on categorized variables, we obtained the new prediction formula:

p = 1/(1 + exp(7.96 - 3.13·QTC - 6.22·T1 - 3.11·T2 - 1.97·BT))     (8)

where QTC is ECG QTc (1 if greater than or equal to 0.45, 0 if less than 0.45), T1 and T2 both code the time from drug ingestion to hospitalization (T1 is 1 if the time is 12 hours or more, 0 otherwise; T2 is 1 if the time is at least 5 hours but less than 12 hours, 0 otherwise), and BT is body temperature (1 if greater than or equal to 37.2 °C or less than or equal to 34.0 °C, 0 otherwise). Since the overall index C for this formula was 0.945, we estimate that the predictive performance in future data will be around 0.927 (= 0.945 - 0.018).

To ascertain the fitness of the selected regression model, we conducted the Hosmer-Lemeshow goodness-of-fit test [32] by dividing disease probability into eight classes. The actual number of occurrences for each class showed good agreement with the expected number of occurrences of rhabdomyolysis (p = 0.618).

Since all three prognostic variables are categorized, the number of patient profiles becomes twelve (3 × 2 × 2), and the risk probabilities of rhabdomyolysis for all possible patient profiles can now be obtained by substituting each combination of the values of the categorized variables into regression formula (8). This yields a risk table for rhabdomyolysis occurrence (Table 5). For instance, if T, QTC and BT are "++", "+" and "-" respectively, we can read from the table that the risk of rhabdomyolysis is 0.801. Repeated use of this table over time will give physicians a "sense" of the disease risk.

Table 5. Probability profile table for rhabdomyolysis
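A probability profile table of this kind can be regenerated from formula (8) and the chosen cutoff values. The sketch below is a minimal illustration: the coefficients and cutoffs are those stated in the text, while the function names and the mapping of the labels "-", "+" and "++" to the time categories are assumptions inferred from the worked example (risk 0.801), not quoted from Table 5.

```python
import math
from itertools import product

# Indicator codes (T1, T2) for the three time categories; the label mapping
# is inferred from the worked example in the text (assumption).
T_CODES = {"-": (0, 0), "+": (0, 1), "++": (1, 0)}
SIGN = {"-": 0, "+": 1}


def encode(qtc, hours, temp_c):
    """Map raw measurements to the category labels (T, QTC, BT)."""
    t = "++" if hours >= 12.0 else "+" if hours >= 5.0 else "-"
    q = "+" if qtc >= 0.45 else "-"
    b = "+" if temp_c >= 37.2 or temp_c <= 34.0 else "-"
    return t, q, b


def risk(t, q, b):
    """Formula (8) evaluated for one patient profile."""
    t1, t2 = T_CODES[t]
    z = 7.96 - 3.13 * SIGN[q] - 6.22 * t1 - 3.11 * t2 - 1.97 * SIGN[b]
    return 1.0 / (1.0 + math.exp(z))


# All 3 x 2 x 2 = 12 profiles and their predicted risks.
risk_table = {profile: risk(*profile)
              for profile in product(T_CODES, SIGN, SIGN)}

if __name__ == "__main__":
    for (t, q, b), p in sorted(risk_table.items(), key=lambda kv: kv[1]):
        print(f"T={t:<2}  QTC={q}  BT={b}  risk={p:.3f}")
```

For example, `risk(*encode(0.47, 14.0, 36.5))` evaluates the profile ("++", "+", "-") and gives approximately 0.801, the value quoted in the text.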

Discussion

The criterion for optimal categorization of continuous variables in regression models may vary depending on the object of the categorization, and there have been several different approaches. Many of these approaches are inadequate for our purpose. We have proposed to use the overall discrimination index C introduced by Harrell and other authors [21-24] as the measure for predictive performance of a categorized variable. Since the overall discrimination index C has a clear and straightforward meaning as the pair consistency probability, it is intuitively logical to use it as a measure of predictive discrimination for polychotomized variables.

Though mathematically distinct, our method has much in common with previously developed methods [2-20,33-38], which can be explained through the relations between the pair consistency probability C, SE and SP, and the area under the ROC straight line graph, as expressed in formulae (2) to (6). In addition, our ROC straight line graph is closely related to the ordinal dominance (OD) curve proposed by Darlington to visualize the ordering feature of two comparative sets [39]. He showed that the OD curve is a complete representation of the rank-order properties of data and that many statistical procedures follow naturally from assessment of the curve. Bamber clarified the relation between the area above the OD curve and a measure identical to the pair consistency probability [40]. Our proof of formula (6), which relates to the ROC straight line graph, corresponds to Bamber's proof for the OD curve.
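The relation between the pair consistency probability and the area under the ROC straight line graph can be checked on a toy example. The sketch below counts healthy-diseased pairs directly (a tie counts one half) and compares the result with the trapezoidal area under the ROC points of an ordered categorical variable. The data and function names are made up for illustration.

```python
def pair_consistency(healthy, diseased):
    """C by direct counting: concordant pairs plus half of the tied pairs."""
    score = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                score += 1.0
            elif d == h:
                score += 0.5
    return score / (len(healthy) * len(diseased))


def roc_polyline_area(healthy, diseased):
    """Trapezoidal area under the ROC points of an ordered categorical variable."""
    levels = sorted(set(healthy) | set(diseased), reverse=True)
    points = [(0.0, 0.0)]
    for k in levels:  # call "positive" every category >= k
        fpr = sum(h >= k for h in healthy) / len(healthy)    # 1 - SP
        tpr = sum(d >= k for d in diseased) / len(diseased)  # SE
        points.append((fpr, tpr))
    return sum(0.5 * (x1 - x0) * (y0 + y1)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy trichotomized variable with categories 0 < 1 < 2 (illustrative data).
healthy = [0, 0, 0, 1, 1, 2]
diseased = [0, 1, 2, 2]

if __name__ == "__main__":
    print(pair_consistency(healthy, diseased))   # 0.6875
    print(roc_polyline_area(healthy, diseased))  # same value up to rounding
```

Both functions return 16.5/24 = 0.6875 for this data set, in line with the equivalence expressed by formula (6).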

Monte Carlo simulation showed that the naïve search of the maximum C index will give rise to an estimation bias, which is very much like the positive bias that affects the minimum p-value method. Such bias is also seen in the method where the cutoff point is selected in a way that maximizes the sum of SE and SP. Linnet and Brandt calculated the sample distribution of (SE + SP)/2 in the case of dichotomization using computer simulation assuming that distributions are normal, and evaluated the positive bias induced by the selection of an optimal cutoff point [4]. They found that estimates of test performance are too optimistic when the sample size is small, with an average positive bias up to 15% for a sample size of 25. We have shown that this problem does not affect our proposed parametric method.

However, there may be cases where a transformation to a normal distribution does not work well. For such cases, we expect that approximating the distribution curve with a more suitable function, or with a restricted cubic spline function [41], will make the method workable. We are currently in the process of evaluating this approach and the results will be reported elsewhere.

To keep this introduction of the maximum C index approach for polychotomizing predictive variables short and readable, we have used an example in which a regression model without correlated independent variables and without interaction fitted the observed data well (p = 0.618 by Hosmer and Lemeshow goodness-of-fit test) [42,43]. However, if correlation and interaction are relevant for the regression function, our maximum C index approach must be extended to a multivariable setting. Mazumdar extended a cutoff point search based on the maximum chi-square method to a multivariable setting [44], and showed that the cutoff points obtained by a multivariable search were closer to the true cutoff points.

Another method that is appealing for regression settings with correlated independent variables is the so-called 'simplified integer score' method, in which continuous variables are transformed into semi-continuous interval variables [41]. It has been used in numerous articles and is based on the categorization of the continuous variable and the transformation of the products of the regression coefficient and the value of the variable into integers. This method is clinically useful and can be applied to situations where explanatory variables are correlated. If the number of variables is small enough and they have few categories, this method can also be used to create the simple probability profile tables that result from our approach. We are currently in the process of evaluating a multivariable extension of the C index maximization approach, including a comparison with this method.

Along with regression models, decision trees can also be used in diagnostic or prognostic decision making [36]. Breiman et al. developed an approach called classification and regression trees (CART) to build a decision tree for medical diagnosis based on a training data set [41,45]. In these decision trees, diagnosis is made by a sequential decision making process, in which a question on an independent variable is posed at each step and, depending on the answer, a different "branch" of the tree is selected until the final result is reached. If an independent variable is continuous, dichotomization (or polychotomization) will be necessary to build a decision tree. Typically, the cutoff points are found by maximizing the total utility of the decision scheme [46,47], which appears to be closely related mathematically to our approach. Further study is necessary to make a theoretical and practical comparison.

We have indicated that it is easier for most people to read a probability profile table to obtain the risk probability than to calculate the risk with a regression formula. Additionally, probability profile tables give physicians an intuitive feel for the disease risk. Even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to both the positive and negative cases from the table. By making simplified risk tables in advance, physicians can obtain the patient's risk from an auxiliary table, even if the value of a predictor is missing. Since the table presentation of probabilities has these practical advantages, we believe our method for categorizing prognostic variables can be a helpful tool to make diagnostic or descriptive prognostic research with regression models become more applicable in clinical practice.
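The probability range for a patient with a missing predictor can be illustrated with formula (8): the unknown variable is allowed to take each of its categories, and the lowest and highest resulting risks are reported. The sketch below is a hedged illustration only; the function names and the label coding ("-", "+", "++") are assumptions, and the coefficients are those given in formula (8).

```python
import math
from itertools import product

# (T1, T2) indicator pairs for the three time categories (assumed labels).
T_CODES = {"-": (0, 0), "+": (0, 1), "++": (1, 0)}
SIGN = {"-": 0, "+": 1}


def risk(t, qtc, bt):
    """Formula (8) for one fully specified profile."""
    t1, t2 = T_CODES[t]
    z = 7.96 - 3.13 * SIGN[qtc] - 6.22 * t1 - 3.11 * t2 - 1.97 * SIGN[bt]
    return 1.0 / (1.0 + math.exp(z))


def risk_range(t=None, qtc=None, bt=None):
    """Lowest and highest risk over all categories of the unknown predictors."""
    t_options = [t] if t is not None else list(T_CODES)
    q_options = [qtc] if qtc is not None else list(SIGN)
    b_options = [bt] if bt is not None else list(SIGN)
    risks = [risk(a, b, c)
             for a, b, c in product(t_options, q_options, b_options)]
    return min(risks), max(risks)


if __name__ == "__main__":
    low, high = risk_range(t="++", qtc="+")  # body temperature not available
    print(f"risk between {low:.3f} and {high:.3f}")
```

For a patient with T "++" and QTC "+" whose body temperature is unknown, the range runs from about 0.801 (BT "-") to about 0.966 (BT "+").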

Conclusion

We have proposed a new approach for polychotomization (including dichotomization) of independent continuous variables in regression models based on the overall discrimination index C, or the pair consistency probability, introduced by Harrell. We have shown that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. We showed that the naïve application of the method gives rise to positive bias, not unlike the minimum p-value approach or the method of maximizing the sum of sensitivity and specificity, and we proposed a parametric version in which the estimates of the predictive performance and cutoff points are essentially unbiased. To evaluate the accuracy and precision of the estimate of the predictive performance, we presented tables of the means and standard deviations of the estimate of predictive performance for typical cases, obtained by Monte Carlo simulation. Finally, we provided an application of our method to a prediction rule with continuous predictor variables for rhabdomyolysis and showed that our method for polychotomizing continuous regressor variables can be a valid and useful tool to create probability profile tables. All programs (and their source codes) used in this study are available from the authors.

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

HT derived the polychotomization method, drafted the manuscript and supervised the study. LB provided feedback on methodological issues and contributed to data analysis and manuscript writing. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the late Dr. T. Tsutsumi and Dr. S. Morita for their contributions to the analysis of the rhabdomyolysis data. The authors would also like to thank Assistant Professor J. Goddard for providing useful comments and we are grateful to K. Doi for her technical assistance.

References

  1. Miettinen OS: The modern scientific physician: 3. Scientific diagnosis.

    CMAJ 2001, 165(6):781-2.

  2. Fisher RA: The use of multiple measurements in taxonomic problems.

    Annals of Eugenics 1936, 7(2):179-188.

  3. Mahalanobis PC: Mahalanobis distance.

    Proceedings National Institute of Science of India 1936, 49(2):234-256.

  4. Linnet K, Brandt E: Assessing diagnostic tests once an optimal cutoff point has been selected.

    Clin Chem 1986, 32(7):1341-1346.

  5. Bairagi R, Suchindran CM: An estimator of the cutoff point maximizing sum of sensitivity and specificity.

    Indian J Stat 1989, 51(B-2):263-269.

  6. Schäfer H: Constructing a cut-off point for a quantitative diagnostic test.

    Stat Med 1989, 8:1381-1391.

  7. Gail MH, Green SB: A generalization of the one-sided two-sample Kolmogorov-Smirnov statistics for evaluating diagnostic tests.

    Biometrics 1976, 32:561-570.

  8. Cantor SB, Sun CC, Tortolero-Luna G, Richards-Kortum R, Follen M: A comparison of C/B ratios from studies using receiver operating characteristic curve analysis.

    J Clin Epidemiol 1999, 52:885-892.

  9. Miller R, Siegmund D: Maximally selected chi square statistics.

    Biometrics 1982, 38:1011-1016.

  10. Lausen B, Schumacher M: Maximally selected rank statistics.

    Biometrics 1992, 48:73-85.

  11. Altman DG, Lausen B, Sauerbrei W, Schumacher M: Dangers of using "optimal" cutpoints in the evaluation of prognostic factors.

    J Natl Cancer Inst 1994, 86(11):829-835.

  12. Mazumdar M, Glassman J: Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer-treatments.

    Stat Med 2000, 19:113-132.

  13. Hilsenbeck SG, Clark GM, McGuire WL: Why do so many prognostic factors fail to pan out?

    Breast Cancer Res Treat 1992, 22:197-206.

  14. Cantor AB: Re: Dangers of using "optimal" cutpoints in the evaluation of prognostic factors.

    J Natl Cancer Inst 1994, 86(23):1798.

  15. Lausen B, Schumacher M: Evaluating the effect of optimized cutoff values in the assessment of prognostic factors.

    Comp Stat Data Analysis 1996, 21:307-326.

  16. Hilsenbeck SG, Clark GM: Practical p-value adjustment for optimally selected cutpoints.

    Stat Med 1996, 15:103-112.

  17. Faraggi D, Simon R: A simulation study of cross-validation for selecting an optimal cutpoint in univariable survival analysis.

    Stat Med 1996, 15:2203-2213.

  18. Contal C, O'Quigley J: An application of changepoint methods in studying the effect of age on survival in breast cancer.

    Comp Stat Data Analysis 1999, 30:253-270.

  19. Metz CE: Basic principles of ROC analysis.

    Semin Nucl Med 1978, 8(4):283-298.

  20. Kristjansson B, Hill G, McDowell I, Lindsay J: Optimal cut-points when screening for more than one disease state: an example from the Canadian study of health and aging.

    J Clin Epidemiol 1996, 49(12):1423-28.

  21. Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests.

    JAMA 1982, 247(18):2543-2546.

  22. Harrell FE Jr, Lee KL, Mark DB: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.

    Stat Med 1996, 15(4):361-387.

  23. Nam B, D'Agostino RB: Discrimination index, the area under the ROC curve. In Goodness-of-Fit Tests and Model Validity. Edited by Huber-Carol C. Boston: Birkhauser; 2003:267-279.

  24. Pencina MJ, D'Agostino RB: Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation.

    Stat Med 2004, 23(13):2109-2123.

  25. Green DM, Swets JA: Signal detection theory and psychophysics. New York: Wiley; 1966.

  26. Tsuruta H, Tsutsumi K, Doi K: The changes of predictive ability when prognostic factors are categorized. In Proceedings of the 24th Joint Conference on Medical Informatics: 26–28 November 2004; Nagoya. Japan Association for Medical Informatics; 2004:824-825.

  27. Morita S, Tsutsumi K, Doi K, Tsuruta H: Prediction of rhabdomyolysis occurring in patients with acute drug toxicosis by logistic regression model.

    Jpn J Gen Hosp Psychiatry 1998, 10:37-43.

  28. Tsuruta H, Tsutsumi K, Doi K: Prediction of rhabdomyolysis in patients with acute drug toxicosis. In Proceedings of the 21st Joint Conference on Medical Informatics: 26–28 November 2001; Hamamatsu. Japan Association for Medical Informatics; 2001:514-515.

  29. Tsuruta H, Tsutsumi K, Mochizuki M: Table presentation of the risk of rhabdomyolysis by the use of an optimal categorization method for prognostic factors and logistic regression analysis.

    In Proceedings of the 11th World Congress on Medical Informatics: 7–11 September 2004; San Francisco. AMIA Edited by Fieschi M, Coiera E, Li YCJ. 2004, 1888.

  30. Cooper RG: An empirically derived new product project selection model.

    IEEE Trans Eng Manag 1981, 28(3):54-61.

  31. Walker SH, Duncan DB: Estimation of the probability of an event as a function of several independent variables.

    Biometrika 1967, 54(1 and 2):167-179.

  32. Lemeshow S, Hosmer DW Jr: A review of goodness of fit statistics for use in the development of logistic regression models.

    Am J Epidemiol 1982, 115(1):92-106.

  33. Metz CE, Kronman HB: Statistical significance tests for binormal ROC curves.

    J Math Psych 1980, 22:218-243.

  34. Hanley JA, McNeil BJ: The meaning and use of the area under the receiver operating characteristic (ROC) curve.

    Radiology 1982, 143(1):29-36.

  35. Hanley JA, McNeil BJ: A method of comparing the areas under receiver operating characteristic curves derived from the same cases.

    Radiology 1983, 148(3):839-43.

  36. Hunink M, Glasziou P, Siegel J, Weeks J, Pliskin J, Elstein A, Milton CW: Decision Making in Health and Medicine: Integrating Evidence and Values. Cambridge: Cambridge University Press; 2001.

  37. Faraggi D, Reiser B: Estimation of the area under the ROC curve.

    Stat Med 2002, 21(20):3093-3106.

  38. Copas JB, Corbett P: Overestimation of the receiver operating characteristic curve for logistic regression.

    Biometrika 2002, 89(2):315-331.

  39. Darlington RB: Comparing two groups by simple graphs.

    Psychol Bull 1973, 79(2):110-116.

  40. Bamber D: Area above the ordinal dominance graph and the area below the receiver operating characteristic graph.

    J Math Psych 1975, 12:387-415.

  41. Harrell FE: Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer; 2001.

  42. Kleinbaum DG, Kupper LL, Muller KE: Applied regression analysis and other multivariable methods. Boston: PWS-Kent Publishing Company; 1998.

  43. Hosmer DW, Lemeshow S: Applied Logistic Regression. New York: John Wiley and Sons; 2000.

  44. Mazumdar M, Smith A, Bacik J: Methods for categorizing a prognostic variable in a multivariable setting.

    Stat Med 2003, 22:559-571.

  45. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Belmont: Wadsworth; 1984.

  46. Long WJ, Griffith JL, Selker HP, D'Agostino RB: A comparison of logistic regression to decision-tree induction in a medical domain.

    Comput Biomed Res 1993, 26:74-97.

  47. Shannon CE: A Mathematical Theory of Communication.

    The Bell System Tech J 1948, 27:379-423.

Pre-publication history

The pre-publication history for this paper can be accessed here:

http://www.biomedcentral.com/1472-6947/6/41/prepub