Boston University School of Public Health, Department of Biostatistics, Boston, Massachusetts, USA

Center for Health Quality, Outcomes and Economic Research, Bedford VAMC, Bedford, Massachusetts, USA

Boston University School of Public Health, Department of Health Services, Boston, Massachusetts, USA

University of California, Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, Los Angeles, California, USA

Abstract

Background

Providers use risk-adjustment systems to help manage healthcare costs. Typically, ordinary least squares (OLS) models on either untransformed or log-transformed cost are used. We examine the predictive ability of several statistical models, demonstrate how model choice depends on the goal for the predictive model, and examine whether building models on samples of the data affects model choice.

Methods

Our sample consisted of 525,620 Veterans Health Administration patients with mental health (MH) or substance abuse (SA) diagnoses who incurred costs during fiscal year 1999. We tested two models on a transformation of cost: a Log Normal model and a Square-root Normal model, and three generalized linear models on untransformed cost, defined by distributional assumption and link function: Normal with identity link (OLS); Gamma with log link; and Gamma with square-root link. Risk-adjusters included age, sex, and 12 MH/SA categories. To determine the best model for the entire dataset, predictive ability was evaluated using root mean square error (RMSE), mean absolute prediction error (MAPE), and predictive ratios of predicted to observed cost (PR) among deciles of predicted cost, by comparing point estimates and 95% bias-corrected bootstrap confidence intervals. To study the effect of analyzing a random sample of the population on model choice, we re-computed these statistics using random samples beginning with 5,000 patients and ending with the entire sample.

Results

The Square-root Normal model had the lowest estimates of the RMSE and MAPE, with bootstrap confidence intervals that were always lower than those for the other models. The Gamma with square-root link was best as measured by the PRs. The choice of best model could vary if smaller samples were used and the Gamma with square-root link model had convergence problems with small samples.

Conclusion

Models with square-root transformation or link fit the data best. This function (whether used as transformation or as a link) seems to help deal with the high comorbidity of this population by introducing a form of interaction. The Gamma distribution helps with the long tail of the distribution. However, the Normal distribution is suitable if the correct transformation of the outcome is used.

Background

The proportion of patients with Mental Health (MH) and Substance Abuse (SA) disorders in the Department of Veterans Affairs (VA) is higher than in other health care systems

Cost data typically present challenges because of their skewness and high percentage of zeros. Two-part models have been suggested as a method for solving the problem where data contain a high percentage of zero costs

The method most recommended for predicting cost in the health services literature, such as in "Risk Adjustment for Measuring Health Outcomes,"

Sample sizes in many health economic cost studies presented to date vary widely, ranging from a low of 45

Two questions of interest arise: 1) Which statistical model works best for our sample of over 500,000 patients if the criteria for the prediction model are low root mean square error (RMSE), low mean absolute prediction error (MAPE), and predictive ratio (PR) around 1 for the entire range of patient costs? and 2) Using these criteria, would the best model for the entire sample be chosen as best if small sample sizes are used? The focus in this paper is on methods for assessing overall model fit and not on the specific effects of the independent variables contained in the risk-adjustment system.

Methods

Data description

VA administrative databases were used to select all veterans with a MH/SA diagnosis (ICD-9-CM codes 290.00–312.99, 316) and who used VA healthcare services in FY99. All inpatient and outpatient diagnoses from FY99 were mapped into twelve MH/SA categories (Dementia/Alzheimer's Disease, Alcohol Disorder, Drug Disorder, Schizophrenia, Other Psychoses, Bipolar Disorder, Major Depression, Other Depression, Posttraumatic Stress Disorder (PTSD), Anxiety Disorder, Adjustment Disorder, and Personality Disorder). These categories are used in the VA for mental health performance monitoring

The majority of VA patients do not pay for their care. The Health Economics Resource Center (HERC), a national VA center, has created estimated costs at the visit/stay level using patient utilization data and cost data at the VA department level. In order to calculate MH/SA costs, we obtained all specialty MH/SA visits or any non-specialty visit (e.g., primary care) for which a MH/SA diagnosis had been assigned. Inpatient drug costs also are included in a patient's annual cost value. A thorough description of the cost data is found in Rosen et al.

Model specification

Estimates from the models presented in Table

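Model's description

The Family, Link, and v(μ) columns together give the model specification.

| Model Name | Dependent Variable | Family | Link | v(μ) | Equation |
|---|---|---|---|---|---|
| Normal with identity link (OLS) | $\mathbf{Y}$ | Normal | Identity | 1 | $E\{\mathbf{Y}\} = \mathbf{X}\boldsymbol{\beta}$ |
| Log Normal* | $\ln(\mathbf{Y})$ | Normal | Identity | 1 | $E\{\ln(\mathbf{Y})\} = \mathbf{X}\boldsymbol{\beta}$ |
| Sqrt Normal* | $\sqrt{\mathbf{Y}}$ | Normal | Identity | 1 | $E\{\sqrt{\mathbf{Y}}\} = \mathbf{X}\boldsymbol{\beta}$ |
| Gamma with log link | $\mathbf{Y}$ | Gamma | Log | $\mu^{2}$ | $\ln\{E(\mathbf{Y})\} = \mathbf{X}\boldsymbol{\beta}$ |
| Gamma with square-root link | $\mathbf{Y}$ | Gamma | Square Root | $\mu^{2}$ | $\sqrt{E(\mathbf{Y})} = \mathbf{X}\boldsymbol{\beta}$ |

* Transformation of dependent variable.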

Note:

Models

Cost data are non-negative and usually right-skewed. These characteristics guide our choice of models for comparison. In addition, we consider models commonly used in the literature.

Ordinary least squares (OLS) regression on cost is the most common model used. It is the easiest to understand because the independent variables have additive effects:

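$$E\{\mathbf{Y}\} = \mathbf{X}\boldsymbol{\beta}$$

where $\mathbf{Y}$ are independent Normal variables with constant variance $\sigma^{2}$. Predictions are given in the original scale by $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$.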

Even though OLS is very popular, it does not directly deal with the two characteristics of cost data: non-negative and right-skewed. Transformation of the dependent variable can sometimes solve these issues. In this paper we tested two transformations: log and square root transformation. The first case is commonly called a Log Normal model and the independent variables have additive effects on the log scale:

$$E\{\ln(\mathbf{Y})\} = \mathbf{X}\boldsymbol{\beta}$$

where $\ln(\mathbf{Y})$ are independent Normal variables with constant variance $\sigma^{2}$. Retransformation is necessary in order to get predictions on the original scale. However, direct retransformation, $\exp(\mathbf{X}\hat{\boldsymbol{\beta}})$, assumes that $E\{\ln(\mathbf{Y})\} = \ln\{E(\mathbf{Y})\}$, which is not usually true. Several adjustments have been recommended to deal with this problem

where

Square-root transformation models are similar to the Log Normal model in that retransformation is necessary to bring prediction to the original scale. In this case the model is specified as:

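$$E\{\sqrt{\mathbf{Y}}\} = \mathbf{X}\boldsymbol{\beta}$$

where $\sqrt{\mathbf{Y}}$ are independent Normal variables with constant variance $\sigma^{2}$. Direct retransformation gives $(\mathbf{X}\hat{\boldsymbol{\beta}})^{2}$ and, after forcing mean predicted to equal mean observed, final predictions are given by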

where
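The retransformation step for the two transformed-cost models can be sketched as follows. The text states only that mean predicted cost was forced to equal mean observed cost; the single multiplicative rescaling factor used here, and the function names `rescale_to_observed_mean`, `retransform_log`, and `retransform_sqrt`, are assumptions made for illustration and may differ from the authors' exact adjustment.

```python
import math


def rescale_to_observed_mean(raw_predictions, observed):
    """Multiply predictions by one factor so their mean equals the observed mean.

    Assumption: a single multiplicative factor is used; the paper's exact
    adjustment formula is not reproduced here.
    """
    mean_observed = sum(observed) / len(observed)
    mean_raw = sum(raw_predictions) / len(raw_predictions)
    factor = mean_observed / mean_raw
    return [factor * p for p in raw_predictions]


def retransform_log(linear_predictor, observed):
    """Log Normal model: exponentiate the fitted values, then rescale."""
    raw = [math.exp(eta) for eta in linear_predictor]
    return rescale_to_observed_mean(raw, observed)


def retransform_sqrt(linear_predictor, observed):
    """Square-root Normal model: square the fitted values, then rescale."""
    raw = [eta ** 2 for eta in linear_predictor]
    return rescale_to_observed_mean(raw, observed)


# Hypothetical observed costs and fitted values on the square-root scale.
observed_cost = [100.0, 400.0, 2500.0]
sqrt_scale_fit = [9.0, 21.0, 48.0]
final_predictions = retransform_sqrt(sqrt_scale_fit, observed_cost)
```

With this kind of rescaling the overall mean is matched by construction, so any remaining misfit shows up within subgroups, such as deciles of predicted cost, rather than in the overall mean.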

Generalized linear models (GLMs) are a class of models of the form $g\{E(\mathbf{Y})\} = \mathbf{X}\boldsymbol{\beta}$, where $g$ is the link function

Some authors have recommended the use of the Gamma distribution in cost models

where
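As a rough illustration of how a Gamma GLM with either link can be fitted, the sketch below uses iteratively reweighted least squares (IRLS) with a single covariate. It is not the authors' software or procedure: the function name `fit_gamma_glm`, the fixed iteration count, and the toy data are all assumptions, and a real analysis would use an established GLM routine with convergence diagnostics.

```python
import math

# Each link is (g, inverse of g, derivative of g with respect to mu).
LINKS = {
    "log": (math.log, math.exp, lambda mu: 1.0 / mu),
    "sqrt": (math.sqrt, lambda eta: eta ** 2, lambda mu: 0.5 / math.sqrt(mu)),
}


def fit_gamma_glm(x, y, link="log", iterations=50):
    """Fit g{E(Y)} = b0 + b1*x with Gamma variance v(mu) = mu^2 by IRLS.

    Illustrative only: one covariate, a fixed number of iterations, and no
    safeguards if the linear predictor leaves the valid range of the link.
    """
    g, g_inv, g_prime = LINKS[link]
    # Start from the overall mean so every fitted mean is positive.
    b0, b1 = g(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [g_inv(e) for e in eta]
        # Working response and weights for the weighted least-squares step.
        z = [e + (yi - m) * g_prime(m) for e, yi, m in zip(eta, y, mu)]
        w = [1.0 / (m ** 2 * g_prime(m) ** 2) for m in mu]
        total_w = sum(w)
        x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
        z_bar = sum(wi * zi for wi, zi in zip(w, z)) / total_w
        sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
        sxz = sum(wi * (xi - x_bar) * (zi - z_bar) for wi, xi, zi in zip(w, x, z))
        b1 = sxz / sxx
        b0 = z_bar - b1 * x_bar
    return b0, b1, g_inv


# Toy data: x is a hypothetical indicator for one diagnosis category.
x = [0, 0, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 4.0, 8.0, 12.0]
b0_log, b1_log, inv_log = fit_gamma_glm(x, y, link="log")
b0_sqrt, b1_sqrt, inv_sqrt = fit_gamma_glm(x, y, link="sqrt")
```

With a single binary covariate, both links reproduce the two group means, so the links differ only once several covariates are combined in the linear predictor.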

Model selection and validation

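The models' predictive ability was evaluated using the root mean square error (RMSE) and the mean absolute prediction error (MAPE). These are common statistics used to assess models in the risk-adjustment and health economics literature. Denoting $Y_{i}$ as the observed cost and $\hat{Y}_{i}$ as the predicted cost for patient $i$, $i = 1, \ldots, n$, they are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{i} - \hat{Y}_{i}\right)^{2}}$$

and

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_{i} - \hat{Y}_{i}\right|$$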

Large values indicate a poor fit. To further test each model's performance, we also looked at predictive ratios (PRs), which are a group-level type of measure
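The three measures can be computed along the following lines. This is a minimal sketch: the function names are hypothetical, and the way patients are cut into groups of predicted cost (sorting and slicing into nearly equal-sized bins) is an assumption about a detail the text does not spell out.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted costs."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean absolute prediction error between observed and predicted costs."""
    n = len(observed)
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / n


def predictive_ratios(observed, predicted, groups=10):
    """Mean predicted over mean observed cost within groups of predicted cost.

    Patients are sorted by predicted cost and cut into `groups` nearly
    equal-sized bins (deciles when groups=10). Requires at least `groups`
    patients and a positive observed total in every bin.
    """
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    ratios = []
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        total_predicted = sum(p for p, _ in chunk)
        total_observed = sum(y for _, y in chunk)
        # Bins share one size per bin, so the ratio of sums is the ratio of means.
        ratios.append(total_predicted / total_observed)
    return ratios


# Hypothetical costs for four patients, split into two groups for brevity.
observed_cost = [10.0, 20.0, 30.0, 40.0]
predicted_cost = [12.0, 18.0, 33.0, 37.0]
ratios = predictive_ratios(observed_cost, predicted_cost, groups=2)
```

In this toy example both ratios equal 1 even though every individual prediction is off, which illustrates why a group-level measure is reported alongside RMSE and MAPE rather than instead of them.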

Sample size study

Data access problems, costs, and other obstacles often limit the amount of data available for building a risk adjustment model. To study the effect of developing the model on a subset of the data, we examined the effect of sample size on model choice using randomly selected samples of various sizes. We sought to answer the question: is model choice affected by the size of the sample used? We varied sample sizes between 5,000 and the entire sample. However, the results for samples above 80,000 were very stable. For each size, we randomly selected 100 samples, ran all five models, and computed all the statistical measures mentioned above (RMSE, MAPE, and PRs by decile of predicted cost). We compare 95% percentile confidence intervals of each measure within each sample size for all five models.
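The resampling design described above can be sketched as follows. Sampling without replacement, the nearest-rank percentile rule, and the names `percentile_interval` and `sample_size_study` are assumptions for illustration; in the study the statistic would involve fitting a model and computing RMSE, MAPE, or a PR, whereas the example uses mean cost as a simple stand-in.

```python
import random


def percentile_interval(values, level=0.95):
    """Empirical percentile interval using nearest-rank positions."""
    ordered = sorted(values)
    n = len(ordered)
    lower_index = int(round((1 - level) / 2 * (n - 1)))
    upper_index = int(round((1 + level) / 2 * (n - 1)))
    return ordered[lower_index], ordered[upper_index]


def sample_size_study(records, statistic, sample_size, replications=100, seed=0):
    """Draw repeated random samples and summarize a statistic across them.

    Assumption: samples are drawn without replacement from `records`.
    """
    rng = random.Random(seed)
    results = [statistic(rng.sample(records, sample_size)) for _ in range(replications)]
    return percentile_interval(results)


def mean_cost(sample):
    """Stand-in statistic; a real study would fit a model here."""
    return sum(sample) / len(sample)


# Hypothetical right-skewed costs: many small values and a few large ones.
costs = [float(c) for c in range(1, 901)] + [5000.0 + 100.0 * k for k in range(100)]
interval_small = sample_size_study(costs, mean_cost, sample_size=50)
interval_large = sample_size_study(costs, mean_cost, sample_size=500)
```

Comparing the intervals across sample sizes for each model is what allows a statement such as "results for samples above 80,000 were very stable."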

Results

The total sample consisted of 525,620 patients; 95% were males, with a mean age of 57 years (SD = 14.0); about 30% were age 65 and older. The dependent variable, total MH/SA cost, had a mean of 2,602 (SD = 11,052), a median equal to 385, and a skewness of 14. The residual plot from the Normal Identity model indicated heteroscedasticity and the QQ plot showed non-normality of the error term. The QQ plot of the residuals from the Log Normal model also indicated non-normality. Predicted values for costs in the Normal Identity model took on negative values for 19.6% of the total sample.

Table

Root mean square error (RMSE) and mean absolute prediction error (MAPE) results obtained for the 5 models run in the full sample of 525,620 patients

| Model | RMSE Estimate | RMSE 95% Conf. Int.* | MAPE Estimate | MAPE 95% Conf. Int.* |
|---|---|---|---|---|
| Normal with Identity link (OLS) | 10,397 | 10,130 to 10,657 | 2,997 | 2,941 to 3,052 |
| Log Normal | 13,974 | 13,585 to 14,352 | 2,801 | 2,759 to 2,840 |
| Sqrt Normal | 9,860 | 9,644 to 10,070 | 2,554 | 2,514 to 2,592 |
| Gamma with log link | 21,374 | 20,246 to 22,552 | 3,324 | 3,249 to 3,395 |
| Gamma with square-root link | 10,434 | 10,193 to 10,708 | 2,797 | 2,744 to 2,859 |

* Bias-corrected bootstrap confidence interval

Inspection of the PRs among deciles of predicted total MH/SA cost (see Figure

**Predictive ratios (PR) per decile of predicted cost in full sample (N = 525,620)**. PR is computed as the ratio of predicted cost to observed cost for deciles of predicted cost. For each decile, PR = 1 when mean predicted cost equals mean observed cost. Also shown are 95% bias-corrected bootstrap confidence intervals.

Sample size study

The Gamma with square-root link model had convergence problems with the smallest samples. The number of samples for which the model did not converge ranged from 1 (samples with 55,000 patients) to 50 (samples with 5,000 patients) out of 100 replications within each sample size tested. The problem occurred most often in three categories, all for females, in the age groups 70–74, 80–84, and 85 or older. In the overall sample, these groups had 74, 70, and 35 patients, respectively. When resampling, these small groups become extremely small, with categories having only 1 or 2 patients after sampling. Although the procedure had problems converging when fitting the Gamma with square-root link, the other four statistical models were not affected.

Figure

**95% root mean square error (RMSE) percentile intervals per model at each simulation of various sample sizes**.

**95% mean absolute prediction error (MAPE) percentile intervals per model at each simulation of various sample sizes**.

**95% predictive ratio for decile 10 (PR10) percentile intervals per model at each simulation of various sample sizes**. PR is computed as the ratio of predicted cost to observed cost within decile 10 of predicted cost. PR = 1 when mean predicted cost equals mean observed cost. The simulations at each sample size are based on 100 samples with the exception of the simulations for the Gamma Square Root model. Samples for which the model did not converge are dropped: 50 when sampling 5,000 subjects, 15 for 10,000, 16 for 15,000, 8 for 20,000, 7 for 25,000 and 30,000, 5 for 35,000, and 1 for 50,000 and 55,000.

In Figure

For samples with 5,000 patients, both the Gamma with square-root link and OLS models have percentile intervals for the predictive ratio in decile 10 that include 1.0 (see Figure

Discussion

This analysis used five statistical models to predict cost for a population of patients with MH/SA disorders in the VA. Several methods for overall model fit, as well as fit within deciles of predicted costs, were used to test the predictive ability of the models. Moreover, a test of sensitivity of model choice to sample size was performed using simulation methods.

Ordinary least squares is often used to regress cost on patient characteristics. The population tested in this study has multiple comorbidities, with some patients, possibly a large proportion, incurring very high costs. As a result, the distribution of costs is strongly right-skewed and the residuals from the model are not normally distributed. Nevertheless, even for distributions that account for long tails, there often are not enough observations with extremely high values to estimate the tail accurately.

The sample used in this study is large (more than 10 times larger than what is reported for other studies) and allows for extensive study of how well each of the models predicts, both in the full sample and in smaller samples. This is of great importance, given that in many studies researchers do not have access to such large datasets or, for other reasons, cannot analyze data from an entire population.

The Gamma Log model was found to be the worst model in every statistic analyzed. It did particularly poorly for the RMSE, with a value that was more than double the smallest RMSE value, which corresponded to the Square-root Normal model. It also performed poorly for deciles of predicted cost, underpredicting consistently for the first 9 deciles and overpredicting in the 10th decile.

Nixon and Thompson

The models tailored to deal with the skewed sample perform reasonably well. In the overall sample, models with square-root transformation or link perform the best. This could be because the square-root transformation forces a form of interaction among the independent variables that might be needed in this sample, since many of the patients have multiple MH/SA conditions. Interactions usually are not used in risk-adjustment systems except for systems that use hierarchies within conditions. However, hierarchies are a limited form of interaction and are designed primarily to avoid double counting specific diagnoses within a disease category, e.g., for a patient with paranoid schizophrenia and psychoses NOS ("not otherwise specified"), only the paranoid schizophrenia is counted. The Square-root Normal model has the smallest MAPE and RMSE, and these are statistically different from the other models' values. The Gamma with square-root link has PRs that are (for each decile) consistently very close to 1.
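One way to see where the interaction comes from, as a sketch with two hypothetical indicator variables $x_1$ and $x_2$ (for example, two MH/SA categories) and coefficients $\beta_0$, $\beta_1$, $\beta_2$ that are not taken from the fitted models, is to square the linear predictor that a square-root transformation or link places on the square-root scale:

```latex
\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2\right)^{2}
  = \beta_0^{2}
  + 2\beta_0\beta_1 x_1 + 2\beta_0\beta_2 x_2
  + \beta_1^{2} x_1^{2} + \beta_2^{2} x_2^{2}
  + 2\beta_1\beta_2 x_1 x_2
```

The cross term $2\beta_1\beta_2 x_1 x_2$ is nonzero only when both conditions are present, so on the cost scale the effect of one condition depends on the other even though no interaction term is entered in the model. For the Square-root Normal model this holds up to the retransformation adjustment.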

The Log Normal is a multiplicative model. It does well when assessed on the log scale (not shown here) but after retransformation, even with adjustments, it does poorly. One reason is that we are using a sample of all MH/SA patients, who are, in general, a highly comorbid population within the VA. The most comorbid individuals are the ones found in the upper deciles. When bringing predictions back to the original scale, the multiplicative effect in this model causes large predictions, as evidenced by an extremely high PR in the 10th decile. The overprediction in the 10th decile, together with the fact that we are forcing the mean predicted to equal the mean observed, translates into very poor predictions in the middle deciles.

Simulation results show that even though on average results do not differ from those in the larger sample, gamma models have some convergence problems for smaller sample sizes. However, this problem is directly related to the extremely small number of subjects in certain cells. This can be dealt with by inspecting the data before running the model. In the case of obtaining a sample with very small numbers for certain categories, the investigator should consider combining categories with small cell sizes before deciding that gamma models cannot be run. In the sample presented, the Gamma with square-root link model gives a very good fit in the overall sample; even for the small samples where the model converges, this is a reasonable choice based on the statistics we assessed.
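Inspecting cell sizes before fitting can be sketched as below. The field names `sex` and `age_group`, the minimum cell size, and the choice of which age groups to merge are all hypothetical; the source recommends combining small categories but does not prescribe a threshold or a particular recoding.

```python
from collections import Counter


def sparse_cells(records, keys, minimum=5):
    """Return observed cells (combinations of `keys`) with fewer than `minimum` records.

    Cells with no records at all do not appear in the result.
    """
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {cell: n for cell, n in counts.items() if n < minimum}


def collapse_age_groups(record, mapping):
    """Return a copy of `record` with its age group recoded through `mapping`."""
    merged = dict(record)
    merged["age_group"] = mapping.get(record["age_group"], record["age_group"])
    return merged


# Hypothetical small sample in which older female age groups are nearly empty.
sample = (
    [{"sex": "F", "age_group": "80-84"}] * 2
    + [{"sex": "F", "age_group": "85+"}]
    + [{"sex": "M", "age_group": "85+"}] * 6
)
before = sparse_cells(sample, ("sex", "age_group"), minimum=3)

# Merge the two oldest groups into one category, then check again.
recode = {"80-84": "80+", "85+": "80+"}
collapsed = [collapse_age_groups(r, recode) for r in sample]
after = sparse_cells(collapsed, ("sex", "age_group"), minimum=3)
```

Here the two sparse female cells disappear once the oldest age groups are merged, at the cost of a coarser age adjustment.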

Choosing a parsimonious model is an important statistical practice. This argument often is used to justify the choice of OLS risk-adjustment models. However, parsimony requires that the model be the simplest one possible that also fits the data well. Given the large percentage of negative predictions from our OLS model, OLS does not meet this requirement in this study.

More advanced models have been introduced in the literature that are an extension of the GLM models. Basu and Rathouz 2005

Conclusion

This work provides further statistical information on model performance and model choice for risk-adjustment models used for predicting costs in patients with MH/SA diagnoses. We used one MH/SA risk-adjustment system and compared five different statistical models. We found that the models with square-root transformation or link performed best in the full sample. This function (whether used as transformation or as a link) seems to help deal with the high comorbidity of this population by introducing a form of interaction. The Gamma distribution models the variance better, as seen in better predictions throughout all 10 deciles. However, the Normal distribution is suitable if the correct transformation (square-root in our case) of the outcome is used, and this should be true when this method is applied to highly comorbid populations. For smaller samples, the Gamma with square-root link model had problems converging. However, this was directly tied to very small numbers in certain categories and can be solved by collapsing some of the categories. OLS on untransformed cost and the Log Normal and Square-root Normal models are relatively unaffected by the sample size for the criteria we used, while the GLMs assuming a Gamma distribution are less consistent for smaller sample sizes.

Authors' contributions

MMR carried out the statistical analysis and drafted the manuscript including the preparation of tables and figures. CLC carried out the modelling and drafted most of the discussion. SL constructed the cost data. SLE and AKR critically revised different sections of the manuscript. All authors contributed to commenting on drafts of the manuscript and have read and approved the final manuscript.

Acknowledgements

This research was supported by VA HSR&D Service, Grant # IIR 20-035-2 awarded to Amy K. Rosen. The authors appreciate all the help provided by Priti Shokeen in the management of the project. The authors would also like to thank Tim Heeren and two reviewers for their comments that greatly improved the content of the paper.
