Abstract
Background
Within longitudinal epidemiological research, ‘count’ outcome variables with an excess of zeros frequently occur. Although these outcomes are frequently analysed with a linear mixed model, or a Poisson mixed model, a twopart mixed model would be better in analysing outcome variables with an excess of zeros. Therefore, objective of this paper was to introduce the relatively ‘new’ method of twopart joint regression modelling in longitudinal data analysis for outcome variables with an excess of zeros, and to compare the performance of this method to current approaches.
Methods
Within an observational longitudinal dataset, we compared three techniques; two ‘standard’ approaches (a linear mixed model, and a Poisson mixed model), and a twopart joint mixed model (a binomial/Poisson mixed distribution model), including random intercepts and random slopes. Model fit indicators, and differences between predicted and observed values were used for comparisons. The analyses were performed with STATA using the GLLAMM procedure.
Results
Regarding the random intercept models, the twopart joint mixed model (binomial/Poisson) performed best. Adding random slopes for time to the models changed the sign of the regression coefficient for both the Poisson mixed model and the twopart joint mixed model (binomial/Poisson) and resulted into a much better fit.
Conclusion
This paper showed that a twopart joint mixed model is a more appropriate method to analyse longitudinal data with an excess of zeros compared to a linear mixed model and a Poisson mixed model. However, in a model with random slopes for time a Poisson mixed model also performed remarkably well.
Keywords:
Twopart joint model; Excess of zeros; Count; Mixed modelling; Longitudinal; Statistical methodsBackground
Within longitudinal epidemiological research, ‘count’ outcome variables frequently occur. Nowadays it is possible to analyse longitudinal ‘count’ outcome variables with advanced statistical techniques such as mixed models. Because ‘count’ data often follow a Poisson distribution, these data are mostly analysed with longitudinal Poisson regression. In many situations ‘count’ data does not exactly follow a Poisson distribution; they are often overdispersed, (i.e. the variance of the outcome variable is higher than the mean value). One of the solutions to deal with this overdispersion in count data is to use a negative binomial regression analysis [1]. However, overdispersion in the count variable is mostly caused by an excess of zeros, which cannot completely be controlled by assuming a negative binomial distribution. Examples of data with an excess of zeros (which are also known as ‘semicontinuous’ data) [2] within the field of epidemiology are: the number of hypoglycaemic events in diabetics, the number of hospitalisations in the general population, the number of sports injuries, the number of falls in a group of elderly people and the number of cigarettes smoked.
The classical methods to analyse outcome variables with an excess of zeros are to reduce the information in the data to either a dichotomous outcome variable (mostly comparing zero versus nonzero) or a categorical outcome variable (mostly comparing zero versus two groups of nonzero outcomes in which the groups are divided according to the median of the nonzero part). Sometimes, researchers try to transform (with a logarithmic transformation) a Poisson distribution with many zeros into a normally distributed variable. However, zeros cannot be log transformed and other computations such as adding ‘1’ to the ‘count’ outcomes with an excess of zeros before log transforming does not solve the problem either.
To properly address the problem of excess of zeros, several socalled twopart statistical models have been developed. These models, which are particularly popular in econometrics, are also known as mixed response or mixed distribution models and they include zeroinflated Poisson (ZIP) regression, zeroinflated negative binomial (ZINB) regression, sample selection methods, and hurdle models [316]. The idea behind these twopart approaches is that the outcome variable has a mixed distribution (i.e. a binomial distribution to deal with zero versus nonzero, and a Poisson (or other) distribution to deal with the nonzero part of the distribution). In the standard twopart approaches the two processes are split and for every process different regression coefficients are obtained. This also means that different sets of covariates can be included, one set for the binomial process (zero versus nonzero) and one set for the Poisson process. In a ZIP model, for instance, one regression coefficient reflects the relationship of a certain covariate with zero versus nonzero, while another regression coefficient reflects the relationship with the ‘count’ outcomes above zero. [17,18]. For some research questions (e.g. investigating the determinants of smoking behaviour) this is a nice feature. However, in many situations one regression coefficient for each covariate would be preferable (e.g. the analysis of hypoglycaemic events). Despite the preference of one regression coefficient, it should be realized that this regression coefficient is somewhat difficult to interpret, because it combines a binomial and a Poisson process into one coefficient. Models that provide one set of regression coefficients for the binomial distribution and Poisson (or other) distribution combined are known as twopart joint regression models [1922]. For longitudinal data analysis these twopart joint regression models are almost never used in epidemiological practice.
The objective of this paper is to introduce a relatively ‘new’ method of a twopart joint mixed model (binomial/Poisson) in longitudinal data analysis for ‘count’ outcome variables with an excess of zeros. Furthermore, the performance of this new method will be compared to a linear mixed model and a Poisson mixed model; two models that are frequently used for longitudinal epidemiological data.
Methods
Dataset
The observational longitudinal dataset used for the analyses was obtained from the Study of the Psychological Impact in Real care of Initiating insulin glargine Treatment (SPIRIT) conducted between 2005 and 2009. This study aimed to examine the use of insulin glargine (a long acting insulin analog) on general emotional wellbeing, diabetes symptom distress and worries about hypoglycaemia in Dutch type 2 diabetes patients who previously used oral antihyperglycaemic medication. Type 2 diabetes patients who used oral antihyperglycaemic agents were recruited from 363 Dutch primary care practices, which were spread across the country. This resulted in a total sample of 889 patients. Measurements were conducted at baseline, after three and after six months. Results from this study have been presented previously [23]. We reanalysed the data in order to assess the development over time in hypoglycaemic events for diabetic patients, and the difference between low and high educated diabetes patients with three different mixed models.
Statistical methods
All analyses were performed within the framework of longitudinal mixed models, The general idea behind mixed models for longitudinal data analysis is that an adjustment is made for the correlated outcome observations within individuals over time by estimating either the differences in average values of the outcome and/or the differences in relationships with timedependent covariates. These differences i.e. variances are known as random effects and can be added to the intercept of the regression model (i.e. random intercept) and/or to the different regression coefficients of timedependent covariates (i.e. random slopes) [2426]. In this paper two ‘standard’ approaches, i.e. a linear mixed model treating the outcome variables as normally distributed and a Poisson mixed model treating the outcome variables as Poisson distributed, will be compared with a twopart joint regression model in order to analyse the development over time and to analyse the differences between low and high educated patients. Equation 1a shows the linear mixed model with only a random intercept, while equation 1b shows the linear mixed model with both a random intercept and a random slope for time.
Where y_{ij} is the hypoglycaemic score for the j^{th} patient at the i^{th} time, x_{ij} is the corresponding time, β_{1} the fixed intercept of the patients,ζ_{1j} the patientspecific random intercept, β_{2} the fixed slope of the patients, ζ_{2j} the patientspecific random slope, and ζ_{ij} is a patientspecific residual error term at the i^{th} time [26]. It was assumed that each of the two variations in the random intercept and the random slope was normally distributed with an average of zero and a variance σ^{2}. Furthermore, the Poisson (ln(μ_{ij})) mixed model can be specified in a similar way.
For the twopart joint approach, a binomial/Poisson mixed distribution was used. The general idea behind this mixture is that the outcome variable has a binomial distribution for the zero versus nonzero part and a Poisson distribution for the nonzero part. The binomial distribution is modelled by a logit link function, while the Poisson distribution is modelled by a log link function. The response probability of a longitudinal twopart joint binomial/Poisson regression model can be written down as:
The first part of the equation has a mean of zero and the second part of the equation has a mean that depends on the covariates (time). π_{1} and π_{2}=1−π_{1} are the component weights/latent class probabilities and g(y_{ij};μ_{ij}) is the Poisson probability for count y_{ij} with mean μ_{ij}[27]. A full explanation of the mathematical background of the analyses with mixed distribution models can be found in other papers [2836]. For the twopart joint model, random intercepts and random slopes can be added in a similar fashion as for the linear mixed model.
In the present analyses, educational level was modelled as a dichotomous variable distinguishing between low and high education (with low education as reference), time was modelled as a categorical variable (represented by two dummy variables, with baseline as reference). Two model fit parameters were used to compare the three models with each other. Firstly, the Bayesian information criterion (BIC) was used. The BIC is an indicator of model fit, based on the −2 log likelihood, but taking into account the number of parameters estimated [37]. A lower BIC indicates a better performance of the model. Secondly, predicted frequencies (including the random effects) of the outcome variable, obtained when fitting the models, were compared to observed frequencies in hypoglycaemic events to compare the accuracy of the different models. This comparison was graphically presented in scatter plots. In addition, the means of the squared residuals (MSR) were computed for the different models. A lower MSR indicates a better performance of the model.
All analyses were performed with Stata (version 11.1) [38]. Estimations were performed with the GLLAMM procedure [26,39], using adaptive quadrature to estimate the random effects. Scatter plots were created within PASW Statistics 18 [40].
Results
Table 1 shows the number, the proportion, and the median of the patients who have experienced ≥1 hypoglycaemic event for the three measurements over time by education as well as for the total number of patients. The proportion of both lower and higher educated patients that experienced ≥1 hypoglycaemic event increased over time. 34.5% of the lower educated patients experienced ≥1 hypoglycaemic event at baseline and after six months this increased to 37.9%, for the higher educated patients the percentage increased from 43.1% to 50.4%. In contrast, the median number of events for subjects with ≥1 hypoglycaemic event decreased over time. For the lower educated patients the median decreased from 4 to 2 and for the higher educated patients from 4 to 3. Table 2 shows the results of the analyses relating the hypoglycaemic events (dependent variable) to educational level (independent variable) when the number of hypoglycaemic episodes was treated as normal, Poisson or binomial/Poisson (twopart joint) distributed. All three models showed a significant positive relationship between education and the number of hypoglycaemic events. The model fit was best for the twopart joint mixed model (binomial/Poisson) (BIC: 6687.64, MSR: 7.26). Furthermore, Figure 1 depicts the accuracy of the different analyses in scatter plots of observed vs. predicted values. The binomial/Poisson model clearly performed best especially in correctly predicting the number of patients with zero events.
Table 1. The proportion and median of diabetes patients with ≥ 1 hypoglycaemic event by time and educational level*
Table 2. Regression and model fit parameters for the three longitudinal models with a random intercept, evaluating the difference in hypoglycaemic events for education*
Figure 1. Scatter plots of the observed vs. predicted values for the three longitudinal models with a random intercept, evaluating the hypoglycaemic events for education.
Table 3 shows the results of the analyses regarding the development over time as independent variable with only a random intercept. In all three models the regression coefficients for time were negative, and the corresponding Pvalues at T2 were significant. Comparing both the fit indicators (Table 3) and the accuracy (Figure 2), similar results were found as for the analyses comparing higher and lower educated patients. The twopart joint model (binomial/Poisson) had the best model fit (BIC: 7013.64, MSR: 6.56) and was also best in correctly predicting the zero events. However, the models changed considerably once random slopes for time were added to the models (Table 4): The signs of the regression coefficients for the Poisson mixed model and the twopart joint mixed model (binomial/Poisson) changed from negative to positive. The regression coefficients derived from the Poisson mixed model changed from −0.11 (3 months) and −0.26 (6 months) to 0.28 (3 months) and 0.38 (6 months) when random slopes were included. For the twopart joint mixed model (binomial/Poisson) the regression coefficients changed from −0.18(3 months) and −0.27 (6 months) to 0.12 (3 months) and 0.25 (6 months). Adding random slopes to the models resulted in a much better fit for the Poisson (BIC: 6774.75, MSR: 0.24) and the twopart joint mixed model (binomial/Poisson) (BIC: 6467.55, MSR: 0.30). Furthermore, the predicted values (Figure 3) for the Poisson mixed model and the twopart joint mixed model (binomial/Poisson) were in close accordance to the observed values. However, to a small extent there was still a difference in the correctly estimated zeros in favour of the twopart joint mixed model (binomial/Poisson). In total, 89.5% of the zeros were correctly estimated for the Poisson mixed model and 92.8% of the zeros were correctly estimated for the longitudinal twopart joint mixed model.
Table 3. Regression and model fit parameters for the three longitudinal models with a random intercept, evaluating the difference in development of the hypoglycaemic events over time
Figure 2. Scatter plots of the observed vs. predicted values for the three longitudinal models with only a random intercept, evaluating the difference in development of the hypoglycaemic events over time.
Table 4. Regression and model fit parameters for the three longitudinal models with a random intercept and random slopes for time, evaluating the difference in development of the hypoglycaemic events over time
Figure 3. Scatter plots of the observed vs. predicted values for the three longitudinal models with a random intercept and random slopes for time, evaluating the difference in development of the hypoglycaemic events over time.
Discussion
This study showed that the twopart joint mixed model (binomial/Poisson) model performed much better than the ‘conventional’ mixed models when only a random intercept was added to the models. This was especially the case in estimating the excess of zeros. However, when random slopes were added to the models, performance of the Poisson mixed model increased considerably and performed more or less the same as the twopart joint mixed model (binomial/Poisson).
It is known from the literature that Poisson regression can handle even a high fraction of zeros [1]. In the present study the percentage of subjects having zero events was relatively high and decreased over time from 61.4% to 56.4%. However, it is not exactly known to what extent the Poisson distribution would be able to model the excess of zeros. In addition, the performance of the Poisson mixed model regarding the number of zeros improved considerably when random slopes were added to the model. Surprisingly, adding random slopes to the model resulted not only in a much better fit, but also in a sign change for the development over time in both the Poisson mixed model and the twopart joint mixed model (binomial/Poisson). Although is it not clear why this sign change occurs, a possible explanation can be that in a model with a random intercept, only the ‘average’ values are allowed to differ between the subjects and therefore the regression coefficient obtained from these analyses only reflects the ‘average’ decrease in the number of events. When adding random slopes to the models, also the development over time is allowed to differ between the subjects. Therefore, an analysis with both a random intercept and random slopes also reflects the increased probability of having an event. This leads to a much better fit and a positive regression coefficient instead of an inverse one.
The interpretation of the regression coefficients of the linear mixed model and the Poisson mixed model are quite straightforward. For example, the interpretation of the relation between education and hypoglycaemic events can be interpreted as following for the linear mixed model (Table 2): Higher educated diabetic patients have 1.14 more events than (on average over time) compared to lower educated diabetic patients. For the Poisson mixed model the regression coefficient can be interpreted as (Table 2): exp (0.66) = 1.93. On average over time, higher educated diabetic patients have an increased prevalence rate of 93% in hypoglycaemic events compared to lower educated diabetic patients. The interpretation of the regression coefficient of the twopart joint mixed model is somewhat more complicated, since the model gives a combined regression coefficient for the binomial process and the Poisson process. However, some researchers have interpreted the regression coefficient of a twopart joint model as being the result for the cases that are above the limit [41] p. 320, [42] p. 503. These cases above the limit would be interpreted in the same way as a Poisson model i.e. higher educated diabetic patients have an increased prevalence rate of 86% in hypoglycaemic events compared to lower educated diabetic patients (exp(0.62) = 1.86). To overcome the problem of the interpretation of a combined regression coefficient, McDonald and Moffit [41] have developed a decomposition technique for the regression coefficient of a twopart joint binomial/normal (tobit) model. The general idea of their decomposition technique is that the regression coefficient combines two interpretations: 1) The difference in the outcome variable of being above the limit, weighted by the probability of being above the limit; and 2) the difference in the probability of being above the limit, weighted by the expected value of the outcome variable if above the limit [43]. In theory, this technique could also be used for twopart joint models that, instead of using a normal distribution, use another distribution such as the Poisson distribution for the values that are above zero.
In the present paper a twopart joint model was used to model the number of hypoglycaemic events obtaining a shared regression coefficient for both the binomial and the Poisson distribution combined. An important reason why one regression coefficient is preferred is that the outcome variable in this example (i.e. the number of hypoglycaemic events) should be seen as one process that cannot be split into two processes with separate regression coefficients. In contrast, sometimes it is better to analyse the data with a twopart separate model, leading to separate regression coefficients for both parts of the process. An example could be the analysis of determinants of smoking behaviour, which can be different for the logistic part of the analysis and the Poisson part. The logistic part of the analysis may need a set of covariates in order to model why some people smoke and others do not smoke, furthermore a different set of covariates may be needed to model how many cigarettes a person will smoke.
Conclusions
This paper showed that the twopart joint mixed model (binomial/Poisson) is a more appropriate method for the analysis of longitudinal data with an excess of zeros when only a random intercept is included into a model. However, in the model with random slopes for time, also the Poisson mixed model performed remarkably well. In addition, more research is needed on the interpretation of the regression coefficients of the longitudinal twopart joint model.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
ASS made contributions to the design, conducted the analysis and interpretation of the data, and drafted the manuscript. TRSH made contributions to the acquisition of the data, and reviewed the article critically for important intellectual content. MRB made contributions to the interpretation of the data, and reviewed the article critically for important intellectual content. MWH made contributions to the interpretation of the data, and reviewed the article critically for important intellectual content. JWRT made substantial contributions to the conception and design and the analysis of data, helped to draft the manuscript, supervised the analysis and interpretation of the data, and reviewed the article critically for important intellectual content. All authors read and approved the final manuscript.
References

Winkelmann R: Econometric analysis of count data 5edition. Springer, Berlin; 2008.

Olsen MK, Schafer JL: A twopart randomeffects model for semicontinuous longitudinal data.
J Am Stat Assoc 2001, 96(454):730745. Publisher Full Text

Berk KN, Lachenbruch PA: Repeated measures with zeros.
Stat Methods Med Res 2002, 11(4):303316. PubMed Abstract  Publisher Full Text

Bin Cheung Y: Zeroinflated models for regression analysis of count data: a study of growth and development.
Stat Med 2002, 21(10):14611469. PubMed Abstract  Publisher Full Text

Ground M, Koch SF: Hurdle models of alcohol and tobacco expenditure in South African households.
S Afr J Econ 2008, 76(1):132143. Publisher Full Text

Karazsia BT, van Dulmen MHM: Regression Models for Count Data: Illustrations using Longitudinal Predictors of Childhood Injury.
J Pediatr Psychol 2008, 33(10):10761084. PubMed Abstract  Publisher Full Text

Lachenbruch PA: Analysis of data with excess zeros.
Stat Methods Med Res 2002, 11(4):297302. PubMed Abstract  Publisher Full Text

Lewsey JD, Thomson WM: The utility of the zeroinflated Poisson and zeroinflated negative binomial models: a case study of crosssectional and longitudinal DMF data examining the effect of socioeconomic status.
Community Dent Oral 2004, 32(3):183189. Publisher Full Text

Madden D: Sample selection versus twopart models revisited: The case of female smoking and drinking.
J Health Econ 2008, 27(2):300307. PubMed Abstract  Publisher Full Text

Cohen AC: Estimating the Parameters of a Modified PoissonDistribution.
J Am Stat Assoc 1960, 55(289):139143. Publisher Full Text

Ghosh SK, Mukhopadhyay P, Lu JC: Bayesian analysis of zeroinflated regression models.
J Stat Plan Infer 2006, 136(4):13601375. Publisher Full Text

Heilbron DC: Generalized linear models for altered zero probabilities and over dispersion in count data. University of California, In San Francisco; 1989.

Kemp AW: Weighted discrepancies and maximum likelihood estimation for discrete distribution.
Commun Statist A—Theory and Methods 1986, 15:783803. PubMed Abstract  Publisher Full Text

Lambert D: ZeroInflated Poisson Regression, with an Application to Defects in Manufacturing.
Technometrics 1992, 34(1):114. Publisher Full Text

Martin DC, Katti SK: Fitting of some contagious distributions to some available data by the maximum likelihood method.
Biometrics 1965, 21:3448. Publisher Full Text

Singh SN: A note on zero inflated Poisson distribution.
Journal of the Indian Statistical Association and Society Manager 1963, 1:140144.

Hardin JW, Hilbe JM: Generalized linear models and extensions. 2nd edition. Stata Press, College Station; 2007.

Long SJ, Freese J: Models for count outcomes. In Regression models for categorical dependent variables using stata. Stata Press, College Station; 2001:223262.

Moulton LH, Curriero FC, Barroso PF: Mixture models for quantitative HIV RNA data.
Stat Methods Med Res 2002, 11(4):317325. PubMed Abstract  Publisher Full Text

Rizopoulos D, Verbeke G, Lesaffre E, Vanrenterghem Y: A twopart joint model for the analysis of survival and longitudinal binary data with excess zeros.
Biometrics 2008, 64(2):611619. PubMed Abstract  Publisher Full Text

Zhou XH: Inferences about population means of health care costs.
Stat Methods Med Res 2002, 11(4):327339. PubMed Abstract  Publisher Full Text

Zhou XH, Tu WZ: Comparison of several independent population means when their samples contain lognormal and possibly zero observations.
Biometrics 1999, 55(2):645651. PubMed Abstract  Publisher Full Text

Hajos TRS, Pouwer F, de Grooth R, Holleman F, Twisk JWR, Diamant M, Snoek FJ: Initiation of insulin glargine in patients with Type 2 diabetes in suboptimal glycaemic control positively impacts healthrelated quality of life. A prospective cohort study in primary care.
Diabetic Med 2011, 28(9):10961102. PubMed Abstract  Publisher Full Text

Twisk JWR: Applied longitudinal data analysis for epidemiology. A practical guide. Cambridge University Press, Cambridge; 2003.

Verbeke G, Molenberghs G: Linear mixed models for longitudinal data. Springer, New York; 2000.

RabeHesketh S, Skrondal A: Multilevel and longitudinal modeling using stata. Stata Press, College Station; 2005.

Skrondal A, RabeHesketh S: Counts. Chapman & Hall/CRC, Boca Raton; 2004. [Generalized latent variable modeling: multilevel, longitudinal, and structural equation models]

Dalrymple ML, Hudson IL, Ford RPK: Finite mixture, zeroinflated Poisson and hurdle models with application to SIDS.

Gurmu S: Generalized hurdle count data regression models.
Econ Lett 1998, 58(3):263268. Publisher Full Text

Lee AH, Wang K, Scott JA, Yau KKW, McLachlan GJ: Multilevel zeroinflated Poisson regression modelling of correlated count data with excess zeros.
Stat Methods Med Res 2006, 15(1):4761. PubMed Abstract  Publisher Full Text

Miranda A: FIML estimation of an endogenous switching model for count data.

Miranda A, RabeHesketh S: Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables.

Terza JV: Estimating count data models with endogenous switching: Sample selection and endogenous treatment effects.
J Econometrics 1998, 84(1):129154. Publisher Full Text

Terza JV, Kenkel DS, Lin TF, Sakata S: Caregiver advice as a preventive measure for drinking during pregnancy: Zeros, categorical outcome responses, and endogeneity.
Health Econ 2008, 17(1):4154. PubMed Abstract  Publisher Full Text

Tooze JA, Grunwald GK, Jones RH: Analysis of repeated measures data with clumping at zero.
Stat Methods Med Res 2002, 11(4):341355. PubMed Abstract  Publisher Full Text

Yau KKW, Lee AH: Zeroinflated Poisson regression with random effects to evaluate an occupational injury prevention programme.
Stat Med 2001, 20(19):29072920. PubMed Abstract  Publisher Full Text

Schwarz G: Estimating Dimension of a Model.
Ann Stat 1978, 6(2):461464. Publisher Full Text

Stata Corporation: Stata Statistical Software. 11th edition. Stata Press, College Station; 2007.

RabeHesketh S, Pickles A, Skrondal A: GLAMM manual. The Berkeley Electronic Press, Berkeley; 2004.

SPSS Inc.: PASW Statistics for windows. 18th edition. SPSS Inc., Chicago; 2009.

Mcdonald JF, Moffitt RA: The Uses of Tobit Analysis.
Rev Econ Stat 1980, 62(2):318321. Publisher Full Text

Roncek DW: Learning More from Tobit Coefficients  Extending a ComparativeAnalysis of Political Protest.
Am Sociol Rev 1992, 57(4):503507. Publisher Full Text
Prepublication history
The prepublication history for this paper can be accessed here: