Abstract
Background
Methodological development of joint models of longitudinal and survival data has been rapid in recent years; however, their full potential in applied settings is yet to be fully explored. We describe a novel use of a specific association structure, linking the two component models through the subject-specific intercept, and thus extend joint models to account for measurement error in a biomarker, even when only the baseline value of the biomarker is of interest. This is a common occurrence in registry data sources, where repeated measurements often exist but are simply ignored.
Methods
The proposed specification is evaluated through simulation and applied to data from the General Practice Research Database, investigating the association between baseline systolic blood pressure (SBP) and the time to stroke in a cohort of obese patients with type 2 diabetes mellitus.
Results
By directly modelling the longitudinal component we reduce bias in the hazard ratio for the effect of baseline SBP on the time to stroke, showing the large potential to improve on previous prognostic models which use only observed baseline biomarker values.
Conclusions
The joint modelling of longitudinal and survival data is a valid approach to account for measurement error in the analysis of a repeatedly measured biomarker and a time-to-event outcome. User-friendly Stata software is provided.
Background
Many biomarkers such as systolic blood pressure (SBP) have been identified as key prognostic factors in the development and validation of cardiovascular risk scores [1,2]. However, often only baseline values of these biomarkers are used, despite the existence of repeated measures, especially in registry sources such as the General Practice Research Database (GPRD) [3]. Furthermore, biomarkers are often measured with error. Failing to adjust for such measurement error leads to estimates being biased towards the null [4].
A joint model of longitudinal and survival data allows us to investigate the relationship between a repeatedly measured biomarker subject to measurement error, such as SBP, and the time to an event of interest, such as time to non-fatal stroke. The approach which has dominated the methodological literature links the two component submodels using shared random effects [5,6]. From a classical perspective, these methods require computationally intensive numerical integration, which is difficult to implement; however, with the recent introduction of user-friendly software in R [7,8] and Stata [9], these models are starting to find their place in applied research [10,11]. Even so, the potential uses and forms of the association parameters, linking the longitudinal and survival components, are yet to be fully explored. Alternatively, many authors have proposed a Bayesian approach, providing readily available BUGS code to implement the models [12,13].
The most commonly used association structures include the current value parameterisation [5], whereby the value of the biomarker, as estimated by the longitudinal submodel, is linked directly to survival, and the first derivative, or slope, parameterisation [10], which allows investigation of the effect that the rate of change of the biomarker has on survival.
There is often interest in predicting prognosis based on an initial baseline measurement [1,2]. In this paper we investigate the use of the joint model framework with a random intercept association structure as an approach to adjust for measurement error, inherent in biomarkers such as SBP. By incorporating the repeated measures we thus make the most efficient use of the data available. In particular, as a prognostic model for future patients, we describe how this framework can be used to predict survival for new patients who will only have baseline measurements.
Methods
A joint model of longitudinal and survival data consists of two component submodels: the longitudinal submodel and the survival submodel. We define a set of baseline covariates, U_{i}, which can potentially differ between submodels. The longitudinal submodel allows us to model the trajectory of a repeatedly measured biomarker over time, adjusting for baseline covariates. The standard approach assumes a linear mixed effects model [14]. We observe

Y_{i}(t_{ij}) = W_{i}(t_{ij}) + ε_{ij}     (1)

with

W_{i}(t_{ij}) = X_{i}(t_{ij})β + Z_{i}(t_{ij})b_{i} + u_{i}^{T}δ     (2)

where Y_{i}(t_{ij}) is the observed longitudinal response for the i^{th} patient measured at the j^{th} time point. W_{i}(t_{ij}) is our true unobserved trajectory function, consisting of design matrices X_{i}(t_{ij}) and Z_{i}(t_{ij}) for the fixed and random effects, β and b_{i}, respectively, where b_{i} ∼ MVN(0,Σ). We can incorporate flexibility here by allowing both X_{i}(t_{ij}) and Z_{i}(t_{ij}) to contain restricted cubic spline functions of measurement time [15]. We also have a vector of baseline covariates, u_{i} ∈ U_{i}, with corresponding regression coefficients, δ. Finally, ε_{ij} is our normally distributed measurement error with constant variance σ_{e}^{2}. We further assume that the random effects and error term are independent, and that cov(ε_{ij}, ε_{ik}) = 0 (where j ≠ k).
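As a concrete illustration, the observation model above (observed response equals true trajectory plus normally distributed error) can be sketched in a few lines of Python; this is not the Stata implementation used by the authors, and all parameter values here are purely illustrative, for the random intercept and slope special case.

```python
import numpy as np

rng = np.random.default_rng(1)

n_pat, n_obs = 5, 4
t = np.tile(np.arange(n_obs, dtype=float), (n_pat, 1))  # annual measurement times

beta0, beta1 = 110.0, 1.5          # hypothetical fixed intercept and slope
Sigma = np.array([[100.0, 2.5],    # hypothetical random-effects covariance matrix
                  [2.5, 1.0]])
sigma_e = 5.0                      # measurement-error standard deviation

b = rng.multivariate_normal(np.zeros(2), Sigma, size=n_pat)  # b_i ~ MVN(0, Sigma)
W = (beta0 + b[:, [0]]) + (beta1 + b[:, [1]]) * t            # true trajectory W_i(t_ij)
Y = W + rng.normal(0.0, sigma_e, size=W.shape)               # observed response Y_i(t_ij)
```

Note that at t = 0 the true value reduces to the subject-specific intercept, which is the quantity exploited by the association structure discussed below.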
The time-to-event submodel usually takes the form of a proportional hazards model

h_{i}(t) = h_{0}(t) exp{α_{1}W_{i}(t) + ϕ^{T}v_{i}}     (3)

with h_{0}(t) the baseline hazard function and v_{i} ∈ U_{i} a vector of baseline covariates with corresponding log hazard ratios, ϕ. The parameter α_{1} is commonly named the association parameter, indicating the strength of association between the longitudinal biomarker and the time to event. If α_{1} = 0, then the joint model reduces to the two separate models and fitting a joint model will not prove advantageous. This parameterisation assumes the hazard depends on the biomarker through its current value, and is one of many ways to link the two component submodels. The baseline hazard function, h_{0}(t), can be modelled using a parametric distribution, most frequently the Weibull, or less restrictively using flexible parametric survival models [16], or of course can be left unspecified [17]. However, an unspecified baseline hazard function leads to underestimation of the standard errors of parameter estimates [18], and consequently bootstrapping is required to obtain appropriate standard errors.
For illustration, we let W_{i}(t_{ij}), the longitudinal submodel, be a linear function of time in which the intercept and slope vary between subjects,

W_{i}(t_{ij}) = (β_{0} + b_{0i}) + (β_{1} + b_{1i})t_{ij}     (4)

giving a model with a random intercept and random linear slope. As an alternative to the linking of the component models in Equation (3), we may link elements of the trajectory function, W_{i}(t_{ij}), to the hazard directly. For example, we can link the subject-specific baseline biomarker values through the intercept association structure,

h_{i}(t) = h_{0}(t) exp{α_{2}(β_{0} + b_{0i}) + ϕ^{T}v_{i}}     (5)

where α_{2} now estimates the strength of the association between the patient-specific baseline biomarker values, as estimated by the longitudinal submodel, and the time to event. In this way we let the risk of an event depend directly on the subject-specific value of the biomarker at time t = 0.
If interest lies in prediction when a new patient is observed at baseline, the issue of measurement error can be accounted for through this approach. A further benefit of this association structure lies in the evaluation of the joint likelihood. Under most parametric survival submodels (e.g. the Weibull distribution) and time-dependent association structures (e.g. the current value), numerical quadrature is required to integrate out the random effects and, under Equation (3), nested quadrature is also required to evaluate the cumulative hazard function. Under the time-independent association structure of Equation (5), we avoid this nested quadrature because the cumulative hazard function has an analytically tractable form, which provides computational benefits.
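To make the computational point concrete, the following sketch (hypothetical parameter values, Weibull baseline hazard h_{0}(t) = λγt^{γ−1}) checks the closed-form cumulative hazard under the time-independent association of Equation (5) against brute-force numerical integration; under a time-dependent association the exp(·) term would sit inside the integral and no closed form is generally available.

```python
import math

lam, gam = 0.1, 1.5        # Weibull baseline hazard: h0(t) = lam * gam * t**(gam - 1)
alpha2, w0 = 0.25, 1.2     # association parameter and subject-specific intercept (hypothetical)
phi_v = 0.5                # baseline covariate contribution (hypothetical)

eta = alpha2 * w0 + phi_v  # linear predictor, constant in time under Equation (5)

def hazard(t):
    return lam * gam * t ** (gam - 1) * math.exp(eta)

def cum_hazard_numeric(t, n=200_000):
    # simple midpoint rule; the midpoint avoids evaluating at t = 0
    h = t / n
    return sum(hazard((i + 0.5) * h) for i in range(n)) * h

t = 5.0
H_closed = lam * t ** gam * math.exp(eta)  # exp(eta) factors out of the integral
H_numeric = cum_hazard_numeric(t)
```

Because exp(η) is constant it factors out of the integral, so the cumulative hazard is simply λt^{γ}exp(η); the numerical check agrees to several decimal places.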
As discussed in the Background, this model formulation is an alternative to the standard approach of using the observed baseline biomarker value,

h_{i}(t) = h_{0}(t) exp{α_{3}Y_{0i} + ϕ^{T}v_{i}}     (6)

where Y_{0i} is the observed baseline biomarker value and α_{3} is the log hazard ratio for a one unit increase in it. Although simple to fit, Equation (6) does not account for potential measurement error in Y_{0i}.
Simulation study
In order to assess the performance of the standard approach of including observed biomarker values, compared with the full joint model described above, we evaluated both through simulation [19]. For ease of exposition we assume a longitudinal model with random intercept and slope, and a continuous biomarker of interest with

W_{i}(t_{ij}) = (β_{0} + b_{0i}) + (β_{1} + b_{1i})t_{ij}

where β_{0} = β_{1} = 0, b_{0i} ∼ N(0,1) and b_{1i} ∼ N(0,0.25^{2}), with correlation between (b_{0i},b_{1i}) of 0.25. Observed measurements are then generated from Y_{ij} = W_{i}(t_{ij}) + ε_{ij}, with ε_{ij} ∼ N(0,σ_{e}^{2}), where t_{ij} is the time of the j^{th} measurement for the i^{th} patient. We vary σ_{e} ∈ {0.1, 0.5, 1}.
We assume a Weibull baseline hazard function with λ = 0.1 and γ = 1.5. A binary variable, X_{1}, representing treatment group, was generated from Bin(1,0.5), with an associated log hazard ratio of ϕ_{1} = 0.5. A continuous covariate, X_{2}, representing age at baseline, was generated from N(65,12) with an associated log hazard ratio of ϕ_{2} = 0.01. We then generate survival times from a Weibull distribution where the hazard is defined as h(t) = h_{0}(t) exp(α_{2}β_{0i} + ϕ_{1}X_{1} + ϕ_{2}X_{2}), with β_{0i} = β_{0} + b_{0i} and α_{2} the association parameter, indicating the effect of a one unit increase in the value of the subject-specific intercept on the risk of event. We vary α_{2} ∈ {−0.5, −0.25, 0.25, 0.5}. Each simulated dataset contained 300 patients with up to 5 annual measurements (including baseline), and administrative censoring at 5 years. This corresponds to an approximate 18.9% survival proportion at 5 years (calculated at the mean of the covariate values, with β_{0i} = 0). To each dataset we fit a Weibull proportional hazards model including the observed baseline measurement, and a Weibull-based joint model with the random intercept association structure. We adjust for age and treatment in the survival submodel. Each scenario is simulated 1000 times.
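With a linear predictor that is constant in time, such survival times can be generated by inverting the Weibull survivor function, S(t) = exp(−λt^{γ}exp(η)). The sketch below, in Python rather than the Stata used by the authors, mirrors the stated parameter values for a single scenario (α_{2} = 0.25).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
lam, gam = 0.1, 1.5                    # Weibull baseline hazard parameters
phi1, phi2, alpha2 = 0.5, 0.01, 0.25   # log hazard ratios and association parameter

x1 = rng.binomial(1, 0.5, n)           # treatment group
x2 = rng.normal(65, 12, n)             # age at baseline
b0 = rng.normal(0.0, 1.0, n)           # subject-specific intercept (beta0 = 0)

eta = alpha2 * b0 + phi1 * x1 + phi2 * x2
u = rng.uniform(size=n)
# S(t) = exp(-lam * t**gam * exp(eta)); setting S(T) = U and solving for T:
T = (-np.log(u) / (lam * np.exp(eta))) ** (1.0 / gam)

time = np.minimum(T, 5.0)              # administrative censoring at 5 years
event = (T <= 5.0).astype(int)
```

Each simulated dataset then supplies (time, event) to the survival submodel and the repeated measurements to the longitudinal submodel.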
To illustrate the varying measurement error standard deviations used in the simulation scenarios, Figure 1 shows observed longitudinal measurements from the same 100 patients with σ_{e} ∈ {0.1, 0.5, 1} and α_{2} = 0.25. As the measurement error standard deviation increases, so does the variability in the observed biomarker values.
Figure 1. Example simulated observed longitudinal measurements with varying measurement error standard deviation.
The GPRD cohort
The General Practice Research Database (GPRD) Group has obtained ethical approval from a Multicentre Research Ethics Committee (MREC) for all purely observational research using GPRD data; namely, studies which do not include patient involvement. The core work of the GPRD is covered by MREC approval granted by the Trent Multicentre Research Ethics Committee (REC reference number 05/MRE04/87), and this study was approved by the GPRD Independent Scientific Advisory Committee (ISAC) (protocol number 09_094). This study is based in part on data from the GPRD obtained under licence from the UK Medicines and Healthcare products Regulatory Agency (MHRA). However, the interpretation and conclusions contained in this study are those of the authors alone.
The example cohort used to illustrate the methods in this paper consists of 4,850 obese patients diagnosed with type 2 diabetes mellitus. We have 107,347 measurements of SBP, with a maximum follow-up of 22 years. There were 278 stroke events observed.
In Figure 2 we show the observed SBP measurements for 9 randomly selected patients who had at least 10 measurements, illustrating some non-linear trajectories. To accommodate such non-linearities we can use restricted cubic splines in the linear mixed effects submodel. In particular, we specify the following longitudinal submodel

Y_{i}(t_{ij}) = W_{i}(t_{ij}) + ε_{ij}

with

W_{i}(t_{ij}) = β_{0} + s_{F}(t_{ij};k_{F})^{T}β_{F} + b_{0i} + s_{R}(t_{ij};k_{R})^{T}b_{Ri} + u_{i}^{T}δ
Figure 2. Longitudinal response measurements for SBP for 9 randomly selected patients who had at least 10 measurements. The dashed line represents the fitted longitudinal trajectories based on the joint model.
where s_{F}(t_{ij};k_{F}) is the restricted cubic spline basis of measurement time with corresponding fixed effects, β_{F}, and knot locations k_{F}, and s_{R}(t_{ij};k_{R}) is the restricted cubic spline basis of measurement time with corresponding random effects, b_{Ri}, and knot locations k_{R}.
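For reference, one common parameterisation of the restricted cubic spline basis (a linear term plus one cubic term per interior knot, constrained to be linear beyond the boundary knots) can be sketched as follows; the knot locations shown are illustrative, not those selected in the analysis.

```python
import numpy as np

def rcs_basis(t, knots):
    """Restricted cubic spline basis of t: a linear column plus one column per
    interior knot, constrained to be linear beyond the boundary knots."""
    t = np.asarray(t, dtype=float)
    k = np.sort(np.asarray(knots, dtype=float))
    kmin, kmax = k[0], k[-1]
    cols = [t]                               # first basis column is linear in t
    for kj in k[1:-1]:                       # one cubic term per interior knot
        lam_j = (kmax - kj) / (kmax - kmin)
        cols.append(np.maximum(t - kj, 0) ** 3
                    - lam_j * np.maximum(t - kmin, 0) ** 3
                    - (1 - lam_j) * np.maximum(t - kmax, 0) ** 3)
    return np.column_stack(cols)

# with 5 knots the basis has 4 columns, i.e. 4 degrees of freedom
B = rcs_basis(np.linspace(0, 22, 50), knots=[0, 2, 5, 10, 22])
```

The constraints are chosen so that the cubic and quadratic terms cancel outside the boundary knots, giving sensible linear behaviour where data are sparse.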
Preliminary modelling of the longitudinal data can be conducted to guide model selection, in particular, the degrees of freedom for the spline terms capturing the underlying longitudinal trajectory over time.
To allow flexibility in the survival submodel we use the flexible parametric survival model [16,20], which models the baseline log cumulative hazard function using restricted cubic splines. We can once again undertake a separate analysis of the survival data alone to inform model selection. In particular, we can use the AIC and BIC to guide the selection of the number of degrees of freedom used to capture the baseline hazard function, following Rutherford et al. (2013) [21]. Our final joint model is then

log[H_{i}(t)] = s(log(t);γ,k_{S}) + α_{2}(β_{0} + b_{0i}) + ϕ^{T}v_{i}

where the baseline log cumulative hazard function, log[H_{0}(t)], is expanded into a restricted cubic spline function of log(t), s(log(t);γ,k_{S}), with knot locations k_{S} and coefficient vector γ. This framework has recently been incorporated into a joint model [22]. In each submodel we adjust for the baseline effects of age, sex and BMI. We fit the joint model with the random intercept association structure shown in Equation (5). For comparison, we also apply the standard flexible parametric survival model, adjusting for observed baseline SBP, age, sex and BMI.
Results
Simulation study results
Bias and coverage estimates for the association parameter are presented in Table 1. Under the standard Weibull model, we observe increasing bias in the estimates of the association between baseline biomarker values and survival, as the magnitude of the measurement error standard deviation, σ_{e}, increases. In parallel we observe very poor coverage probabilities under the Weibull approach. For example, with α = 0.5 and σ_{e} = 1, we observe bias of 0.261 (percentage bias of 52.2%) and coverage of 0.4%. In contrast, under the joint modelling approach we observe minimal bias and coverage probabilities close to 95% across all scenarios.
Table 1. Simulation results for the association parameter, α
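The performance measures reported in Table 1 are computed from the replicate-level estimates in the usual way; a minimal sketch of the calculation, using illustrative numbers rather than the paper's results:

```python
import numpy as np

def bias_and_coverage(est, se, truth, z=1.96):
    """Mean bias of the estimates and empirical coverage of the nominal
    95% Wald confidence intervals across simulation replicates."""
    est = np.asarray(est, dtype=float)
    se = np.asarray(se, dtype=float)
    bias = est.mean() - truth
    covered = (est - z * se <= truth) & (truth <= est + z * se)
    return bias, covered.mean()

# three illustrative replicate-level estimates for a true value of 0.5
bias, coverage = bias_and_coverage([0.50, 0.60, 0.40], [0.10, 0.05, 0.02], truth=0.5)
```

Systematic attenuation of the estimates shows up as negative bias towards the null, and standard errors that are too small show up as coverage well below 95%.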
Analysis of GPRD cohort
We now present the analysis of the GPRD cohort. In all analyses we use SBP/10 so that a unit increase in SBP/10 represents a clinically meaningful 10 unit increase in SBP. Our primary interest is the association between baseline SBP and the risk of stroke. Baseline (t_{ij} = 0) corresponds to when each patient entered the cohort, i.e. the time of first SBP measurement.
We began by assuming a random intercept and selecting the degrees of freedom for the fixed spline terms using the AIC and BIC. In this case, both selected five degrees of freedom for s_{F}(t_{ij};k_{F}), with an AIC of 417565.8 and BIC of 417604.1. For the random splines of time we assumed a linear term, which equates to one spline term for s_{R}(t_{ij};k_{R}). This allows a flexible form for the variation in SBP over time. We further adjust for age, sex and Body Mass Index (BMI) at baseline.
For the flexible parametric survival submodel, both AIC and BIC selected two degrees of freedom, with an AIC of 2408.7173 and BIC of 2430.483. If one degree of freedom had been selected, then this would be equivalent to a Weibull survival model.
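The information criteria used above are the standard penalised log-likelihood quantities; a small helper illustrating the calculation (the log-likelihood and counts below are hypothetical, not values from the fitted models):

```python
import math

def aic_bic(loglik, n_params, n_obs):
    """Akaike and Bayesian information criteria; lower values are preferred.
    The BIC penalises extra degrees of freedom more heavily for large n_obs."""
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * math.log(n_obs)
    return aic, bic

# e.g. scoring one candidate number of spline degrees of freedom
aic, bic = aic_bic(loglik=-1200.0, n_params=4, n_obs=4850)
```

Candidate models with different spline degrees of freedom are fitted and the one minimising the criterion is carried forward.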
Results are presented in Table 2. Under the standard flexible parametric survival model we observe a hazard ratio for a ten unit increase in baseline SBP of 1.111 (95% CI: 1.051, 1.172). Under a joint model we observe an increased hazard ratio of 1.198 (95% CI: 1.107, 1.298). The increased effect using a joint model is consistent with that observed in the simulation study, i.e. that the bias in the standard survival model is towards the null. The fitted trajectories seen in Figure 2 appear to capture the subject-specific measurements well, although some panels appear to require only a linear trend.
Table 2. Results from applying a flexible parametric proportional hazards model adjusting for observed baseline systolic blood pressure, and a full joint model using the intercept association structure
We illustrate how the bias from the standard approach increases with SBP in Figure 3, which shows predictions from both models for a female patient aged 60 with low (90), medium (130) and high (200) baseline SBP measurements. To quantify the differences: at 10 years under the standard model we observe a survival probability of 0.881 for an SBP of 200, compared with 0.816 under the full joint model.
Figure 3. Predicted survival from the flexible parametric survival model and joint model, for a female, aged 60 years, BMI of 30, with SBP of 90, 130 or 200.
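Predictions such as those in Figure 3 are obtained by transforming back from the log cumulative hazard scale, since S(t) = exp(−H(t)). The sketch below uses a single log-time term (which corresponds to a Weibull shape) and wholly hypothetical coefficients, purely to illustrate the transformation; it does not reproduce the fitted model.

```python
import math

def predict_survival(t, gamma0, gamma1, eta):
    """S(t) = exp(-H(t)) with log H(t) = gamma0 + gamma1*log(t) + eta.
    One degree of freedom on the log-time scale corresponds to a Weibull model;
    all coefficient values supplied here are hypothetical."""
    log_H = gamma0 + gamma1 * math.log(t) + eta
    return math.exp(-math.exp(log_H))

# a larger linear predictor (e.g. higher baseline SBP) lowers predicted survival
s_low = predict_survival(10.0, gamma0=-4.0, gamma1=1.2, eta=0.11 * 9.0)
s_high = predict_survival(10.0, gamma0=-4.0, gamma1=1.2, eta=0.11 * 20.0)
```

Under the joint model the subject-specific intercept enters the linear predictor in place of the observed baseline value, which is what separates the two prediction curves in Figure 3.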
Discussion
A wealth of patient data is becoming available in registry sources such as the GPRD, providing extensive opportunities to utilise the joint modelling framework. We have shown that by incorporating repeated measures of a biomarker within a unified joint model framework, we reduce bias due to measurement error, even when only the baseline level of the biomarker is predictive of survival. As illustrated in the simulation study, ignoring measurement error in biomarkers such as blood pressure can lead to a marked underestimation of covariate effects. In our application, through the use of restricted cubic splines in the linear mixed effects submodel, we can model highly nonlinear trajectories over time, compared to linear slope models. Furthermore, the flexible parametric survival submodel can also capture complex baseline hazard functions, an important component when predicting survival at the patient level [22].
Given that, to our knowledge, all current cardiovascular risk scores use only baseline measures, with no adjustment for measurement error, the prospects of utilising this framework to improve prognostic risk scores are substantial. Predicting survival for a new patient using this framework follows naturally, as often only a first baseline biomarker observation will be available. However, such a modelling approach also allows a dynamic risk prediction approach to be adopted, whereby a patient’s estimated future risk is updated as each new biomarker value is obtained [23]. Such an approach could enable response to treatment to be monitored and patients counselled accordingly.
In the analysis of the GPRD cohort, we incorporated flexibility in both the longitudinal submodel, through the use of restricted cubic splines, and the flexible parametric survival submodel. Given that both submodels require choosing the number of degrees of freedom, a simple sensitivity analysis can be undertaken to assess knot locations and number of knots. We showed recently that the flexible parametric survival submodel is very robust to both knot placement and number of knots within a joint model framework [22], and furthermore, an extensive simulation study conducted by Rutherford et al. (2013) showed excellent performance of the flexible parametric model in capturing simple and complex baseline hazard functions [21]. Furthermore, given that primary interest was in the survival component and the estimate of association, modelling the longitudinal component with a sensible functional form will often provide an improved estimate compared with the simplistic approach of separate modelling.
In this paper we have concentrated on a specific association structure linking the two component submodels; however, it may be of interest to link multiple components of a biomarker’s trajectory to the time to an event of interest. For example, recent work by Rothwell et al. (2010) [24] has shown that not only baseline blood pressure but also its variability over time are important predictors of cardiovascular events. Furthermore, we have only compared the standard approach of adjusting for observed baseline biomarker values with the full joint model. It would be of interest to compare alternative approaches for adjusting for measurement error, not only in baseline biomarkers, but also under a time-dependent association structure [25,26].
Extensions to the modelling framework include incorporating multiple biomarkers. In particular, in our example we modelled SBP over time, whilst adjusting for baseline BMI. It may be of interest to model not only SBP but also the interrelationships between different biomarkers such as BMI, and how they are related to an event of interest [13].
To facilitate the use of the methods in practice, user-friendly Stata software, written by the first author, is available, offering a variety of survival model choices and association structures, including those discussed in this article [9,27]. To illustrate the computational aspects of the framework, the joint model applied to the cohort took just over 13 minutes to converge on a standard laptop computer.
Conclusion
The joint modelling of longitudinal and survival data is a valid approach to account for measurement error in the analysis of a repeatedly measured biomarker and a time-to-event outcome. User-friendly Stata software is provided.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors were involved in the conception and design of the project. MJC conducted the simulation study, analysed the clinical dataset and wrote the first draft of the manuscript. PCL and KRA both revised the manuscript. All authors read and approved the final manuscript.
Acknowledgements
MJC is funded by a National Institute for Health Research (NIHR) Doctoral Fellowship (DRF201205409) and KRA is partially supported as a NIHR Senior Investigator (NI51050810061).
The cohort of obese patients with type 2 diabetes mellitus was obtained from the General Practice Research Database (GPRD) under Independent Scientific Advisory Committee (ISAC)-approved protocol 09_094, and was funded by a National Institute for Health Research (NIHR) Health Technology Assessment (HTA) Programme project grant (07/85/02).
The authors would like to thank Monica Hernández for initial preparatory work on the data from the GPRD cohort of patients, Roberta Ara for useful initial discussions regarding antiobesity treatment, and finally the reviewers whose comments greatly improved the paper.
References

Conroy RM, Pyorala K, Fitzgerald AP, Sans S, Menotti A, De Backer G, De Bacquer D, Ducimetiere P, Jousilahti P, Keil U, Njolstad I, Oganov RG, Thomsen T, Tunstall-Pedoe H, Tverdal A, Wedel H, Whincup P, Wilhelmsen L, Graham IM, SCORE project group: Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project. Eur Heart J 2003, 24(11):987-1003.

Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P: Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007, 335(7611):136. http://dx.doi.org/10.1136/bmj.39261.471806.55

Ara R, Blake L, Gray L, Hernandez M, Crowther M, Dunkley A, Warren F, Jackson R, Rees A, Stevenson M, Abrams K, Cooper N, Davies M, Khunti K, Sutton A: What is the clinical effectiveness and cost-effectiveness of using drugs in treating obese patients in primary care? A systematic review. Health Technol Assess 2012, 16(5):1-202. http://dx.doi.org/10.3310/hta16050

Prentice RL: Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika 1982, 69(2):331-342. http://www.jstor.org/stable/2335407

Wulfsohn MS, Tsiatis AA: A joint model for survival and longitudinal data measured with error. Biometrics 1997, 53:330-339.

Henderson R, Diggle P, Dobson A: Joint modelling of longitudinal measurements and event time data. Biostatistics 2000, 1(4):465-480.

Rizopoulos D: JM: An R package for the joint modelling of longitudinal and time-to-event data. J Stat Softw 2010, 35(9):1-33. http://www.jstatsoft.org/v35/i09

Philipson P, Sousa I, Diggle P, Williamson P, Kolamunnage-Dona R, Henderson R: joineR - Joint modelling of repeated measurements and time-to-event data. 2012. http://cran.r-project.org/web/packages/joineR/index.html

Crowther MJ, Abrams KR, Lambert PC: Joint modeling of longitudinal and survival data.

Wolbers M, Babiker A, Sabin C, Young J, Dorrucci M, Chêne G, Mussini C, Porter K, Bucher HC, CASCADE: Pretreatment CD4 cell slope and progression to AIDS or death in HIV-infected patients initiating antiretroviral therapy - the CASCADE collaboration: a collaboration of 23 cohort studies. PLoS Med 2010, 7(2):e1000239. http://dx.doi.org/10.1371/journal.pmed.1000239

Ibrahim JG, Chu H, Chen LM: Basic concepts and methods for joint models of longitudinal and survival data. J Clin Oncol 2010, 28(16):2796-2801.

Guo X, Carlin BP: Separate and joint modeling of longitudinal and event time data using standard computer packages. Am Stat 2004, 58:16-24. http://www.jstor.org/stable/27643494

Rizopoulos D, Ghosh P: A Bayesian semiparametric multivariate joint model for multiple longitudinal outcomes and a time-to-event. Stat Med 2011, 30(12):1366-1380.

Laird NM, Ware JH: Random-effects models for longitudinal data. Biometrics 1982, 38(4):963-974.

Durrleman S, Simon R: Flexible regression models with cubic splines. Stat Med 1989, 8(5):551-561.

Royston P, Lambert PC: Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. College Station: Stata Press; 2011.

Hsieh F, Tseng YK, Wang JL: Joint modeling of survival and longitudinal data: likelihood approach revisited. Biometrics 2006, 62(4):1037-1043. http://dx.doi.org/10.1111/j.1541-0420.2006.00570.x

Burton A, Altman DG, Royston P, Holder RL: The design of simulation studies in medical statistics. Stat Med 2006, 25(11):4279-4292.

Royston P, Parmar MKB: Flexible parametric proportional hazards and proportional odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Stat Med 2002, 21(15):2175-2197.

Rutherford MJ, Crowther MJ, Lambert PC: The use of restricted cubic splines to approximate complex hazard functions in the analysis of time-to-event data: a simulation study. J Stat Comput Simul 2013. http://www.tandfonline.com/doi/abs/10.1080/00949655.2013.845890

Crowther MJ, Abrams KR, Lambert PC: Flexible parametric joint modelling of longitudinal and survival data. Stat Med 2012, 31(30):4456-4471. http://dx.doi.org/10.1002/sim.5644

Rizopoulos D: Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to-event data. Biometrics 2011, 67(3):819-829. http://dx.doi.org/10.1111/j.1541-0420.2010.01546.x

Rothwell PM, Howard SC, Dolan E, O’Brien E, Dobson JE, Dahlöf B, Sever PS, Poulter NR: Prognostic significance of visit-to-visit variability, maximum systolic blood pressure, and episodic hypertension. Lancet 2010, 375(9718):895-905. http://dx.doi.org/10.1016/S0140-6736(10)60308-X

Zucker DM: A pseudo partial likelihood method for semiparametric survival regression with covariate errors. J Am Stat Assoc 2005, 100(472):1264-1277.

Liao X, Zucker DM, Li Y, Spiegelman D: Survival analysis with error-prone time-varying covariates: a risk set calibration approach. Biometrics 2011, 67:50-58. http://dx.doi.org/10.1111/j.1541-0420.2010.01423.x

Crowther MJ: STJM: Stata module to fit shared parameter joint models of longitudinal and survival data. Statistical Software Components, Boston College Department of Economics; 2012. http://ideas.repec.org/c/boc/bocode/s457339.html