Center for Disease Prevention and Health Interventions for Diverse Populations, Ralph H. Johnson Veterans Affairs Medical Center, Charleston, SC, USA
Division of Biostatistics & Epidemiology, Medical University of South Carolina, 135 Cannon St, Charleston, SC, 29425, USA
Center for Health Disparities Research, Division of General Internal Medicine, Medical University of South Carolina, Charleston, SC, USA
Department of Clinical Pharmacy and Outcome Sciences, South Carolina College of Pharmacy, Charleston, SC, USA
Abstract
Background
With the current focus on personalized medicine, patient/subject level inference is often of key interest in translational research. As a result, random effects models (REM) are becoming popular for patient level inference. However, for very large data sets characterized by large sample size, it can be difficult to fit REM using commonly available statistical software such as SAS, since these models require inordinate amounts of computer time and memory allocations beyond what are available, preventing model convergence. For example, in a retrospective cohort study of over 800,000 Veterans with type 2 diabetes with longitudinal data over 5 years, fitting REM via generalized linear mixed modeling using currently available standard procedures in SAS (e.g. PROC GLIMMIX) was very difficult, and the same problems exist in Stata’s gllamm or R’s lme packages. Thus, this study proposes and assesses the performance of a meta regression approach and compares it with methods based on sampling of the full data.
Data
We use both simulated and real data from a national cohort of Veterans with type 2 diabetes (n=890,394) which was created by linking multiple patient and administrative files resulting in a cohort with longitudinal data collected over 5 years.
Methods and results
The outcome of interest was mean annual HbA1c measured over a 5-year period. Using this outcome, we compared parameter estimates from the proposed random effects meta regression (REMR) with estimates based on simple random sampling and VISN (Veterans Integrated Service Networks) based stratified sampling of the full data. Our results indicate that REMR provides parameter estimates that are less likely to be biased, with tighter confidence intervals, when the VISN level estimates are homogeneous.
Conclusion
When the interest is to fit REM in repeated measures data with very large sample size, REMR can be used as a good alternative. It leads to reasonable inference for both Gaussian and non-Gaussian responses if parameter estimates are homogeneous across VISNs.
Background
Many translational research projects are generating very large data sets (VLDS) which require fitting complex models to answer questions of public health interest. Datasets can be considered “very large” because of large numbers of study subjects or units of analysis and/or large numbers of variables, and both situations present challenges during the analysis phase, especially when observations are clustered at some level (e.g., longitudinal data). An example of a VLDS with a large number of observations is a two-year group randomized trial designed to assess the impact of a quality improvement intervention on colorectal cancer screening in primary care practices. Electronic medical record data were obtained from a sample of 68,150 patients from 32 primary care practices in 19 US states, followed monthly over a 2-year time period.
Fitting complex models for these types of data sets can be difficult: parameter estimation may require inordinate amounts of computer time or memory allocations beyond what is available, or the data structures may prevent model convergence, even within the state-of-the-art computational infrastructures of medium-sized research facilities such as ours. For instance, fitting complicated generalized linear mixed models (GLMMs) for data from the examples above using software such as SAS 9.2.2 (Cary, NC), Stata 11 (College Station, TX) or R (R2.11.1) may not be possible using desktop computers typically available to researchers within our institutions (a 64-bit server with twelve 4GB, 667MHz dual-ranked DIMMs for a total of 48GB of RAM). Although a few methods for modeling VLDSs exist, current practice mainly involves data reduction processes, which usually result in loss of information.
Recently, we have been working on a longitudinal study of the trajectory of HbA1c control in patients with type 2 diabetes treated within the Veterans Administration (VA) healthcare setting, and we faced the problem of fitting GLMMs on over 890,000 patients, clustered in 23 Veterans Integrated Service Networks (VISNs) and followed over 5 years. Fitting a mixed effects logistic regression model with over 30 covariates for making individual level inference resulted in an out-of-memory error on a 64-bit server with twelve 4GB, 667MHz dual-ranked DIMMs for a total of 48GB of RAM.
In SAS procedures such as Proc GLIMMIX, fitting mixed effects models with the recommended standard syntax of including the subject ID in a CLASS statement was not possible. With the standard syntax, the procedure ran out of memory even when we attempted to fit the simplest model, one with only a random intercept. With ad hoc modifications to the standard syntax (see the Discussion section), however, we were able to fit the model, although it took longer to run. Similar problems were observed with Stata’s gllamm and R’s lme4 packages.
With the current focus on personalized medicine, patient/subject level inference is often of key interest in translational research. GLMMs are a very rich class of models that are traditionally used to make such individual-level inference by breaking down the total variation in the observed response into within-subject and between-subject variation. These models are also used to incorporate natural heterogeneity in the estimates due to unmeasured explanatory variables.
There are some recent Bayesian methods proposed for fitting parametric random effects models to VLDSs
An alternative is a 2-stage “data squashing” method
Motivated by the scarcity of work in this area and the challenge we faced with the analysis of our VLDS, we propose a random effects meta regression (REMR) approach in which VISN-specific estimates are combined via meta regression. We make comparisons with two other approaches: (1) average estimates from analysis of 1000 data sets obtained via simple random sampling (SRS) of the original data, with simulated 95% confidence intervals (CIs), and (2) weighted average estimates from analysis of 1000 data sets obtained via VISN-stratified random sampling (StRS), with simulated 95% CIs. Using simulated data, we also assess biases present within each approach, noting whether they provide inferences equivalent to those that would be obtained from analysis of the full data. The paper is organized as follows: section 2 presents the motivating example; section 3 describes the details of the statistical methods; section 4 presents the results of the analysis; and section 5 discusses the findings.
Motivating example
A national cohort of Veterans with type 2 diabetes was created by linking patient and administrative files from the Veterans Health Administration (VHA) National Patient Care and Pharmacy Benefits Management (PBM) databases. Veterans were included in the cohort if they had type 2 diabetes defined by two or more International Classification of Diseases, Ninth Revision (ICD-9) codes for diabetes (250, 357.2, 362.0, and 366.41) in the previous 24 months (2000 and 2001) and during 2002 from inpatient stays and/or outpatient visits on separate days (excluding codes from lab tests and other non-clinician visits), and prescriptions for insulin or oral hypoglycemic agents (VA classes HS501 or HS502, respectively) in 2002
Outcome measure
The primary outcome was glycosylated hemoglobin (HbA1c) level. In addition, a binary outcome defined as HbA1c ≥ 8.0% was used.
Primary independent variable
For this project, the primary research question was whether HbA1c differed significantly by race/ethnicity, classified as non-Hispanic white (NHW), non-Hispanic black (NHB), Hispanic, and other/unknown/missing.
Demographic variables
Age, gender, marital status (i.e., single or married) and percentage service-connectedness (i.e., degree of disability due to illness or injury that was aggravated by or incurred in military service) were available and treated as covariates in the model. Location of residence was defined as urban and rural/highly rural.
Comorbidity
Variables included substance abuse, anemia, cancer, cerebrovascular disease, congestive heart failure, cardiovascular disease, depression, hypertension, hypothyroidism, liver disease, lung disease, fluid and electrolyte disorders, obesity, psychoses, peripheral vascular disease, and other (AIDS, rheumatoid arthritis, renal failure, peptic ulcer disease and bleeding, weight loss) and were defined based on ICD-9 codes at entry into the cohort. In our final models, we included a categorical summary of the count of comorbidities (0=none, 1=one, 2=two, 3=three or more), a process which has been shown to be as or more efficient than more complicated algorithms.
Methods
Overview of the generalized linear mixed model (GLMM)
To model the relationship between HbA1c (Y) and covariates (X), a GLMM approach was used. For the $i$th subject ($i = 1, \ldots, N$) with $n_i$ repeated measurements ($j = 1, \ldots, n_i$), we considered the model $E(Y_i \mid X_i, Z_i) = g^{-1}(X_i\beta + Z_i b_i)$, where $g$ is a monotone link function, $Y_i$ is an $n_i \times 1$ vector of responses, $X_i$ is an $n_i \times p$ matrix of covariates, $Z_i$ is an $n_i \times q$ matrix of covariates ($q \le p$), $\beta$ is a $p \times 1$ vector of fixed effect parameters, and $b_i$ is a $q \times 1$ vector of random effects. We assume that $b_i \sim N(0, G)$, where $G$ is a $q \times q$ covariance matrix for $b_i$. An identity link function results in a linear mixed model for the continuous HbA1c outcome, and a logit link results in a logistic mixed effects model for the dichotomous HbA1c outcome. If $b_i$ is a vector of random intercept and slope, $G$ is a $2 \times 2$ covariance matrix that indicates natural heterogeneity among individuals in both their baseline level and changes in the expected outcome over time. In our models, a person-level random effect was included to account for within-individual correlations. This approach accommodates a wide range of distributional assumptions, multilevel data, measurement of subjects at different time points, modeling individual level effects, missing data, and time varying or invariant covariates.
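To make the two link functions concrete, the following is a minimal sketch in Python that simulates one subject's repeated measures from a random-intercept model, that is, with $Z_i$ taken to be a column of ones. The function name `simulate_subject`, the coefficient values, and the variance values are illustrative assumptions and are not taken from the study.

```python
import math
import random


def simulate_subject(x_rows, beta, intercept_var, link="identity",
                     resid_sd=1.0, rng=random):
    """Simulate one subject's repeated measures from a random-intercept GLMM.

    x_rows        : list of covariate rows, one per visit (each of length p)
    beta          : fixed-effect coefficients (length p)
    intercept_var : variance G of the random intercept, b_i ~ N(0, G)
    link          : "identity" (continuous outcome) or "logit" (binary outcome)
    resid_sd      : residual standard deviation, used only for the identity link
    """
    b_i = rng.gauss(0.0, math.sqrt(intercept_var))  # subject-specific effect
    responses = []
    for row in x_rows:
        # Linear predictor X_i beta + Z_i b_i, with Z_i a column of ones.
        eta = sum(x * b for x, b in zip(row, beta)) + b_i
        if link == "identity":
            responses.append(eta + rng.gauss(0.0, resid_sd))
        elif link == "logit":
            prob = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
            responses.append(1 if rng.random() < prob else 0)
        else:
            raise ValueError("link must be 'identity' or 'logit'")
    return responses


# Hypothetical design: intercept plus linear time over five annual visits.
visits = [[1.0, float(t)] for t in range(5)]
rng = random.Random(2012)
hba1c = simulate_subject(visits, beta=[7.5, -0.05], intercept_var=0.8, rng=rng)
poor_control = simulate_subject(visits, beta=[-1.0, 0.02], intercept_var=0.8,
                                link="logit", rng=rng)
```

Because the same `b_i` enters every visit, measurements within a subject are correlated, which is the within-individual correlation the person-level random effect is meant to capture.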
A special case is the linear mixed model given by $Y_i \mid X_i, Z_i = X_i\beta + Z_i b_i + e_i$, where $e_i \sim N(0, R_i)$ is independent of $b_i$. Assuming $R_i = \sigma^2 I_{n_i}$, the conditional distribution of $Y_i \mid b_i$ is multivariate Gaussian with mean $Z_i b_i + X_i\beta$ and variance $\sigma^2 I_{n_i}$. In this model, the response for the $i$th subject is assumed to differ from the population mean, $E(Y_i) = X_i\beta$, by a subject-specific effect, $Z_i b_i$, and a within-subject measurement error, $e_i$. The parameter estimates in a mixed model are the values that optimize an objective function, which is either the likelihood of the parameters given the observed data (ML) or a related objective function called the restricted ML (REML). In practice REML is often preferred. With $\alpha$ denoting the vector of all variance components in $G$ and $R_i$, the log-likelihood based on the observed data can be written as

$$\ell(\beta, \alpha) = -\frac{1}{2}\sum_{i=1}^{N}\left\{ n_i \log(2\pi) + \log\left|V_i(\alpha)\right| + (Y_i - X_i\beta)^{\top} V_i(\alpha)^{-1} (Y_i - X_i\beta) \right\} \qquad (1)$$

where $V_i(\alpha) = Z_i G Z_i^{\top} + R_i$ is the marginal covariance matrix of $Y_i$.
Weighted Generalized Linear Mixed Effects Model (WGLMM)
The WGLMM is a model well-suited for analysis of survey sampled data. We use it to analyze our type 2 diabetes cohort data in the context of finitely sampled data (e.g. VISN-stratified randomly sampled data).
In sample surveys, units are sometimes drawn with unequal selection probabilities, and if the design probabilities are informative (i.e. they are related to the response)
The generalizations of Equation (1) above to the special case of weighted linear mixed model (WLMM) can be described by the change in the conditional distribution $(Y_i \mid b_i) \sim N\left(Z_i b_i + X_i\beta,\; R_i(\alpha) W_i^{-1}\right)$ and
Meta regression approach
Another approach to deal with fitting parametric random effects models to VLDS is to do aggregated analysis after estimating the parameters at some level of administrative or sampling based subsets of the data. This can lead to substantial gain in the time required to fit these models and can be adapted to parallel processing, leading to further computational time savings. In the case of likelihood inference, this idea leads to a pseudolikelihood
Since VHA research data are provided at VISN level, models for each VISN can be fitted, and a mechanism to combine these parameter estimates is suggested. After models relating HbA1c and covariates are fitted for each VISN, the next step is to use pooling methods to obtain national estimates. This can be done using fixed effects
Fixed effects meta regression (FEMR)
Let $\psi_i$ be an effect of interest to estimate for VISN $i$. In our study these are the regression coefficients associated with covariates such as race/ethnicity in the GLMM. Let $\phi_i$ be the corresponding sample estimate. The fixed effects meta regression can be given by $\phi_i = \tau + \varepsilon_i$, where $\tau$ is the pooled mean or FEMR estimate and $\varepsilon_i \sim N(0, \sigma_i^2)$ is the random error. This can be adjusted for covariates via weighted regression, with each VISN-level estimate weighted by the inverse of its variance, $1/\sigma_i^2$.
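In the unadjusted case, the FEMR estimate reduces to the inverse-variance weighted mean of the VISN-level estimates. The following is a minimal sketch in Python under that simplification (no adjustment covariates); the function name `fixed_effect_pool` and the use of a normal-approximation 95% interval are assumptions made for illustration.

```python
import math


def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted (fixed effects) pooling of cluster estimates.

    estimates  : per-VISN coefficient estimates (phi_i)
    std_errors : their standard errors (sigma_i)
    Returns the pooled estimate, its standard error, and a 95% interval.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Normal-approximation interval (an assumption of this sketch).
    interval = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, interval
```

VISNs with smaller standard errors receive larger weights, so the pooled estimate is pulled toward the most precisely estimated VISN-level coefficients.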
Random effects meta regression (REMR)
In REMR, a standard one-step DerSimonian and Laird
where τ is the pooled mean or REMR estimate. The adjustment covariates can be VISN level covariates (z) or aggregates of individual level covariates (x) to account for additional causes of heterogeneity
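As an illustration of the pooling step, the following is a minimal Python sketch of the DerSimonian and Laird method-of-moments estimator without adjustment covariates. It also returns Cochran's Q and the I² statistic, which are commonly used to gauge heterogeneity across clusters. The function name and the returned keys are assumptions; because τ denotes the pooled mean in the text, the between-VISN variance is called `between_var` here to avoid a clash of notation.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Random effects pooling of cluster estimates (DerSimonian-Laird).

    estimates  : per-VISN coefficient estimates
    std_errors : their standard errors
    """
    k = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed effects mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))

    # Method-of-moments between-cluster variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    between_var = max(0.0, (q - (k - 1)) / c)

    # Re-weight with the between-cluster variance added to each variance.
    w_star = [1.0 / (se ** 2 + between_var) for se in std_errors]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    pooled_se = math.sqrt(1.0 / sum_w_star)

    i_squared = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": pooled_se,
        "ci": (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se),
        "q": q,
        "between_var": between_var,
        "i_squared": i_squared,
    }
```

When the VISN-level estimates agree closely, `between_var` is zero and the result coincides with the fixed effects pooled estimate; when they disagree, the interval widens to reflect the extra between-VISN variation.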
Summary of modelling strategies used
In this paper, we study two broad strategies for longitudinal analyses of VLDS: random effects meta regression (REMR) and estimation based on sampling of the full data (SRS and StRS). Within each strategy we model the continuous outcome of HbA1c using a linear mixed model and the binary outcome of HbA1c (<8% vs. ≥ 8%) using mixed effects logistic regression. The primary independent variable is race/ethnicity, and a number of subject-level covariates are included.
Test of homogeneity
The main goal of REMR and FEMR is to obtain a single global or pooled effect summarized across VISNs. However, obtaining pooled estimates assumes homogeneity of VISN-level effects. According to
Model selection
Although the purpose of this project was not to “select” an optimal model, model fit assessment was facilitated using maximum likelihood (or pseudolikelihood) information criteria, factors typically used in model selection. Two common approaches in the literature include Akaike information criterion (AIC)
Bootstrap simulation study design
Simulation studies based on 1000 repeated resamplings of the full data (sample sizes: 1%, 5%, 10% and 25%) are used to assess and compare the methods discussed above. This is implemented via a nonparametric bootstrapping approach.
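The resampling design can be sketched in Python as follows. This is a simplified illustration rather than the study's implementation: it assumes that subjects (not individual measurements) are sampled without replacement, that the stratified design samples the same fraction within every VISN, and that a percentile interval summarizes the replicate estimates. The function names are hypothetical, and `estimator` stands in for whatever model fit is applied to each sample.

```python
import random
from collections import defaultdict


def simple_random_sample(subject_ids, fraction, rng):
    """Draw a simple random sample of subjects, without replacement."""
    n = max(1, round(fraction * len(subject_ids)))
    return rng.sample(subject_ids, n)


def stratified_random_sample(subject_visn, fraction, rng):
    """Sample the same fraction of subjects within every VISN.

    subject_visn maps subject id -> VISN. Returns subject id -> sampling
    weight (the inverse of the within-VISN selection probability).
    """
    by_visn = defaultdict(list)
    for subject, visn in subject_visn.items():
        by_visn[visn].append(subject)
    weights = {}
    for members in by_visn.values():
        n = max(1, round(fraction * len(members)))
        for subject in rng.sample(members, n):
            weights[subject] = len(members) / n
    return weights


def percentile_interval(values, level=0.95):
    """Nearest-rank percentile interval from replicate estimates."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper


def replicate_estimates(subject_ids, fraction, n_reps, estimator, rng):
    """Apply `estimator` to each of n_reps simple random samples."""
    return [estimator(simple_random_sample(subject_ids, fraction, rng))
            for _ in range(n_reps)]
```

Sampling at the subject level keeps each sampled subject's full set of repeated measurements together, so the longitudinal structure is preserved within every replicate.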
Hardware
All analyses for this investigation were run on a Dell PowerEdge 2900 III server with two dual core Intel Xeon X5260 processors with 6 megabyte cache, a clock speed of 3.33 gigahertz, and a front-side bus of 1333 megahertz. The server has been configured with 12 four gigabyte (GB), 667MHz dual ranked dual inline memory modules for a total of 48GB of RAM. Data are stored on six one terabyte (TB) 7200 revolution per minute nearline serial attached small computer system interface, 3GB per second 3 ½ inch HotPlug hard drives forming a 3TB redundant array of independent disk level 5 storage system. This server runs a 64-bit version of Windows 2003 R2 Enterprise X64 Edition Service Pack 2 operating system.
Software
Datasets were organized for this study using SAS version 9.2.2 (Cary, NC), and SAS transport data sets were created. Data were read into a 64-bit version of R for Windows 2.11.1 (R Development Core Team 2010) using the “Hmisc”
Results
The full cohort consisted of 890,394 Veterans with diabetes followed from 2002 through 2006. The cohort is characterized based on demographics, HbA1c and comorbidities in Table
| Analysis variable | Full cohort (n=890,394) | 25% (n=225,000) | 10% (n=90,000) | 5% (n=45,000) | 1% (n=9,000) | REMR (n=890,394) |
|---|---|---|---|---|---|---|
| Non-Hispanic White: % (n) | 62 (547,645) | 61 (138,470) | 62 (55,489) | 62 (27,853) | 62 (5,529) | 62 (547,645) |
| Non-Hispanic Black: % (n) | 12 (107,935) | 12 (27,317) | 12 (10,941) | 12 (5,406) | 12 (1,097) | 12 (107,935) |
| Hispanic: % (n) | 14 (123,558) | 14 (31,062) | 14 (12,481) | 14 (6,148) | 14 (1,285) | 13 (123,558) |
| Other: % (n) | 12 (111,256) | 13 (28,151) | 12 (11,089) | 12 (5,593) | 12 (1,089) | 13 (111,256) |
| Male: % (n) | 98 (869,508) | 98 (219,708) | 98 (87,921) | 98 (43,947) | 98 (8,794) | 98 (869,508) |
| Married: % (n) | 65 (574,307) | 64 (145,060) | 65 (58,222) | 64 (29,002) | 65 (5,853) | 64 (574,307) |
| Disability (mean % & sd) | 12 (0.03) | 12 (0.06) | 12 (0.09) | 12 (0.13) | 13 (0.30) | 12 (0.63) |
| Northeast | 12 (103,056) | 12 (25,994) | 11 (10,274) | 12 (5,272) | 12 (1,074) | NA (103,056) |
| Mid-Atlantic | 23 (201,058) | 22 (50,579) | 23 (20,328) | 23 (10,230) | 23 (2,000) | NA (201,058) |
| Midwest | 21 (184,348) | 21 (46,940) | 21 (18,658) | 21 (9,368) | 20 (1,827) | NA (184,348) |
| South | 30 (265,450) | 30 (66,988) | 30 (26,759) | 29 (13,189) | 30 (2,707) | NA (265,450) |
| West | 15 (136,482) | 15 (34,499) | 16 (13,981) | 15 (6,941) | 16 (1,392) | NA (136,482) |
| Urban Residence | 62 (548,786) | 61 (138,339) | 61 (55,324) | 62 (27,701) | 61 (5,513) | 61 (548,786) |
| Rural Residence | 38 (341,608) | 39 (85,612) | 39 (34,676) | 38 (17,299) | 39 (3,487) | 39 (341,608) |
| Mean HbA1c (mean % & sd) | 7.4 (0.002) | 7.4 (0.003) | 7.4 (0.005) | 7.4 (0.007) | 7.4 (0.016) | 7.5 (0.030) |
| Mean HbA1c<8%: % (n) | 73 (703,596) | 73 (177,751) | 73 (71,195) | 73 (34,498) | 71 (7,112) | 70 (703,596) |
| No Comorbidities | 57 (507,320) | 57 (128,326) | 57 (51,178) | 57 (25,506) | 57 (5,143) | 57 (507,320) |
| 1 Comorbidity | 28 (248,898) | 28 (62,961) | 28 (25,309) | 28 (12,655) | 27 (2,456) | 28 (248,898) |
| 2 Comorbidities | 11 (95,542) | 11 (23,998) | 11 (9,706) | 11 (4,898) | 11 (1,022) | 11 (95,542) |
| 3+ Comorbidities | 4 (38,634) | 4 (9,715) | 4 (3,807) | 4 (1,941) | 4 (379) | 4 (38,634) |

NA: Not applicable due to sampling by VISN or aggregation by VISN.
Linear mixed models were used to model continuous HbA1c levels in both the full cohort and each of the SRS and StRSs (using the weighted approach described in section 3.2). Parameter estimates associated with betas representing the different race/ethnic and comorbidity groupings, standard errors of the betas, and 95% confidence intervals are reported in Tables
Simple random sample (SRS)

| Parameter | Sample (%) | Intercept | Non-Hispanic Black | Hispanic | Other | 1 Comorbidity | 2 Comorbidities | 3+ Comorbidities |
|---|---|---|---|---|---|---|---|---|
| β (95% CI) | 100 | 7.54 (7.52, 7.55) | 0.46 (0.45, 0.46) | 0.29 (0.28, 0.30) | 0.25 (0.23, 0.25) | 0.01 (0.01, 0.02) | 0.04 (0.04, 0.05) | 0.11 (0.11, 0.13) |
| | 25 | 7.59 (7.55, 7.61) | 0.46 (0.44, 0.47) | 0.31 (0.28, 0.32) | 0.24 (0.22, 0.25) | 0.01 (0.00, 0.02) | 0.02 (0.01, 0.04) | 0.10 (0.08, 0.13) |
| | 10 | 7.54 (7.48, 7.58) | 0.47 (0.44, 0.48) | 0.30 (0.26, 0.32) | 0.26 (0.23, 0.27) | 0.03 (0.02, 0.05) | 0.08 (0.07, 0.12) | 0.08 (0.05, 0.13) |
| | 5 | 7.54 (7.48, 7.62) | 0.44 (0.41, 0.47) | 0.28 (0.23, 0.32) | 0.27 (0.24, 0.30) | 0.03 (0.01, 0.06) | 0.05 (0.02, 0.09) | 0.13 (0.08, 0.18) |
| SE | 100 | 0.0115 | 0.005 | 0.007 | 0.005 | 0.0037 | 0.0054 | 0.0079 |
| | 25 | 0.0115 | 0.005 | 0.007 | 0.005 | 0.0037 | 0.0054 | 0.0080 |
| | 10 | 0.0115 | 0.005 | 0.007 | 0.005 | 0.0037 | 0.0054 | 0.0080 |
| | 5 | 0.0116 | 0.005 | 0.0069 | 0.005 | 0.0037 | 0.0053 | 0.0079 |

Stratified random sampling (StRS)

| Parameter | Sample (%) | Intercept | Non-Hispanic Black | Hispanic | Other | 1 Comorbidity | 2 Comorbidities | 3+ Comorbidities |
|---|---|---|---|---|---|---|---|---|
| β (95% CI) | 25 | 7.61 (7.57, 7.63) | 0.47 (0.45, 0.48) | 0.28 (0.26, 0.29) | 0.26 (0.24, 0.27) | 0.01 (0, 0.02) | 0.03 (0.02, 0.05) | 0.11 (0.11, 0.15) |
| | 10 | 7.58 (7.53, 7.63) | 0.46 (0.43, 0.48) | 0.28 (0.25, 0.30) | 0.26 (0.23, 0.28) | 0.0 (0.01, 0.02) | 0.05 (0.03, 0.08) | 0.16 (0.13, 0.2) |
| | 5 | 7.61 (7.54, 7.68) | 0.38 (0.35, 0.41) | 0.30 (0.26, 0.35) | 0.25 (0.21, 0.28) | 0.02 (0.0, 0.05) | 0.05 (0.02, 0.09) | 0.09 (0.05, 0.15) |
| SE | 25 | 0.0111 | 0.0049 | 0.0068 | 0.005 | 0.0037 | 0.0054 | 0.0079 |
| | 10 | 0.0111 | 0.0049 | 0.0068 | 0.0049 | 0.0037 | 0.0054 | 0.0078 |
| | 5 | 0.0111 | 0.0050 | 0.0069 | 0.005 | 0.0037 | 0.0053 | 0.0079 |

Random effects meta-regression without VISN 13 & 14 (REMR)

| Parameter** | Sample (%) | Intercept | Non-Hispanic Black | Hispanic | Other | 1 Comorbidity | 2 Comorbidities | 3+ Comorbidities |
|---|---|---|---|---|---|---|---|---|
| β (95% CI) | 100 | 7.58 (7.54, 7.62) | 0.45 (0.41, 0.49) | 0.08 (0.04, 0.12) | 0.23 (0.19, 0.27) | 0.01 (0.04, 0.05) | 0.03 (0.01, 0.07) | 0.09 (0.05, 0.13) |

*Independent variables used in fitting the linear mixed model were: linear time; race (non-Hispanic white reference, indicator variables); sex (female reference); marital status (single reference), service disability percentage, residence status (urban/rural, rural reference), VISN region (Northeast, Mid-Atlantic, South, Midwest, and West), and number of comorbidities (1, 2, or 3+; none reference).
** Veterans Integrated Service Networks (VISNs) 13 and 14 are excluded in all these models.
The REMR estimates, on the other hand, are very close to the full sample estimates. For example, the beta estimate for NHB indicated that HbA1c levels were 0.45 (0.41, 0.49) higher in NHB than NHW in REMR, which is comparable to 0.46 (0.45, 0.46) in the full cohort. Similarly, for three or more comorbidities the full cohort results were 0.11 (0.11, 0.13) while REMR resulted in 0.09 (0.05, 0.13). In all these models, the intercept was very well approximated even in the 1% sampled data. It should be noted that REMR can be highly affected by outliers in the estimates that are aggregated to get the final estimates. In our case, VISNs 13 and 14 exhibited extreme values and hence were removed to maintain the homogeneity assumption required by REMR in order to get unbiased estimates. In Table
Additional tables and figures that show results for the full model, which includes all the covariates under several scenarios, are in the Appendix, along with tables that include the 1% scenario and REMR results that include VISNs 13 and 14. SAS macros for the procedures we implemented to analyze SRS, StRS and REMR are also available on our website.
Table
Simple random sample (SRS)

| Parameter | Sample (%) | Intercept | Non-Hispanic Black | Hispanic | Other | 1 Comorbidity | 2 Comorbidities | 3+ Comorbidities |
|---|---|---|---|---|---|---|---|---|
| β (95% CI) | 100 | 0.94 (0.98, 0.91) | 0.62 (0.02, 0.01) | 0.45 (0.43, 0.48) | 0.36 (0.35, 0.38) | 0.07 (0.06, 0.08) | 0.15 (0.13, 0.17) | 0.27 (0.24, 0.29) |
| | 25 | 1.93 (2.17, 1.69) | 1.27 (1.17, 1.37) | 1.07 (0.93, 1.20) | 0.87 (0.77, 0.97) | 0.12 (0.05, 0.21) | 0.19 (0.06, 0.31) | 0.39 (0.19, 0.58) |
| | 10 | 2.48 (2.93, 2.04) | 1.39 (1.20, 1.58) | 1.27 (1.01, 1.53) | 0.97 (0.78, 1.15) | 0.16 (0.01, 0.31) | 0.36 (0.14, 0.59) | 0.44 (0.07, 0.80) |
| | 5 | 2.16 (2.87, 1.45) | 1.69 (1.40, 1.99) | 1.10 (0.70, 1.50) | 1.05 (0.75, 1.35) | 0.39 (0.16, 0.62) | 0.34 (0.09, 0.80) | 0.33 (0.24, 0.90) |
| SE | 100 | 0.0182 | 0.0076 | 0.0105 | 0.0078 | 0.0053 | 0.0085 | 0.0125 |
| | 25 | 0.1221 | 0.0514 | 0.0691 | 0.0512 | 0.0402 | 0.0621 | 0.3896 |
| | 10 | 0.2269 | 0.0965 | 0.1308 | 0.0959 | 0.0748 | 0.1163 | 0.1858 |
| | 5 | 0.3608 | 0.1505 | 0.2038 | 0.1520 | 0.1175 | 0.1821 | 0.2905 |

Stratified random sampling (StRS)

| Parameter | Sample (%) | Intercept | Non-Hispanic Black | Hispanic | Other | 1 Comorbidity | 2 Comorbidities | 3+ Comorbidities |
|---|---|---|---|---|---|---|---|---|
| β (95% CI) | 25 | 1.83 (2.07, 1.59) | 1.25 (0.01, 0.01) | 0.90 (0.76, 1.03) | 0.88 (0.78, 0.98) | 0.10 (0.02, 0.18) | 0.30 (0.17, 0.42) | 0.75 (0.56, 0.94) |
| | 10 | 2.34 (2.78, 1.89) | 1.47 (1.29, 1.66) | 1.03 (0.78, 1.29) | 1.00 (0.81, 1.19) | 0.07 (0.08, 0.21) | 0.24 (0.02, 0.47) | 0.69 (0.34, 1.05) |
| | 5 | 2.65 (3.35, 1.95) | 1.44 (1.14, 1.74) | 1.74 (1.34, 2.15) | 1.10 (0.81, 1.40) | 0.20 (0.03, 0.43) | 0.16 (0.19, 0.52) | 0.80 (0.22, 1.39) |
| SE | 25 | 0.1209 | 0.0514 | 0.0690 | 0.0512 | 0.0401 | 0.0620 | 0.0984 |
| | 10 | 0.2272 | 0.0955 | 0.1282 | 0.0949 | 0.0745 | 0.1154 | 0.1813 |
| | 5 | 0.3561 | 0.1531 | 0.2060 | 0.1505 | 0.1175 | 0.1817 | 0.2984 |

Random effects meta-regression without VISN 13 & 14 (REMR)

| Parameter** | Sample (%) | Intercept | Non-Hispanic Black | Hispanic | Other | 1 Comorbidity | 2 Comorbidities | 3+ Comorbidities |
|---|---|---|---|---|---|---|---|---|
| β (95% CI) | 100 | 0.93 (0.99, 0.87) | 0.58 (0.52, 0.64) | 0.11 (0.05, 0.17) | 0.32 (0.26, 0.38) | 0.07 (0.01, 0.13) | 0.14 (0.08, 0.20) | 0.25 (0.19, 0.31) |

†Independent variables used in fitting the generalized linear mixed model using a binomial distribution with a logit link function were: linear time; race (non-Hispanic white reference, indicator variables); sex (female reference); marital status (single reference), service disability percentage, residence status (urban/rural, rural reference), and number of comorbidities (1, 2, or 3+; none reference).
** Veterans Integrated Service Networks (VISNs) 13 and 14 are excluded in all these models.
Figure: LMM parameter estimates and pooled 95% confidence bounds for random effects meta-regression (intercept, race) without Veterans Integrated Service Networks (VISNs) 13 and 14. * Independent variables used in fitting the model were: linear time; race (non-Hispanic white reference, indicator variables); sex (female reference); service disability percentage, marital status (single reference), residence status (urban/rural, rural reference), VISN region (Northeast, Mid-Atlantic, South, Midwest, and West, South reference); and number of comorbidities (1, 2, or 3+; none reference).
Figure: GLMM parameter estimates and pooled 95% confidence bounds for random effects meta-regression (intercept, race) without Veterans Integrated Service Networks (VISNs) 13 and 14. * Independent variables used in fitting the model were: linear time; race (non-Hispanic white reference, indicator variables); sex (female reference); service disability percentage, marital status (single reference), residence status (urban/rural, rural reference), VISN region (Northeast, Mid-Atlantic, South, Midwest, and West, South reference); and number of comorbidities (1, 2, or 3+; none reference).
Figure: Akaike’s Information Criterion (AIC) and Bayesian Information Criterion (BIC) for LMM (top two) and GLMM (bottom two). * Independent variables used in fitting the model were: linear time; race (non-Hispanic white reference, indicator variables); sex (female reference); service disability percentage, marital status (single reference), residence status (urban/rural, rural reference), VISN region (Northeast, Mid-Atlantic, South, Midwest, and West, South reference); and number of comorbidities (1, 2, or 3+; none reference).
Additional results corresponding to the analysis of the original full data such as the distribution of subjects in each VISN (Additional file
Discussion and conclusion
Models with random effects are useful for patient level inference just as marginal models are useful for population level inference. However, for very large data sets, it can be difficult to fit models with random effects using commonly available statistical software such as SAS. There are very few papers on this topic and the most recent work involves a 2-stage Bayesian algorithm
This study assesses and compares REMR to two sampling based approaches using bootstrap simulation studies. Our results indicate that REMR provides parameter estimates that are less likely to be biased, with smaller standard errors, when the VISN level estimates are homogeneous. The sampling approaches also provide parameter estimates that were equivalent to the full data estimates except when the outcome variable was binary. Thus, when the interest is to fit random effects models in repeated measures data with very large sample size, REMR may be used as a good alternative.
Some ad hoc approaches can also be considered to ameliorate the challenges with the double optimization required when fitting GLMM to VLDS. For example, SAS Proc HPMIXED was developed to fit LMM to VLDS and provides computational advantages over Proc MIXED in certain situations. Sorting the data by variables that need to be in the CLASS statement of Proc MIXED or GLIMMIX, such as random effect subject identifiers, may also alleviate the computational burden. However, these methods often cannot overcome the computational challenges with very large data sets, like those mentioned in the introduction, which makes REMR attractive.
One of the key problems with REMR is handling situations involving heterogeneous parameter estimates. For example, Additional file
Our work demonstrates a variety of approaches that may be used in analyses of VLDSs, especially when observations are clustered, such as in a longitudinal setting. Our simulation results show that SRS and StRS approaches appear to lead to reasonable parameter estimates with Gaussian responses but may be biased when responses are non-Gaussian (e.g., binary). REMR may be an optimal strategy for both Gaussian and non-Gaussian responses, especially when parameter estimates are homogeneous across clusters.
Abbreviations
CI: Confidence interval; FEMR: Fixed effects meta regression; GLMM: Generalized linear mixed model; LMM: Linear mixed model; NHB: Non-Hispanic black; NHW: Non-Hispanic white; REM: Random effects model; REMR: Random effects meta regression; SRS: Simple random sample; StRS: Stratified random sample; VISN: Veterans Integrated Service Network; VLDS: Very large data sets; VHA: Veterans Health Administration; WGLMM: Weighted GLMM; WLMM: Weighted LMM.
Competing interests
None of the authors have any financial disclosure or conflict of interest to report.
Authors' contributions
Study concept and design: MG. Acquisition of data: PM and LE. Analysis and interpretation of data: MG, LE, PM, KH, GG, and PN. Drafting of the manuscript: MG, KH, PM, GG, PN, and LE. Critical revision of the manuscript for important intellectual content: MG, LE, PM, PN, and KH. Study supervision: MG and LE. All authors read and approved the final manuscript.
Acknowledgments
1. The manuscript represents the views of the authors and not those of the Department of Veterans Affairs or the United States Government.
2. All authors had access to the data and contributed to the manuscript.
Funding
This work was supported by the Veterans Health Administration Health Services Research and Development (HSR&D) program [grant #REA 08261, Center for Disease Prevention and Health Interventions for Diverse Populations]. The funding agency did not participate in the design and conduct of the study; collection, management, analysis, and interpretation of the data; or preparation, review, and approval of the manuscript.