Department of Psychology, University of North Florida, 1 UNF Drive, Jacksonville, FL, 32224, USA

Abstract

Background

Multilevel models (MLM) offer complex survey data analysts a unique approach to understanding individual and contextual determinants of public health. However, little summarized guidance exists with regard to fitting MLM in complex survey data with design weights. Simulation work suggests that analysts should scale design weights using two methods and fit the MLM using unweighted and scaled-weighted data. This article examines the performance of scaled-weighted and unweighted analyses across a variety of MLM and software programs.

Methods

Using data from the 2005–2006 National Survey of Children with Special Health Care Needs (NS-CSHCN:

Results

Scaled-weighted estimates and standard errors differed slightly from unweighted analyses, agreeing more with each other than with the unweighted results. However, observed differences were minimal and did not lead to different inferential conclusions. Likewise, results demonstrated minimal differences across software programs, increasing confidence in results and inferential conclusions independent of software choice.

Conclusion

If including design weights in MLM, analysts should scale the weights and use software that properly includes the scaled weights in the estimation.

Background

Introduction

Multilevel models (MLM) offer analysts of large scale, complex survey data a relatively new approach to understanding individual and contextual influences on public health. Complex sampling designs organize populations into clusters (e.g., states or counties) and then collect data

MLM offer a unique solution to this problem. They take into account the clustered nature of the data

However, despite the unique contribution MLM can make to understanding public health,

In this paper, I take a non-mathematical approach and seek to address these issues. First, I briefly summarize the results of simulation studies and suggest a current best practice recommendation with regard to handling design weights in MLM. Second, I compare and contrast the results of different methods for incorporating sampling weights across a series of MLM using continuous and categorical outcomes and level-1 (individual) and level-2 (cluster) predictors in empirical data across three of the main MLM software programs: Mplus,

Incorporating Design Weights in Multilevel Models

Summary of Simulation Work

Complex sampling designs regularly incorporate unequal selection probabilities. Failing to account for this aspect of the design in the standard MLM can lead to biased parameter estimates.

Some consistent themes result from this work. First, simulations indicate that most scaling methods consistently provide better estimates than using

Fourth, the simulations point to a need for

Recommendations

Given that no study will include all possible manifestations of complex survey designs and relations among the data, it is impossible to disentangle these issues and arrive at a single gold standard.

This suggestion, using multiple scaling techniques, points to an important issue. To properly conduct MLM with complex survey data and design weights, analysts need software that can include weights scaled outside of the program and include the "new" scaled weights without automatic program modification. Currently, three of the major MLM software programs allow this: Mplus (5.2)

In sum, simulation work suggests that analysts should fit complex survey data with design weights using a variety of scaling methods (including unweighted) and compare the results of these methods. However, little work provides a comparison of the different scaling methods in real data across a variety of MLM (e.g., continuous vs. categorical outcomes, level-1 predictor models, level-2 predictor models, and models including level-1 and -2 predictors simultaneously) and software programs. Thus, it remains unclear whether real data will reflect simulation work. In the next section, I address this issue. I use data from the 2005–2006 National Survey of Children with Special Health Care Needs

Methods

Comparing Scaling Methods and Software in Real Data

To examine the performance of the various scaling methods, I fit two series of MLM. I chose these models because they represent the basic models presented by major texts on MLM (e.g., Raudenbush and Bryk

The first series of MLM I estimated examines a continuous outcome (the number of months CSHCN go without insurance) as a function of a level-1 predictor (family income relative to poverty level, hereafter labeled simply "family income") and a level-2 predictor (the proportion of families in the state with an income no greater than twice the US federal poverty level, i.e., the 200% poverty level, hereafter labeled simply "proportion of families in poverty"). The second series of MLM examines a categorical outcome (whether a CSHCN went uninsured at any time in the previous 12 months) as a function of the same level-1 predictor (family income) and level-2 predictor (proportion of families in poverty).

For both series, I fit six models:

1. an unconditional (empty) model,
2. a level-1 predictor only model specifying the level-1 slope as fixed,
3. a level-1 predictor only model that allowed the level-1 slope to vary across the states (level-2),
4. a level-2 predictor only model,
5. a model including level-1 and -2 predictors but no cross-level interaction, and
6. a model including level-1 and -2 predictors and a cross-level interaction.

For each series of analyses (continuous and categorical), the unconditional model examines whether the outcome (average number of months uninsured or odds of going without insurance) varies across states. The level-1 predictor only model asks whether family income predicts the outcome, while the level-2 predictor only model investigates whether the proportion of families in poverty in a state affects the outcome. The model including level-1 and level-2 predictors investigates the contributions of both predictors simultaneously, but does not include a cross-level interaction; among other questions, it asks whether a relationship between family income and the outcome exists, controlling for the effects of the proportion of families in poverty in the state. The final model investigates the level-1 and level-2 predictors simultaneously and includes a cross-level interaction. This model asks several questions as well, including whether the relationship between family income and months without insurance differs according to the proportion of families in poverty in a state. For each series, all models allowed the intercept to vary across the states. Appendix C presents traditional MLM equations for each model I estimate.

For each series I fit the models in Mplus, MLwiN, and GLLAMM using unweighted data, scaling method A, and scaling method B. For Mplus, I used the MLR estimator for both the continuous and categorical analyses. MLR delivers maximum likelihood parameter estimates with robust standard errors computed using a sandwich estimator. For categorical outcomes, MLR uses numerical integration with adaptive quadrature and 15 integration points per dimension. For MLwiN, I used 1st order marginal quasi-likelihood (MQL) estimation and IGLS to obtain starting values. I then used the 1st order MQL estimates as starting values for 2nd order predictive (penalized) quasi-likelihood (PQL) estimation and IGLS to obtain final values. For both continuous and categorical outcomes in MLwiN, I requested robust standard errors. For all GLLAMM models, I initially used adaptive quadrature with 8 quadrature points. Consistent with Rabe-Hesketh et al.'s recommendation,

Results

First, consider the continuous results presented in Additional File

**Continuous outcome parameter and standard error estimates across level-1, level-2, combined-level models, weight scaling methods, and software programs**. The data provided present the multilevel results across continuous models, weight scaling methods, and software programs.


With regard to the weighted analyses, across the fixed and random effects, the programs achieved nearly identical results, with two exceptions. MLwiN estimated a smaller residual variance and residual variance standard error using weight method B than either Mplus or GLLAMM. Likewise, MLwiN's estimate of the slope for state poverty and its standard error diverged slightly (but consistently) from Mplus and GLLAMM at the second decimal place under all scaled-weighting analyses. To investigate the source of these differences, I reran these analyses with increasingly stringent convergence criteria. In all cases, MLwiN arrived at the same estimate of the residual variance. This suggests that the discrepancy results not from convergence issues but from estimation differences. In this case, the small difference led to

For the categorical outcome presented in Additional File

**Categorical outcome parameter and standard error estimates across level-1, level-2, combined-level models, weight scaling methods, and software programs** (Mplus estimates a threshold rather than an intercept; these differ only in sign. For presentation, I converted the threshold to an intercept). The data provided present the multilevel results across categorical models, weight scaling methods, and software programs.


Somewhat surprisingly, though the standard errors for the scaled-weighted data were somewhat larger than those from the unweighted analyses, the standard errors for the unweighted and scaled-weighted methods achieved remarkable consistency. This may have occurred because of the large cluster sizes in the NS-CSHCN (approximately 750 individuals in each cluster). It may also have occurred because of a relatively small intraclass correlation coefficient (a measure of the proportion of variance in the outcome attributable to clustering alone) for this outcome (e.g., 0.01 for months uninsured). It also suggests that, in these data, for these outcomes, and these predictors, the sampling weights are not particularly informative (Table
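The intraclass correlation coefficient mentioned above follows directly from a model's variance components. A minimal sketch (the variance values below are hypothetical illustrations, not the NS-CSHCN estimates):

```python
def icc(tau00, sigma2):
    """Intraclass correlation: proportion of total outcome variance
    attributable to clustering (between-cluster variance over total)."""
    return tau00 / (tau00 + sigma2)

# With a hypothetical between-state intercept variance of 0.038 against a
# within-state residual variance of 3.738, the ICC is roughly 0.01 --
# very little of the variance lies between clusters.
print(round(icc(0.038, 3.738), 3))
```

With an ICC this small, clustering carries little information about the outcome, which is consistent with the modest differences observed between weighted and unweighted standard errors.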

Single Level Continuous and Categorical outcome parameter and standard error estimates (all models fit in Mplus 5.1).

Continuous Single Level Analysis

                                  Unweighted   Weight Method A   Weight Method B
Fixed Effects
β0 (Intercept for MS_UNINS)          0.453          0.429             0.433
   SE                                0.010          0.011             0.011
β1 (Slope for Family Income)        -0.088         -0.076            -0.076
   SE                                0.004          0.005             0.005
Random Effects
Residual Variance
   (Variation within States)         3.738          3.697             3.735
   SE                                0.096          0.110             0.111

Categorical Single Level Analysis

                                  Unweighted   Weight Method A   Weight Method B
Fixed Effects
β0 (Intercept for MS_UNINS)         -2.513         -2.532            -2.524
   SE                                0.019          0.022             0.022
β1 (Slope for Family Income)        -0.198         -0.183            -0.183
   SE                                0.006          0.007             0.007

Discussion

Summary

In sum, the present results generally agree with simulation work. Scaled weighted findings diverged slightly from unweighted analyses, agreeing more with each other than with unweighted analyses. Also consistent with simulation work, weighted and unweighted data did not diverge greatly in general. However, while estimates and standard errors generally remained comparable, small specific changes did result. Although they did not lead to different inferential decisions in these data, they might in other data.

Software Strengths and Weaknesses

Strengths

Mplus has several strengths. First, it has tremendous flexibility and can incorporate numerous statistical models within the MLM framework well beyond "traditional" hierarchical linear and generalized linear models; one can fit factor models, latent class models, structural equation models, mixture models, latent growth curve models, and others. Second, Mplus will automatically scale the weights for the user using each approach described here, and it also allows analysts to specify weights scaled outside of Mplus. Third, Mplus offers a wide variety of estimators and link functions. Fourth, Mplus can handle both of the currently recommended methods for analyzing subpopulations of complex survey data, the zero-weight approach and the multiple-group approach.

MLwiN also incorporates several strengths. First, MLwiN can fit models with up to five levels, making it quite useful in multistage designs. Second, MLwiN has an easy point-and-click, windows-based user interface, which makes fitting MLMs straightforward. Third, MLwiN incorporates several estimators. Fourth, like Mplus, MLwiN provides an automatic weight scaling feature and allows the user to specify weights scaled outside of MLwiN. Fifth, MLwiN has numerous features available for evaluating a model's appropriateness. And, sixth, MLwiN includes several graphical features.

Finally, GLLAMM also has some distinct advantages. Like Mplus, GLLAMM offers an astounding array of models that it can fit within the MLM framework.

Weaknesses

Despite its strengths, Mplus has some distinct disadvantages. First, it can only fit two-level cross-sectional MLMs. Although one can fit a two-level MLM and use Mplus' complex data analysis feature to properly estimate standard errors for a third level, Mplus does not allow one to investigate what predicts variation at level-3. For multistage surveys, this may be a substantial limit. Second, relative to MLwiN, Mplus offers few analytical tools for investigating model assumptions, model fit, and model diagnostics. Third, relative to MLwiN, Mplus offers few graphical tools. Whether these limits outweigh its strengths will depend on the individual user's needs.

MLwiN also has limits. Primarily, it cannot fit the wide variety of models that Mplus and GLLAMM can (e.g., latent class models). While MLwiN can fit some models beyond hierarchical linear and generalized linear models (e.g., multilevel confirmatory factor analyses), MLwiN does not have the full flexibility that Mplus and GLLAMM do. For users seeking to fit extremely complex models, this may be a substantial drawback. Second, MLwiN will only automatically scale the weights using method B. And third, while MLwiN does offer several estimators (e.g., iterative generalized least squares (IGLS), restricted IGLS, and Markov chain Monte Carlo (MCMC)), it does not offer as large a range of estimators as Mplus. Again, whether these weaknesses outweigh its strengths will depend primarily on the type of analysis the user expects to conduct.

Finally, GLLAMM has some noteworthy disadvantages. First, GLLAMM has well known problems with computational speed. Models that take seconds to converge in the other programs can take days (literally) to converge in GLLAMM. Aside from some minor adjustments, analysts can do little to increase GLLAMM's speed. Second, although GLLAMM has an advantage with categorical outcomes, it may be less accurate with continuous outcomes.

Limitations

Although these analyses generally support the use of MLM in complex survey data with design weights, some issues remain unresolved. First, a best practice for scaling weights across multiple levels has yet to be advanced. Though Asparouhov

Third, MLM theoretically allow investigators to examine predictors and variance across naturally occurring clusters within complex sampling design (e.g., creating a three-level model by grouping individuals according to their county of residence using data from a two-level survey that sampled people within states). However, this flexibility may result in cross-classified data structures (e.g., hospital catchment areas overlapping states in a survey that sampled people within states). While MLM can handle cross-classified data,

Fourth, analysts often wish to investigate relationships within a certain subgroup. Although analysts can use interaction terms to investigate hypotheses within the specified subgroup, analysts may wish to examine a subgroup of the sample excluding other sample members entirely. For this situation, where analysts wish to investigate hypotheses among a specific subgroup only, no established guidelines exist regarding a best practice method for estimating MLM in complex survey data with design weights. When using complex surveys, one should include the entire sample in the analyses. This leaves the sample design structure whole and leads to proper estimation of variances and standard errors. However, it presents a problem when analysts would like to select a subgroup and examine a MLM for this subgroup of individuals in a sample. Analysts should
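One common implementation of the zero-weight approach keeps every sampled case in the file, preserving the design structure, but sets the design weight to zero for cases outside the subgroup. A hedged pandas sketch under that assumption (the column names are hypothetical):

```python
import pandas as pd

def zero_weight_subpop(df, in_subpop, weight="weight_i"):
    """Retain all rows so the design stays whole; zero the weight
    for cases outside the subpopulation of interest."""
    out = df.copy()
    out.loc[~in_subpop, weight] = 0.0
    return out

# Toy file: keep only uninsured children "in" the analysis
df = pd.DataFrame({"weight_i": [1.5, 2.0, 0.5], "uninsured": [1, 0, 1]})
sub = zero_weight_subpop(df, df["uninsured"] == 1)
print(sub["weight_i"].tolist())  # the out-of-subgroup case now carries weight 0.0
```

Whether this approach yields proper multilevel variance estimates in every design remains, as noted above, an open question.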

Finally, little work addresses missing data's role in MLM with design weights. It remains unclear how to best handle missing data within the context of MLM, complex survey data, and design weights. Analysts might take a zero-weighting approach for missing data,

While these limits highlight an array of outstanding issues that need investigation, they do not preclude analysts from employing MLM in complex survey data with design weights. Moreover, they demonstrate the need to choose a MLM program that allows flexibility with regard to design weights. Thus, as theory advances, software will not limit analyses.

Applied Summary Recommendations

Given the breadth of findings discussed and presented and the various strengths and weaknesses of each approach and software program, the reader might now wonder, "what to do in practice?" In my work, I take the following standard approach. First, in terms of software, I generally use Mplus. I do this because Mplus offers the most flexibility relative to speed. I frequently fit models that MLwiN cannot estimate (e.g., MLM multiple group structural equation models) and I rarely fit models with more than two levels (which Mplus currently cannot estimate). Analysts fitting the types of models discussed in this paper will generally find that MLwiN more than meets their needs. Second, in terms of scaling the weights, I always fit the models using each scaling technique (methods A and B). I do this to examine any inferential discrepancies. If I find no inferential discrepancies, I generally report the findings from method A. I do this because I frequently work with cluster sizes larger than

Conclusion

In summary, recent advances in statistical theory and software now allow users of complex survey data with design weights to analyze data in a MLM framework. This paper shows the utility of conducting MLM in complex survey data with design weights across a variety of scaling methods. Survey analysts should incorporate the recommendations offered in this paper and consider MLM when they seek to understand the intricate relations that may exist in complex survey data. MLM allow analysts to better understand and describe individual and contextual (cluster) level differences. This has the potential to profoundly influence public health policy.

Competing interests

The author declares that they have no competing interests.

Authors' contributions

Using publicly available data, I worked individually, conducted the literature searches and summaries of previous related work, undertook the statistical analyses, wrote the manuscript, conducted all revisions, and read and approved the final manuscript.

Appendix A

A.1 Data set creation

To recreate the data used in these analyses, one needs to merge two NCHS files and add the state-level (level-2) variable. The level-1 outcome, months without insurance, is available on the NS-CSHCN "interview" dataset as CQ905. The level-1 covariate (family income) is available on the data file NCHS entitles "multiple imputation". NCHS labels the family income variable POVLEVEL_I. For these analyses, I used a single imputation, imputation 1, as suggested by Pedlow, et al.

NCHS makes both datasets available at

Proportion of families in each state falling at or below the 200% poverty line.

State                  Proportion    State             Proportion
Alabama                0.309         Montana           0.277
Alaska                 0.226         North Carolina    0.310
Arkansas               0.362         North Dakota      0.217
Arizona                0.304         Nebraska          0.209
California             0.270         New Hampshire     0.152
Colorado               0.210         New Jersey        0.167
Connecticut            0.180         New Mexico        0.320
District of Columbia   0.288         Nevada            0.229
Delaware               0.207         New York          0.280
Florida                0.266         Ohio              0.235
Georgia                0.264         Oklahoma          0.308
Hawaii                 0.185         Oregon            0.249
Iowa                   0.210         Pennsylvania      0.212
Idaho                  0.278         Rhode Island      0.217
Illinois               0.228         South Carolina    0.300
Indiana                0.233         South Dakota      0.234
Kansas                 0.231         Tennessee         0.288
Kentucky               0.322         Texas             0.325
Louisiana              0.336         Utah              0.250
Massachusetts          0.212         Virginia          0.197
Maryland               0.154         Vermont           0.190
Maine                  0.243         Washington        0.189
Michigan               0.245         Wisconsin         0.187
Minnesota              0.180         West Virginia     0.332
Missouri               0.264         Wyoming           0.209
Mississippi            0.413

A.2 Scaling the Weights

After creating the dataset, one needs to scale the weights. In any analyses, here or otherwise, one should scale the weights before doing

SAS Code

proc sort data = mlm;
by state;
run;

/* Compute, per state: raw (uncorrected) sum of squared weights,
   sum of weights, and cluster sample size */
proc summary data = mlm;
by state;
var weight_i;
output out = intermediate
uss = sumsqw
sum = sumw
n = nj;
run;

/* Merge the cluster totals back and compute the scaled weights */
data mlm;
merge mlm intermediate;
by state;
aw = weight_i/(sumw/nj);     /* Method A: weights sum to the cluster size */
label aw = "Method A";
bw = weight_i/(sumsqw/sumw); /* Method B: weights sum to the effective cluster size */
label bw = "Method B";
run;

/* Drop the intermediate variables */
data mlm; set mlm; drop _freq_ sumsqw sumw nj _type_; run;

To update this code for other datasets, 1) replace "mlm" with the name of the dataset of interest, 2) replace "weight_i" with the level-1 weight from the dataset of interest, and 3) replace "state" with the level-2 cluster variable from the dataset of interest.
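For analysts working outside SAS and Stata, the same scaling can be sketched in Python with pandas. This is a hypothetical translation of the SAS code above, not part of the original analysis; the column names `weight_i` and `state` are carried over from it:

```python
import pandas as pd

def scale_weights(df, weight="weight_i", cluster="state"):
    """Add Method A and Method B scaled weights, mirroring the SAS code."""
    g = df.groupby(cluster)[weight]
    sumw = g.transform("sum")                        # sum of weights per cluster
    sumsqw = g.transform(lambda w: (w ** 2).sum())   # sum of squared weights per cluster
    nj = g.transform("count")                        # cluster sample size
    out = df.copy()
    out["aw"] = df[weight] * nj / sumw        # Method A: sums to cluster size n_j
    out["bw"] = df[weight] * sumw / sumsqw    # Method B: sums to effective cluster size
    return out

# Tiny illustration with made-up weights for two "states"
mlm = pd.DataFrame({"state": ["FL", "FL", "FL", "GA", "GA"],
                    "weight_i": [2.0, 1.0, 1.0, 3.0, 1.0]})
scaled = scale_weights(mlm)
# Within each state, the Method A weights sum to that state's cluster size
print(scaled.groupby("state")["aw"].sum())
```

Checking that the scaled weights sum to the cluster size (method A) or effective cluster size (method B) within each cluster is a quick way to verify any implementation.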

Stata Code

* Per-state totals: sum of squared weights, sum of weights, and cluster size
gen sqw = WEIGHT_I^2
egen sumsqw = sum(sqw), by(STATE)
egen sumw = sum(WEIGHT_I), by(STATE)
egen nj = count(IDNUMXR), by(STATE)

* Method B: weights sum to the effective cluster sample size
gen bw1 = WEIGHT_I*(sumw/sumsqw)

* Method A: weights sum to the cluster sample size
gen aw1 = WEIGHT_I*(nj/sumw)

To update the Stata code for other datasets, 1) read the dataset of interest into memory, 2) replace "WEIGHT_I" with the level-1 weight from the dataset of interest, 3) replace "STATE" with the level-2 cluster variable from the dataset of interest, and, 4) replace "IDNUMXR" with the level-1 id variable from the dataset of interest.

Appendix B: Equations for Scaling the Weights

B.1 Method A

Method A scales the weights so that they sum to the cluster sample size:

  aw_ij = w_ij × (n_j / Σ_i w_ij)

B.2 Method B

Method B scales the weights so that they sum to the effective cluster sample size:

  bw_ij = w_ij × (Σ_i w_ij / Σ_i w_ij²)

For both, w_ij is the unscaled weight for individual i in cluster j and n_j is the number of sample units in cluster j.

Appendix C: Traditional MLM Equations

C.1 Continuous Models

C.1.1 Unconditional model

C.1.2 Level-1 Predictor Only (Fixed Effect)

C.1.3 Level-1 Predictor Only (Fixed and Random Effects)

C.1.4 Level-2 Predictor Only (Fixed Effect)

C.1.5 Level-1 and Level-2 Predictors, No Cross-Level Interaction

C.1.6 Level-1, Level-2, and Cross-Level Interaction
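Because the equations themselves did not survive in this copy, the following is a reconstruction in standard Raudenbush and Bryk notation of the two bracketing continuous models; the intermediate models (C.1.2 through C.1.5) drop or fix terms accordingly:

```latex
% C.1.1 Unconditional model
Y_{ij} = \beta_{0j} + r_{ij}, \qquad \beta_{0j} = \gamma_{00} + u_{0j}

% C.1.6 Level-1, level-2, and cross-level interaction
Y_{ij} = \beta_{0j} + \beta_{1j}(\mathrm{Income})_{ij} + r_{ij}
\beta_{0j} = \gamma_{00} + \gamma_{01}(\mathrm{Poverty})_{j} + u_{0j}
\beta_{1j} = \gamma_{10} + \gamma_{11}(\mathrm{Poverty})_{j} + u_{1j}
```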

C.2 Categorical Models

C.2.1 Unconditional model

C.2.2 Level-1 Predictor Only (Fixed Effect)

C.2.3 Level-1 Predictor Only (Fixed and Random Effects)

C.2.4 Level-2 Predictor Only

C.2.5 Level-1 and Level-2 Predictors, No Cross-Level Interaction

C.2.6 Level-1, Level-2, and Cross-Level Interaction
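Analogously for the categorical series, the level-1 model operates on the logit of the probability of going uninsured. A reconstruction of the fullest model in standard notation (the original equations did not survive in this copy):

```latex
% C.2.6 Level-1, level-2, and cross-level interaction
\mathrm{logit}(\pi_{ij}) = \beta_{0j} + \beta_{1j}(\mathrm{Income})_{ij}
\beta_{0j} = \gamma_{00} + \gamma_{01}(\mathrm{Poverty})_{j} + u_{0j}
\beta_{1j} = \gamma_{10} + \gamma_{11}(\mathrm{Poverty})_{j} + u_{1j}
```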

C.3 Variance Components

In all models,

  σ² = VAR(r_ij) (i.e., the residual variance within states),
  τ00 = VAR(u_0j) (i.e., the variance in the intercepts between states),
  τ11 = VAR(u_1j) (i.e., the variance in slopes between states), and
  τ01 = COV(u_0j, u_1j) (i.e., the covariance between the intercepts and slopes).

Appendix D: Original Weights

This appendix briefly summarizes the methodology used to weight the 2005–2006 NS-CSHCN. Generally, the weighting scheme for the sample involved the steps below. Here, I only describe the base weights. Readers interested in more detail should consult Blumberg et al.

1. Compute base sampling weight.

2. Adjustment for nonresolution of released telephone numbers.

3. Adjustment for incomplete age-eligibility screener.

4. Adjustment for incomplete CSHCN Screener.

5. Adjustment for multiple telephone lines.

6. Raking adjustment of household weights.

7. Raking adjustment of child screener weights.

8. Adjustment for subsampling of CSHCN.

9. Adjustment for nonresponse to the CSHCN interview.

10. Raking adjustment of the nonresponse-adjusted CSHCN interview weights.

The base weight equals the reciprocal of the selection probability of the k-th telephone number:

  W_k = 1/p_k = N_q/n_q

where

  p_k = probability of selecting the k-th telephone number in the estimation area,
  n_q = sample size in quarter q, and
  N_q = total telephone numbers on the sampling frame in quarter q.
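As a quick numeric illustration of the base-weight definition (the quarter sizes below are invented for the example):

```python
def base_weight(n_q, N_q):
    """Reciprocal of the selection probability p_k = n_q / N_q in quarter q."""
    return N_q / n_q

# If 1,000 telephone numbers were sampled from a frame of 250,000 in a
# quarter, each sampled number carries a base weight of 250.0 -- it
# "stands for" 250 numbers on the frame.
print(base_weight(1_000, 250_000))
```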

Following computation of the base weight, several adjustments occurred. Blumberg, et al.,

Acknowledgements

I would like to thank the US Health Resources and Services Administration, Maternal and Child Health Bureau and the US Centers for Disease Control and Prevention, National Center for Health Statistics for making the data publicly available. I would also like to thank Tara J. Carle and Margaret Carle whose unending support and thoughtful comments make my work possible. I am also grateful to Stephen J. Blumberg for his collegial support and his dedication to bridging gaps between advanced methodologies and applications in order to improve children's health. Finally, I would like to thank all four reviewers. Their tireless work and insightful comments vastly improved the original manuscript.

Pre-publication history

The pre-publication history for this paper can be accessed here: