Department of Health Systems Science, M/C 802, College of Nursing, University of Illinois at Chicago, 845 S. Damen Ave., Chicago, IL 60612, USA

University of Cincinnati School of Medicine, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, MLC 7014, Cincinnati, OH 45229, USA

Abstract

Background

Computerized adaptive testing (CAT) is being applied to health outcome measures developed as paper-and-pencil (P&P) instruments. Differences in how respondents answer items administered by CAT vs. P&P can increase error in CAT-estimated measures if not identified and corrected.

Method

Two methods for detecting item-level mode effects are proposed using Bayesian estimation of posterior distributions of item parameters: (1) a modified robust Z statistic, and (2) comparison of the 95% credible interval of the between-mode difference in item difficulty.

Results

Both methods evidenced good to excellent false positive control, with average false positive rates of 1.9% for the robust Z statistic and 2.8% for the credible interval method across simulation conditions.

Conclusions

Whereas false positives were well controlled, particularly for the robust Z statistic, power to detect simulated mode DIF was suboptimal.

Background

Computerized adaptive testing (CAT) is widely used in education and has gained acceptance as a mode for administering health outcomes measures.

Methods used for assessing DIF by mode of administration fall into two general categories: (1) approaches based on classical test theory (CTT), such as comparisons of item-level statistics across modes; and (2) approaches based on item response theory (IRT).

Though these studies often employed multiple methods of assessing item comparability, systematic comparisons across methods were not conducted. Nevertheless, there is reason to believe that some methods may be more sensitive to mode effects than others.

Several methods have been developed that attempt to overcome the limitations of classical procedures for detecting mode effects. Most of these methods are based on item response theory and involve comparisons of item parameters after matching of respondents according to trait level. Achieving accurate identification of DIF based on an IRT framework requires precise estimates of item parameters and person measures and the use of an appropriate measurement model.

Given the uncertainty in trait and item parameters, some investigators have recommended methods to identify DIF based on Bayesian probability theory. Bayesian approaches use probability distributions to model uncertainty in model parameters. These probability distributions represent prior beliefs or assumptions concerning the nature of the data and the level of uncertainty regarding various parameters. For instance, an investigator may specify that item discrimination parameters adhere to a lognormal distribution with log mean of 0 and variance of 0.5. The prior (particularly the prior variance) reflects uncertainty about the values before observing the data. Conversely, the posterior distribution combines the prior with the observed data, reflecting the updated, and typically reduced, uncertainty about the parameters once the data have been taken into account.
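To make the prior-to-posterior updating concrete, the following is a minimal conjugate-normal sketch in Python (a hypothetical illustration only; the study itself fit IRT models via MCMC in WinBUGS):

```python
def normal_posterior(prior_mean, prior_var, data_mean, data_var):
    """Precision-weighted conjugate update: the posterior combines prior
    belief with the observed data, and its variance is always smaller
    than the prior variance."""
    precision = 1.0 / prior_var + 1.0 / data_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + data_mean / data_var)
    return post_mean, post_var

# Vague prior centered at 0; the data suggest the parameter is near 1.0.
post_mean, post_var = normal_posterior(0.0, 1.0, 1.0, 0.25)
```

Here the posterior mean (0.8) is pulled toward the data, and the posterior variance (0.2) is smaller than either the prior variance or the data variance, illustrating the reduction in uncertainty described above.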

Two general methods of DIF detection employing Bayesian methods have been proposed. The first approach is the use of Bayesian procedures to directly estimate DIF magnitude such as the Mantel-Haenszel (MH) test.

A second approach involves estimation of the posterior distribution of model parameters, which can be used in subsequent DIF analyses. For example, posterior distributions of an item's difficulty parameter in each group (b_{iG1} and b_{iG2} for groups 1 and 2, respectively) are computed, and from these the posterior probability that the difference (b_{iF} − b_{iR}) > 0 (focal minus reference group) can be used as an indicator of DIF. This procedure provided more accurate results compared to standard MH DIF analysis, especially when items were very easy to endorse.
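As an illustration of this second approach, the sketch below (hypothetical Python, not the authors' code) estimates the posterior probability that the focal-minus-reference difficulty difference exceeds zero from paired MCMC draws:

```python
import random

def prob_positive_dif(draws_focal, draws_ref):
    """Monte Carlo estimate of P(b_focal - b_ref > 0) from posterior draws."""
    diffs = [f - r for f, r in zip(draws_focal, draws_ref)]
    return sum(d > 0 for d in diffs) / len(diffs)

# Toy posterior draws: focal difficulty roughly 0.5 logits above reference.
rng = random.Random(1)
focal = [rng.gauss(0.5, 0.15) for _ in range(2000)]
ref = [rng.gauss(0.0, 0.15) for _ in range(2000)]
p_dif = prob_positive_dif(focal, ref)  # close to 1, suggesting DIF
```

A probability near 1 (or near 0 for a shift in the other direction) indicates that nearly all of the posterior mass for the difference lies on one side of zero.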

Despite these promising results, none of the studies employing posterior distributions of item parameter estimates assessed DIF in CAT-administered items. Moreover, to our knowledge there has been no application of Bayesian methods to the assessment of DIF between non-CAT and CAT-administered assessments. Standard methods of assessing DIF can be problematic when comparing CAT- and P&P-administered data because of the confounding of CAT item selection, sample differences in trait level, and actual mode effects.

Rationale of the Study

It is common practice to employ paper-based forms when validating and scaling an item bank for use in CAT. Thus, it is important to determine that the resulting item parameters are not influenced by mode DIF. As suggested earlier, current methods of assessing DIF may not be appropriate when comparing adaptively and non-adaptively administered items. One solution would be to administer the entire item bank via computer and conventional modes of administration and then employ standard methods of DIF assessment. Whereas this approach could be used with small item banks, it would be quite burdensome to respondents and likely require collecting data apart from standard assessment practice with very large item banks.

Other researchers have already faced this issue. For example, to reduce respondent burden, the Patient-Reported Outcomes Measurement Information System (PROMIS) administered the entire set of initially developed PROMIS items to only a small subset of the total PROMIS calibration sample. This has limited the PROMIS collective's ability to address some of the issues we raise here. Thus, while the less technical approach is possible, the common need to reduce respondent burden will generally limit its application, indicating the need for alternatives. One such alternative, which we present in this paper, is a set of procedures for detecting mode DIF between CAT- and non-CAT-administered items, enabling assessment of DIF using data collected as part of standard assessment practice.

The purpose of the present study was to develop and evaluate two approaches to assessing item-level mode effects employing a Bayesian framework. In the following sections we outline this framework and describe the design and results from a preliminary Monte Carlo simulation study. The procedures are described and evaluated with respect to false-positive (i.e., DIF is detected when not simulated) and true-positive (i.e., DIF is detected when simulated) detection rates under several study conditions. We then examine factors associated with true and false DIF identification.

Research Questions:

1. How well does each method detect item-level mode effects as indicated by ROC analysis, true positive and false positive rates? In the present study, true positives are defined as identification of items as exhibiting mode DIF when mode DIF is simulated, which is also referred to as correct DIF detection. Conversely, false positives refer to flagging of items as exhibiting DIF when DIF was not simulated, which is also referred to as incorrect DIF detection.

2. What factors influence correct (true positive) and incorrect (false positive) detection of item-level mode DIF using each procedure?

Methods

The methods employed in this study will be presented in three main sections. First, we describe the development and underlying assumptions of two Bayesian methods for detecting item-level mode effects. Second, we describe the simulation study, including its design and data generation procedures. The third section outlines the analysis of the simulated data.

A Bayesian procedure for detecting item-level mode effects

In the proposed model, analysis of mode effects involved a three-step process:

Step 1. Estimate person trait measures (θ_i) using the combined CAT and P&P item response data.

Step 2. Using the θ_i obtained in Step 1, estimate the posterior distributions of mode-specific item parameters for subsequent comparison in Step 3.

Step 3. Estimate DIF for each item common across modes by assessing the difference in the posterior distributions of the item parameters (i.e., between β_j^CAT and β_j^P&P). In the present study we examined two approaches to making this comparison. The first approach involved calculating a variation of the robust Z statistic:

robust Z_j = (d_j − median(d)) / (0.74 × IQR(d)),

where d_j = β_j^CAT − β_j^P&P, median(d) and IQR(d) are the median and interquartile range of the differences across items, and 0.74 × IQR approximates the standard deviation of the differences under normality.
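The robust Z logic can be sketched in a few lines of Python (a hypothetical illustration; the study's analyses were conducted in R and WinBUGS). The example data are invented: one item's between-mode difficulty difference is shifted by roughly 0.63 logits.

```python
import statistics

def robust_z(diffs):
    """Robust Z for each item's CAT-minus-P&P difficulty difference.
    0.74 * IQR approximates the SD of the differences under normality."""
    med = statistics.median(diffs)
    q1, _, q3 = statistics.quantiles(diffs, n=4)
    scale = 0.74 * (q3 - q1)
    return [(d - med) / scale for d in diffs]

# Seven near-zero differences plus one large ("C"-class) shift.
diffs = [0.02, -0.05, 0.01, 0.63, -0.03, 0.04, 0.00, -0.02]
z = robust_z(diffs)
flagged = [j for j, zj in enumerate(z) if abs(zj) > 1.96]  # item 3 only
```

Because the median and IQR are resistant to outliers, a handful of DIF items inflates neither the center nor the scale of the reference distribution.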

The second approach involved constructing the 95% credible interval (CrI) of the difference β_j^CAT − β_j^P&P; an interval excluding zero indicates mode DIF. In order to obtain a single value reflecting the level of mode DIF, we also computed the minimum distance of the bounds of the credible interval from zero.
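A companion sketch for the credible-interval approach (again a hypothetical Python illustration) flags DIF when the central 95% interval of the posterior difference excludes zero:

```python
import random

def credible_interval_flag(diff_draws, level=0.95):
    """Central credible interval for beta_CAT - beta_PP from posterior draws;
    DIF is flagged when the interval excludes zero."""
    draws = sorted(diff_draws)
    n = len(draws)
    tail = (1.0 - level) / 2.0
    lo = draws[int(tail * n)]
    hi = draws[int((1.0 - tail) * n) - 1]
    return (lo, hi), not (lo <= 0.0 <= hi)

rng = random.Random(7)
dif_draws = [rng.gauss(0.5, 0.1) for _ in range(4000)]   # simulated DIF item
null_draws = [rng.gauss(0.0, 0.1) for _ in range(4000)]  # non-DIF item
(lo, hi), dif_flag = credible_interval_flag(dif_draws)
(_, _), null_flag = credible_interval_flag(null_draws)
```

For the shifted item the interval lies entirely above zero, so `dif_flag` is true; for the null item the interval straddles zero and no DIF is flagged.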

Person and item parameters were assigned the following priors and likelihood:

θ_i ~ N(0, 1)

α_j ~ Lognormal(0, 0.5)

β_j ~ N(0, 2)

y_ij ~ Bernoulli(p_ij)

where the first value in parentheses for the priors of θ_i, α_j, and β_j is the prior mean and the second is the prior variance. These priors may be regarded as “semi-informative.” They are similar to priors employed in earlier IRT studies, with the exception that we selected a lognormal rather than a truncated normal prior for the discrimination parameters.

The Markov chain Monte Carlo estimation consisted of three parallel chains each with a separate and randomly generated set of starting values for model parameters. For each chain, the first 1,000 MCMC iterations were discarded (burn-in phase), followed by 500 iterations per chain retained for subsequent analysis. The total number of iterations and the length of the burn-in phase were chosen on the basis of preliminary examination of trace plots of item and person parameters which revealed good convergence of the three chains of parameter estimates (analysis results are available upon request). Using additional iterations or a longer burn-in did not change DIF analysis results.
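Convergence checks of this kind can also be quantified with the Gelman–Rubin diagnostic. The sketch below is a generic Python implementation (a hypothetical illustration; the study's checks relied on trace plots in WinBUGS), using three chains of 500 retained draws as in the design described above:

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat); values near 1 indicate
    that parallel chains have mixed into the same distribution."""
    m = len(chains)        # number of chains
    n = len(chains[0])     # retained draws per chain
    means = [statistics.fmean(c) for c in chains]
    grand_mean = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)   # between-chain
    w = statistics.fmean([statistics.variance(c) for c in chains])  # within-chain
    var_hat = (n - 1) / n * w + b / n   # pooled estimate of posterior variance
    return (var_hat / w) ** 0.5

rng = random.Random(42)
# Three well-mixed chains: R-hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(3)]
# Two chains stuck in different regions: R-hat far above 1.
stuck = [[rng.gauss(0.0, 0.1) for _ in range(500)],
         [rng.gauss(2.0, 0.1) for _ in range(500)]]
```

A common rule of thumb treats R-hat below about 1.1 as acceptable mixing.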

Simulation study

A preliminary Monte Carlo simulation study was performed to assess the accuracy of the proposed methods for detecting item-level mode effects. Two interests guided the design and implementation of this simulation. First, in this preliminary study we restricted our focus to uniform (i.e., difficulty) DIF rather than non-uniform (i.e., discrimination) DIF. Second, we focused on instruments fitting a one-parameter (1PL) or two-parameter (2PL) logistic IRT model:

P(y_ij = 1 | θ_i) = exp[α_j(θ_i − β_j)] / (1 + exp[α_j(θ_i − β_j)]),

where y_ij is the response of respondent i to item j, α_j is the discrimination parameter and β_j is the difficulty parameter for item j, and θ_i is respondent i's trait level. In the 1PL model, the α_j are constrained to be equal across items.
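The model above can be expressed in a few lines of Python (a minimal sketch; parameter names follow the equation):

```python
import math
import random

def p_endorse(theta, a, b):
    """2PL probability of endorsement: exp[a(theta - b)] / (1 + exp[a(theta - b)]).
    Setting every a to 1.0 yields the 1PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_response(theta, a, b, rng):
    """Draw a single 0/1 item response from the model."""
    return 1 if rng.random() < p_endorse(theta, a, b) else 0
```

At θ equal to the item difficulty the endorsement probability is exactly 0.5, and the probability increases monotonically with θ.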

Study Design

In this study, the following factors were investigated: (1) data-generating model (one-parameter [1PL] vs. two-parameter [2PL] logistic IRT model); (2) DIF magnitude (|β^CAT − β^P&P|) of 0.42 vs. 0.63 logits, corresponding to “B” and “C” class DIF, respectively, according to Educational Testing Service criteria; (3) percentage of DIF items in the bank (10% vs. 30%); and (4) difference in mean trait level between administration modes (0 vs. 1 logit).

Data Generation

Data were generated for the present study in three steps: (1) generation of validation (paper-and-pencil) data, (2) generation of CAT item response data, and (3) CAT simulation, which produced item response datasets containing only those items selected by the CAT. Each of these steps is outlined in the following sections.

Generation of the Validation (Paper-and-Pencil) Item Parameters and Response Data

For each IRT model, a set of item parameters and corresponding item response datasets were generated. Both item banks consisted of 100 items. In the 1PL model, discrimination (α_j) parameters for all items were set to 1.0; in the 2PL item bank, α_j parameters were randomly generated from a lognormal distribution with log mean = 0 and SD = 0.5, with values restricted to a range of 0.5 to 2.5. Discrimination parameters were limited to this range because items with very low discrimination (i.e., less than 0.5) are rarely used in item banks, whereas highly discriminating items (i.e., true discrimination parameters greater than 2.5) tend to yield poorly estimated (i.e., positively biased) parameters. Difficulty (β_j) parameters were generated from a uniform distribution ranging from −3.0 to 3.0 logits, in increments of 0.25 logits. Person measures (θ_i) for 500 simulees were generated from an N(0, 1) distribution.
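This generation recipe can be sketched as follows (a hypothetical Python illustration; the original data generation was performed in R):

```python
import random

def generate_bank(n_items=100, model="2PL", rng=None):
    """Item bank per the recipe above: a ~ lognormal(0, 0.5) resampled until it
    falls in [0.5, 2.5] (all 1.0 under the 1PL); b drawn from a uniform grid
    spanning -3.0 to 3.0 logits in 0.25-logit steps."""
    rng = rng or random.Random(0)
    grid = [round(-3.0 + 0.25 * k, 2) for k in range(25)]
    items = []
    for _ in range(n_items):
        if model == "1PL":
            a = 1.0
        else:
            a = rng.lognormvariate(0.0, 0.5)
            while not (0.5 <= a <= 2.5):   # resample values outside the range
                a = rng.lognormvariate(0.0, 0.5)
        items.append((a, rng.choice(grid)))
    return items
```

Each item is a (discrimination, difficulty) pair; the resampling step is one simple way to impose the 0.5–2.5 restriction described above.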

Parameter Estimation

The generated item-response data were then used to estimate IRT item parameters (see Appendix A). Correlations between true and estimated β_j parameters were 0.99 and 1.00, and root mean squared error (RMSE) values were 0.11 and 0.15, for the 1PL- and 2PL-generated datasets, respectively. For the 2PL data, the correlation between true and estimated α_j parameters was .90 and RMSE was 0.14. As previously observed, estimation error was greatest for extreme values of β_j. The estimated item parameters were used in subsequent CAT simulations. RMSEs and correlations between item parameters and their estimates were consistent with parameter recovery results presented elsewhere.

**Appendix A.** Generating and Estimated Item Parameters for the Two Simulated Item Banks (Validation Data) Based on the One- and Two-Parameter IRT Models, Respectively. 1PL = one-parameter item response model; 2PL = two-parameter item response model.

Click here for file

Generation of CAT Item Response Data

Prior to performing CAT simulations, response data for all 100 items in the simulated item banks described above were generated for a total of 3,000 simulees in each iteration. This sample size permitted examination of the effect of CAT item usage on DIF detection rates. Employing the study variables described above, a total of 160 item-response datasets were created and used for CAT simulation. For each dataset, person measures were generated from an N(μ_CAT, 1.0) distribution, where μ_CAT = 0.0 or 1.0. Non-DIF item response data were generated using the estimated P&P item parameters. For DIF items, the simulated DIF magnitude was added to the β_j parameter. The α_j parameters used to generate the CAT item responses were the same as those used to generate the initial P&P data.
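The uniform-DIF injection described here amounts to shifting the difficulty of the designated items while leaving discriminations untouched (a hypothetical sketch):

```python
def add_mode_dif(items, dif_items, dif_size):
    """Return CAT-generating parameters: difficulty shifted by dif_size logits
    for items in dif_items; discriminations unchanged (uniform DIF only)."""
    return [(a, b + dif_size) if j in dif_items else (a, b)
            for j, (a, b) in enumerate(items)]

pp_items = [(1.0, 0.0), (1.2, 1.0), (0.8, -0.5)]
cat_items = add_mode_dif(pp_items, {1}, 0.42)  # "B"-class DIF on item 1 only
```

Only item 1's difficulty moves (1.0 to 1.42 logits); all other parameters carry over from the P&P calibration.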

**Sample File of Simulee Measures Used to Generate CAT Item Responses.** A 1 column x 3000 row of generated θ values used to generate item responses for subsequent use in CAT simulation.

Click here for file

**Sample P&P and CAT Item Parameters.** This file contains 4 columns of item parameters for 100 items. Columns 1 and 2 are the estimated discrimination and difficulty parameters for the P&P version of the instrument, respectively. Columns 3 and 4 are the discrimination and difficulty parameters used to generate the item responses for subsequent use in the CAT simulation.

Click here for file

**Sample Item Response File Used in CAT Simulation.** A 100 column by 3000 row spreadsheet containing generated item responses according to the CAT person and item parameters described above. A 1 indicates a "no" response and 2 a "yes" response.

Click here for file

CAT Simulation

Each generated dataset was then used in a series of CAT simulations. In order to ensure comparability across conditions, a fixed-length CAT of 30 administered items per simulee was conducted, with items selected by maximum information. This stopping rule is similar to that used in a previous investigation of CAT and DIF.
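A single maximum-information selection step of such a CAT can be sketched as follows (a hypothetical Python illustration; the function and variable names are ours):

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta, items, administered):
    """Index of the not-yet-administered item with maximum information at theta."""
    candidates = [j for j in range(len(items)) if j not in administered]
    return max(candidates, key=lambda j: item_information(theta, *items[j]))

items = [(1.0, -1.0), (1.0, 0.0), (2.0, 0.0)]
first = next_item(0.0, items, set())      # on-target, highly discriminating item
second = next_item(0.0, items, {first})   # next best remaining item
```

Repeating this step 30 times per simulee, updating θ after each response, yields the fixed-length CAT used in the simulations; note how on-target, highly discriminating items are chosen first, which is why CAT item usage varies so widely across the bank.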

Analysis

Prior to addressing the main research questions, descriptive analyses were performed for both the CAT simulation results and the robust Z and 95% credible interval indices.

Detection of Mode Effects (Research Question 1)

The overall performance of the robust Z and 95% credible interval procedures was evaluated in terms of true positive (power) and false positive (Type I error) rates at conventional cutoff values.

Logistic regression and ROC analyses were also performed to examine the predictive accuracy of each statistic without reference to specific cutoff values. Because both the robust Z and credible-interval indices are continuous, the area under the ROC curve (AUC) summarizes each statistic's ability to discriminate DIF from non-DIF items across all possible cutoffs.
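The AUC criterion has a direct interpretation: the probability that a randomly chosen DIF item receives a more extreme statistic than a randomly chosen non-DIF item. A minimal sketch (hypothetical Python, pairwise-comparison form):

```python
def auc(scores_dif, scores_nondif):
    """Empirical AUC: P(statistic for a DIF item > statistic for a non-DIF item),
    with ties counting one half."""
    wins = 0.0
    for sd in scores_dif:
        for sn in scores_nondif:
            if sd > sn:
                wins += 1.0
            elif sd == sn:
                wins += 0.5
    return wins / (len(scores_dif) * len(scores_nondif))
```

An AUC of 1.0 indicates perfect separation of DIF from non-DIF items, while 0.5 indicates chance-level discrimination.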

Factors Related to True and False Positive Mode Effects (Research Question 2)

A series of multilevel random-intercept logistic regression analyses were performed at both univariate (single predictor) and multivariate levels. At the multivariate level, four models were developed, one for each combination of statistical test (robust Z vs. 95% credible interval) and outcome (true positive vs. false positive detection).

Relationship of Item Difficulty to Power and Type I Error

In order to provide a clearer picture of the relationship of item difficulty with power and Type I error, a plot of mean power and Type I error by P&P item difficulty was created. This plot was based on a series of linear regression analyses predicting mean power and Type I error for both procedures as a function of P&P item difficulty.

Software

Generation of item and person parameters and item response data was performed in the R statistical package. Bayesian model estimation was performed in WinBUGS (see Appendix B).

**Appendix B.** WinBUGS Code.

Click here for file

Results

Descriptive analyses

CAT Simulation

The CAT simulations are summarized in the table below, which reports the correlation between CAT-based and full-scale θ estimates and the mean standard error of θ^CAT for each condition.

**CAT to Full-Scale θ Correlations and Mean Standard Errors of θ^CAT by Condition**

| Condition | r (1PL, DIF 0.42) | r (1PL, DIF 0.63) | r (2PL, DIF 0.42) | r (2PL, DIF 0.63) | SE (1PL, DIF 0.42) | SE (1PL, DIF 0.63) | SE (2PL, DIF 0.42) | SE (2PL, DIF 0.63) |
|---|---|---|---|---|---|---|---|---|
| Diff. mean θ = 0, DIF% = 10 | 0.97 | 0.97 | 0.97 | 0.97 | 0.26 | 0.26 | 0.23 | 0.24 |
| Diff. mean θ = 0, DIF% = 30 | 0.97 | 0.96 | 0.97 | 0.97 | 0.26 | 0.26 | 0.24 | 0.24 |
| Diff. mean θ = 1, DIF% = 10 | 0.97 | 0.97 | 0.97 | 0.97 | 0.26 | 0.26 | 0.24 | 0.24 |
| Diff. mean θ = 1, DIF% = 30 | 0.97 | 0.96 | 0.97 | 0.97 | 0.26 | 0.26 | 0.24 | 0.24 |
| Average | 0.97 | 0.97 | 0.97 | 0.97 | 0.26 | 0.26 | 0.24 | 0.24 |

r = correlation between CAT-based and full-scale θ estimates; SE = mean standard error of θ^CAT. 1PL = one-parameter item response model; 2PL = two-parameter item response model; CAT = computerized adaptive testing; DIF = differential item functioning; IRT = item response theory.

With respect to the number of times a given item was administered by CAT (CAT item usage), the median number of item administrations across items and simulation conditions was 553, with considerable variability across items.

Robust Z and 95% Credible Interval Indices

Among non-DIF items, the 2.5th and 97.5th percentiles of the robust Z distribution corresponded closely to the nominal cutoff values.

Detection of mode effects (Research question 1)

Correct classification, sensitivity, and specificity were examined using the expected cutoff values for each statistic. Overall classification accuracy was strongly related to true DIF status, χ²(1) = 545.06, p < .001.

The table below presents true positive (TP%) and false positive (FP%) rates for each procedure by study condition.

| IRT Model | DIF Size | DIF % | Diff. Mean θ | Robust Z TP% | Robust Z FP% | Bayes 95% CrI TP% | Bayes 95% CrI FP% |
|---|---|---|---|---|---|---|---|
| 1PL | 0.42 | 10 | 0 | 54.90 | 0.91 | 60.35 | 1.36 |
| 1PL | 0.42 | 10 | 1 | 70.00 | 2.36 | 66.35 | 3.88 |
| 1PL | 0.42 | 30 | 0 | 69.66 | 0.29 | 70.67 | 0.44 |
| 1PL | 0.42 | 30 | 1 | 71.09 | 1.05 | 71.91 | 2.58 |
| 1PL | 0.63 | 10 | 0 | 78.00 | 0.56 | 83.00 | 0.89 |
| 1PL | 0.63 | 10 | 1 | 75.00 | 3.56 | 81.00 | 5.22 |
| 1PL | 0.63 | 30 | 0 | 82.67 | 2.57 | 87.00 | 3.00 |
| 1PL | 0.63 | 30 | 1 | 76.33 | 2.14 | 79.33 | 3.29 |
| 2PL | 0.42 | 10 | 0 | 60.82 | 0.06 | 60.42 | 0.06 |
| 2PL | 0.42 | 10 | 1 | 58.24 | 2.09 | 55.10 | 3.89 |
| 2PL | 0.42 | 30 | 0 | 62.71 | 0.14 | 66.09 | 0.28 |
| 2PL | 0.42 | 30 | 1 | 66.00 | 4.86 | 66.33 | 6.86 |
| 2PL | 0.63 | 10 | 0 | 72.00 | 0.33 | 77.00 | 0.55 |
| 2PL | 0.63 | 10 | 1 | 70.00 | 2.39 | 77.00 | 3.56 |
| 2PL | 0.63 | 30 | 0 | 72.33 | 3.14 | 77.67 | 3.14 |
| 2PL | 0.63 | 30 | 1 | 66.33 | 3.33 | 69.00 | 5.00 |
| Average | | | | 69.13 | 1.86 | 71.76 | 2.75 |

1PL = one-parameter item response model; 2PL = two-parameter item response model; CAT = computerized adaptive testing; DIF = differential item functioning; DIF% = percentage of items simulated with DIF; FP% = percentage of false positive DIF results; IRT = item response theory; TP% = percentage of true positive DIF results.

The present findings revealed average power (true positive) rates of 69.1% and 71.8% for the robust Z and 95% credible interval procedures, respectively. For the robust Z statistic, the lowest power occurred in the 1PL, DIF size = 0.42, 10% DIF, mean θ^CAT − θ^P&P = 0 condition (54.9%), whereas for the credible interval method the lowest power occurred in the corresponding 2PL condition with mean θ^CAT − θ^P&P = 1.0 (55.1%). False positive rates averaged 1.9% and 2.8% for the two procedures, respectively.

Factors related to true and false positive mode effects (Research question 2)

**Predictors of True Positive (Correct) DIF Detection**

| Model/Predictor | Univariate OR | Univariate AUC | Multivariate OR | Multivariate 95% CI |
|---|---|---|---|---|
| **Robust Z (Model AUC = 0.95)** | | | | |
| Size of DIF | 1.49** | 0.55 | 3.42** | (2.58, 4.54) |
| Percentage of DIF | 1.17 | 0.52 | 1.20 | (0.89, 1.61) |
| 2PL IRT Model^b | 0.76** | 0.53 | 0.47** | (0.35, 0.64) |
| Diff. Mean θ | 0.99 | 0.50 | 0.66** | (0.50, 0.87) |
| CAT Item Usage^c | 21133.86** | 0.94 | 3111.68** | (1417.85, 6829.03) |
| Absolute Item Difficulty^d | 0.03** | 0.85 | 0.10** | (0.07, 0.14) |
| Item Discrimination^d | 3.62** | 0.60 | 3.12** | (2.34, 4.17) |
| **Bayesian 95% Credible Interval (Model AUC = 0.93)** | | | | |
| Size of DIF | 1.73** | 0.56 | 3.52** | (2.73, 4.53) |
| Percentage of DIF | 1.17 | 0.52 | 1.16 | (0.89, 1.50) |
| 2PL IRT Model^b | 0.74** | 0.53 | 0.50** | (0.39, 0.65) |
| Diff. Mean θ | 0.91 | 0.49 | 0.60** | (0.47, 0.77) |
| CAT Item Usage^c | 2468.29** | 0.92 | 505.64** | (264.29, 967.37) |
| Absolute Item Difficulty^d | 0.04** | 0.83 | 0.15** | (0.11, 0.20) |
| Item Discrimination^d | 2.86** | 0.58 | 1.99** | (1.54, 2.56) |

**Predictors of False Positive (Incorrect) DIF Detection**

| Model/Predictor | Univariate OR | Univariate AUC | Multivariate OR | Multivariate 95% CI |
|---|---|---|---|---|
| **Robust Z (Model AUC = 0.77)** | | | | |
| Size of DIF | 1.93** | 0.55 | 2.01** | (1.36, 2.97) |
| Percentage of DIF | 1.44 | 0.55 | 1.48 | (0.99, 2.20) |
| 2PL IRT Model | 1.14 | 0.52 | 0.96 | (0.63, 1.46) |
| Diff. Mean θ | 3.31** | 0.59 | 3.95** | (2.56, 6.08) |
| CAT Item Usage | 1.91** | 0.54 | 4.17** | (3.11, 5.60) |
| Item Difficulty | 0.28** | 0.62 | 0.12** | (0.08, 0.19) |
| Item Discrimination | 1.64** | 0.56 | 1.23 | (0.96, 1.58) |
| **Bayesian 95% Credible Interval (Model AUC = 0.74)** | | | | |
| Size of DIF | 1.62* | 0.55 | 1.61** | (1.20, 2.15) |
| Percentage of DIF | 1.33 | 0.53 | 1.30 | (0.97, 1.75) |
| 2PL IRT Model | 1.14 | 0.52 | 1.08 | (0.80, 1.47) |
| Diff. Mean θ | 1.28E+08** | 0.62 | 4.01** | (2.90, 5.55) |
| CAT Item Usage | 0.96 | 0.44 | 2.36** | (1.82, 3.06) |
| Item Difficulty | 0.30** | 0.65 | 0.16** | (0.11, 0.22) |
| Item Discrimination | 1.19 | 0.51 | 1.02 | (0.82, 1.26) |

For both procedures, CAT item usage and absolute item difficulty were the strongest predictors of true positive detection, whereas the difference in mean θ between modes and item difficulty were most strongly associated with false positive results.

Relationship of Item Difficulty to Power and Type I Error

In order to better understand the performance of the two procedures across the difficulty continuum, mean predicted true and false positive rates were plotted against P&P item difficulty.

**Mean Predicted True and False Positive Rates by P&P Item Difficulty and Analysis Procedure.** Solid line – robust Z.

Discussion

Bayesian methods have been widely used in IRT and have received considerable attention in DIF analysis. However, their application to detecting DIF between CAT and conventional modes of administration has received relatively little attention. Thus, this study sought to develop and test methods for assessing CAT vs. P&P mode DIF employing a Bayesian framework. The present study revealed that the robust Z and 95% credible interval procedures controlled false positives well but offered only moderate power to detect simulated mode DIF.

CAT item usage was found to be the single best predictor of detecting simulated mode effects, followed by absolute item difficulty. In fact, the multivariate model performed only slightly better than when CAT item usage was the only predictor. For items with DIF, those items administered most often by CAT were more likely to be detected than items administered less frequently. This is not surprising given the wide variability in the frequency that various items were administered during the CAT simulations. The frequency an item is administered by CAT could therefore form the basis of power analysis conducted prior to DIF analysis for a given item. This would be particularly useful in the context of ongoing data collection, potentially improving power and minimizing analysis time.

There are two likely explanations for the observed relationship between absolute item difficulty and power in DIF detection. First, items with difficulty parameters closest to the mean theta values will be more likely to be administered by CAT. Since measures with mean trait levels of 0 or 1 logit were simulated, items in this range of difficulty would be most frequently administered. Second, items towards the extremes of the measurement continuum are less precisely estimated (i.e., have larger standard errors). Thus, power to detect DIF in items that are very easy or difficult to endorse is lower than that for items of average difficulty. This would likely explain why absolute item difficulty was a significant predictor of power even after controlling for CAT item usage. These findings may in part reflect the use of a fixed-length CAT during the simulation. In the case of a variable-length CAT, more items would likely be administered to simulees at the extremes of the trait continuum in order to achieve sufficient measurement precision, including items that are very easy or difficult to endorse. Conversely, we would expect fewer items to be administered to simulees who are in the center of the trait distribution under a variable-length CAT.

With respect to incorrect DIF decisions, easier-to-endorse items were more likely to be erroneously flagged than more difficult items. This finding is in contrast to that of Wang, Bradlow, Wainer, and Muller.

As might be expected, DIF magnitude (i.e., the difference between CAT and P&P item parameters for a given item) was significantly and positively related to power. The same was not true for the percentage of items with DIF in the item bank. The latter result suggests that the power to detect a single DIF item is not significantly affected by the presence of other DIF items in the bank which may "contaminate" the person measures.

The results of this study revealed a positive relationship between item discrimination and power to identify items with mode DIF. One possible explanation for this finding is that CAT using a 2PL model and maximum information item selection will tend to select items with higher discrimination parameters for administration. In other words, DIF in highly discriminating items may be easier to detect because these items are administered more frequently in CAT. Yet the results of the multivariate logistic regression analysis failed to support this conclusion: item discrimination remained statistically significant even when controlling for CAT item usage. High item discrimination therefore appears to enhance power in mode-effect detection in its own right. This finding is corroborated by previous DIF research examining the relationship of item discrimination to power using several analytic procedures.

For both the robust Z and credible interval procedures, the predictors of false positive results followed a similar pattern.

In addition to the effect of item parameters, false positive DIF results were significantly associated with DIF size and mean difference in trait level between CAT and P&P administration modes. These effects likely reflect problems with the trait estimate used as the matching variable in the DIF analysis. Items with large DIF effects and mean differences in trait level between groups limit the effectiveness of matching, as has been observed in previous DIF studies

The strength of Monte Carlo simulation lies in its ability to systematically vary several factors thought to affect identification of simulated effects. In this study, several factors were directly examined with respect to detection of mode-of-administration DIF, including DIF size, percentage of DIF items, mean difference in trait level between modes, item response model, and analytic procedure. We also examined the effects of variables not part of the research design, including CAT item usage, item discrimination, and item difficulty parameters. A particular strength of the study is the examination of CAT item usage rather than sample size as a factor related to identification of DIF.

Nevertheless, our study has several limitations. For example, several other factors were not considered in the simulation. Of particular importance is the degree to which the mean, variance, and shape of the parameter distributions are consistent with the priors specified in the Bayesian estimation model. Though differences in mean trait levels were examined, deviations from prior assumptions concerning parameter variances or distribution types were not. For instance, further studies are needed to examine the potential effect of skewed theta and item parameter distributions on the performance of these DIF procedures.

Also, we intentionally did not address non-uniform DIF, which limits our conclusions to uniform DIF. Importantly, though, no theoretical reason precludes conducting similar analyses of non-uniform DIF. Given the nascent status of research in this field, however, we chose to focus on a single type of DIF. Our future research will address non-uniform DIF in one study and both types simultaneously in a subsequent study. By addressing each in a stepwise fashion, we hope to avoid the spurious conclusions that could arise from addressing all types simultaneously in an initial study; for example, we did not want the presence of non-uniform DIF to influence the detection of uniform DIF using the methods developed here. Finally, we used only simulated data; future studies employing these procedures with real data are needed.

Conclusions

This study yielded mixed results concerning the methods for assessing mode effects. Whereas Type I error was well controlled, power to detect DIF was suboptimal, though the present findings were consistent with those reported in similar studies.

Abbreviation

1PL: One-parameter logistic item response theory model; 2PL: Two-parameter logistic item response theory model; AUC: Area under the curve; CAT: Computerized adaptive testing; CrI: Credible interval; DIF: Differential item functioning; EB: Empirical Bayes; IQR: Interquartile range; IRT: Item response theory; MH: Mantel-Haenszel test; P&P: Paper-and-pencil (administration); RZ: Robust z test.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BR conceived of the study, developed the procedures tested in the study, wrote the code for the simulation and performed the statistical analyses. BR and AC participated in the writing of the manuscript. BR wrote the background and methods sections, BR and AC wrote the results, discussion and conclusions sections. Both BR and AC reviewed drafts of the manuscript and gave final approval of the submitted version.

Acknowledgements

The development of this paper was supported by the National Institute on Drug Abuse (NIDA) under grant R21 DA 025371. NIDA had no direct role in the design of the study, analyses or interpretation of the study findings. The authors would like to thank Leanne Welch and Tim Feeney for their help proofreading the manuscript. Finally, the authors wish to thank the Research Open Access Publishing (ROAAP) Fund of the University of Illinois at Chicago for financial support toward the open access publishing fee for this article.

Pre-publication history

The pre-publication history for this paper can be accessed here: