Abstract
Background
Previous research on educational data has demonstrated that Rasch fit statistics (mean squares and t-statistics) are highly susceptible to sample size variation for dichotomously scored data, although little is known about this relationship for polytomous data. These statistics help inform researchers about how well items fit a unidimensional latent trait, and are an important adjunct to modern psychometrics. Given the increasing use of Rasch models in health research, the purpose of this study was therefore to explore the relationship between fit statistics and sample size for polytomous data.
Methods
Data were collated from a heterogeneous sample of cancer patients (n = 4072) who had completed both the Patient Health Questionnaire-9 and the Hospital Anxiety and Depression Scale. Ten samples were drawn with replacement for each of eight sample sizes (n = 25 to n = 3200). The Rating Scale and Partial Credit Models were applied and the mean square and t fit statistics (infit/outfit) derived for each model.
Results
The results demonstrated that t-statistics were highly sensitive to sample size, whereas mean square statistics remained relatively stable for polytomous data.
Conclusion
It was concluded that mean square statistics were relatively independent of sample size for polytomous data and that misfit to the model could be identified using published recommended ranges.
Background
Although Rasch models [1] were originally designed and used for educational assessment, in recent years they have increasingly been used in health research. This renewed interest has largely been encouraged by a number of potential advantages of Rasch models over traditional psychometric methods, including the ability to decrease the number of items in a questionnaire to reduce patient burden whilst retaining the instrument's psychometric properties, and the pooling of data drawn from different samples, allowing more accurate parameter estimation. Recent studies in health have explored the use of Rasch models in instrument development [2-4] and in the modification of existing questionnaires [5-8], as well as in instrument and cross-linguistic comparison [9,10].
Rasch models are a family of measurement models [11] which can be used to describe latent traits, where items from questionnaires and person scores are located along the same scale of the latent trait. Item locations ("difficulties") and person measures ("abilities") are estimated separately to produce estimates for each parameter which are sample and item independent respectively [12]. Rasch models specify a number of criteria which, if fulfilled, result in interval scales where adjacent scores along the scale are equally spaced, a feature which is particularly important for interpreting clinically meaningful differences [13]. Firstly, the data should describe a unidimensional construct, that is, a single latent trait should explain the variance in the data. Unidimensionality can be assessed using principal components analysis of the residuals [14]. Secondly, item invariance stipulates that item (or person) parameters should be independent of the sample (or items) used. This criterion can be evaluated using differential item functioning analysis to determine whether item bias is present. The final criterion, which forms the focus of this paper, is item fit: whether individual items in a scale fit the Rasch model.
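The core of the family can be illustrated with the simplest, dichotomous member, where the probability of endorsing an item depends only on the difference between person ability (B) and item difficulty (D). The following Python sketch (the function name is our own illustration, not taken from any cited software) shows this relationship:

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: P = exp(B - D) / (1 + exp(B - D))."""
    logit = ability - difficulty
    return math.exp(logit) / (1.0 + math.exp(logit))

# When ability equals difficulty the probability of endorsement is exactly 0.5;
# a person 1 logit above the item endorses it with probability ~0.73.
print(rasch_probability(1.0, 1.0))                # 0.5
print(round(rasch_probability(2.0, 1.0), 2))      # 0.73
```

Because the probability depends only on the difference B - D, persons and items can be placed on the same logit scale, which is what allows the separate, sample-independent estimation described above.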
There has been, and continues to be, considerable debate about which fit statistic is the most appropriate, what range of values should be employed when evaluating fit, and how fit statistics should be interpreted [15,16].
The use of chi-square statistics, or infit and outfit mean squares, to assess item fit to the model (described in more detail below) has been advocated. The mean squares can be converted through a cube-root (Wilson-Hilferty) transformation to (infit/outfit) t-statistics.
The mean square fit statistics are perhaps the most commonly used fit statistics in health research. A series of ranges has been suggested [17] for evaluating item fit depending on the type of test; however, the majority of studies employ a range of 0.7 to 1.3. Despite the popularity of this approach, concerns have been voiced about the use of a single, universal range to evaluate fit and the lack of adjustment of the range to sample size. For instance, Smith et al. [16], using simulated dichotomous datasets, determined that Type I error rates (defined here as the probability of falsely rejecting an item as not fitting the Rasch model) were significantly less than α = 0.05 for both infit and outfit mean squares across a range of critical values (0.7, 0.8, 0.9 – 1.1, 1.2, 1.3). Furthermore, Type I error rates decreased for the outfit mean square as sample size was increased. In contrast, the Type I error rates for the t-statistics, although not equal to 5%, demonstrated fewer discrepancies.
More recently, studies [18] using data collected from a large sample of examinees have demonstrated that t-statistics may identify more items that do not fit the model than either the infit or outfit mean square fit statistics. For instance, the number of misfitting items identified by the t-statistic was more than four times greater than the number identified by the mean square fit statistic (23 and 5, respectively).
In addition to research on the dichotomous model, recent work on the polytomous (Rating Scale) model with simulated data has suggested that the variability of the mean squares depends on sample size and, furthermore, that the standard deviations of the t-statistics are generally smaller than their expected value (unity) [19]. These authors propose adjusting the critical range employed for both types of fit statistic according to sample size.
Finally, Smith & Suh [18] concluded that using mean square statistics may lead researchers to miss significant numbers of misfitting items, which may have an important impact on the development of unidimensional instruments, and that there is, furthermore, a need to understand the Type I error rates associated with critical values for fit statistics. On this basis Smith and colleagues [16,18] have suggested either that the t-statistic, rather than the weighted and unweighted mean squares, should be used to identify misfit, given that this statistic appears to be less sensitive to changes in sample size, or alternatively that the mean square fit statistics should be adjusted using a correction based on the square root of the sample size [16].
However, despite this assertion, a number of other methodological studies [15,20] have shown that the t-statistic is highly sample dependent.
The evaluation and identification of item misfit is critical to the development of unidimensional instruments, and reliable fit statistics play an important part in this. The literature offers health researchers little clear guidance in determining the most appropriate fit statistic to select when developing or modifying questionnaires. Previous research on simulated datasets has focused on the relationship between sample size and fit statistics at the level of groups of items. For test users, however, the emphasis is more on which fit statistics can identify misfit consistently for individual items. Identification and removal of misfitting items will not only reduce patient burden, but may also improve person measure assessment [5].
Therefore the aim of this study was to investigate the impact of sample size on four commonly used fit statistics, i.e. the infit/outfit mean squares and their t-statistics, for two polytomous Rasch models using data collected from a cancer patient sample.
The study attempted to determine: 1) whether fit statistics (and therefore Type I error rates, i.e. the probability of falsely rejecting an item which does fit the Rasch model) vary with sample size, and 2) whether there were any differences in this variation between the different types of fit statistic.
Methods
Participants
Patient data were pooled from a number of studies carried out by the Cancer Research UK Psychological Medicine Group, Western General Hospital, Edinburgh (Scotland) over the past decade. The data were collated from patients who completed a touchscreen version of both the HADS and the PHQ-9 in outpatient oncology clinics.
A total of 4072 patients completed the HADS (2781 females and 1291 males), and 3556 patients completed the PHQ-9 (2268 females and 1288 males). The average age of the sample was 60 years. Further clinical and demographic details are available from the published studies [21].
The studies from which these data have been drawn have all received ethical approval from the local research ethics committee.
Instruments
The Hospital Anxiety & Depression Scale (HADS)
The Hospital Anxiety and Depression Scale (HADS) [22] was originally developed for screening for psychological distress in the general medical population. The scale consists of 7 items forming a Depression subscale (HADS-D) and 7 items forming an Anxiety subscale (HADS-A). Patients are asked to rate how they have felt in the past week on a 4-point scale (scored 0–3). It has been claimed that scores on the two subscales may also be summed to provide a total score (HADS-T) measuring psychological distress [23]. Previous research in a large heterogeneous cancer population [6] has shown potential misfit on three of the instrument's items: Anxiety 6 ("I get a sort of frightened feeling") and Depression 5 and 7 ("I have lost interest in my appearance" and "I can enjoy a good book, radio or TV programme" respectively). This misfit was present both in the full, 14-item version of the HADS and in the individual subscales. Although a principal components analysis of the residuals did not reveal the presence of any additional factors, given the misfitting items the analysis presented here will focus on the two subscales, HADS-Anxiety (A) and HADS-Depression (D).
Patient Health Questionnaire (PHQ-9)
The Patient Health Questionnaire-9 (PHQ-9) [24] is a nine-item self-administered questionnaire which may be used for detecting and assessing the severity of depression. The instrument is based on the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [25] criteria for diagnosing depression, and is scored on a 4-point scale ("not at all" to "nearly every day"). Patients are asked to rate any problems experienced over the last two weeks.
Rasch Models
Both the Rating Scale and Partial Credit Models are members of the family of Rasch models [1]. The Rating Scale Model [26] is commonly employed to analyse Likert-type data [see Additional file 1]. As with all Rasch models, the Rating Scale Model describes a probabilistic relationship between item difficulty (D) and person ability (B). In addition, thresholds are derived for each pair of adjacent response categories in a scale. In general, for k response categories there are k - 1 thresholds, each with its own estimate of difficulty (F_k). The Rating Scale Model [see Additional file 1] describes the probability, P_ni, of a person with ability B_n choosing a given category with threshold F_k and item difficulty D_i. A single set of thresholds is estimated for all items in a scale. The Partial Credit Model [27] can be seen as a modification of the Rating Scale Model in which the threshold estimates are not constrained, that is, they are free to vary between items within a scale. Therefore for N items there will be N(k - 1) threshold estimates for the Partial Credit Model.
Additional file 1. The file format is in MS Word, and is entitled "Appendix 1". It contains five formulae with descriptions.
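As a concrete illustration of the Rating Scale Model described above, the following Python sketch (our own illustration, not drawn from the study's software) computes the probability of each response category from a person ability, an item difficulty and the shared set of thresholds:

```python
import math

def rsm_category_probs(ability, difficulty, thresholds):
    """Rating Scale Model category probabilities.

    thresholds: the k - 1 threshold parameters F_j shared by all items.
    Returns probabilities for the k response categories 0..k-1.
    """
    # Cumulative logit for category m is sum_{j<=m} (B - D - F_j);
    # the empty sum (category 0) is 0.
    logits = [0.0]
    running = 0.0
    for f in thresholds:
        running += ability - difficulty - f
        logits.append(running)
    denom = sum(math.exp(l) for l in logits)
    return [math.exp(l) / denom for l in logits]

# Four response categories (as on the HADS and PHQ-9) require three thresholds.
probs = rsm_category_probs(ability=0.5, difficulty=0.0,
                           thresholds=[-1.0, 0.0, 1.0])
print(round(sum(probs), 10))  # 1.0
```

Under the Partial Credit Model the same function would simply be called with a different `thresholds` list for each item, since the thresholds are then free to vary between items.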
Rasch Fit Statistics
Rasch fit statistics describe the fit of the items to the model. The mean square fit statistics follow a chi-square distribution and have an expected value of 1, where fit statistics greater than 1 can be interpreted as demonstrating more variation between the model and the observed scores; e.g. a fit statistic of 1.25 for an item would indicate 25% more variation (or "noise") than predicted by the Rasch model [11], in other words the item underfits the model. Conversely, an item with a fit statistic of 0.70 would indicate 30% less variation (or "overlap") than predicted, i.e. the item overfits the model. Items demonstrating more variation than predicted by the model can be considered as not conforming to the unidimensionality requirement of the Rasch model.
There are two commonly used mean square fit statistics, namely the infit mean square (also referred to as the weighted mean square) and the outfit (or unweighted) mean square. Both are derived from the squared standardised residuals for each item/person interaction [see Additional file 1]. The outfit mean square is the unweighted average of the standardised residual variance across items and persons, meaning that the estimate is relatively more affected by unexpected responses distant from the item or person measures. For the infit mean square the residuals are weighted by their individual variance (W_ni) [see Additional file 1] to minimise the impact of unexpected responses far from the measure; the infit mean square is therefore relatively more affected by unexpected responses closer to the item and person measures [11].
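Under these definitions, both statistics can be sketched in a few lines of Python (a simplified illustration assuming the observed responses, model-expected scores and model variances W_ni for one item are already available; names are ours):

```python
def fit_mean_squares(observed, expected, variances):
    """Infit (weighted) and outfit (unweighted) mean squares for one item.

    observed:  observed responses x_ni across n persons
    expected:  model-expected scores E_ni
    variances: model variances W_ni
    """
    # Squared standardised residuals: z^2 = (x - E)^2 / W
    z2 = [(x - e) ** 2 / w for x, e, w in zip(observed, expected, variances)]
    # Outfit: plain average, so distant unexpected responses dominate.
    outfit = sum(z2) / len(z2)
    # Infit: residuals weighted by their variance, sum (x - E)^2 / sum W,
    # damping the influence of responses far from the measure.
    infit = sum((x - e) ** 2 for x, e in zip(observed, expected)) / sum(variances)
    return infit, outfit

# Residuals that exactly match the model variance give mean squares of 1.0.
infit, outfit = fit_mean_squares([1, 0], [0.5, 0.5], [0.25, 0.25])
print(infit, outfit)  # 1.0 1.0
```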
The infit and outfit mean squares can be converted to approximately normalised t-statistics using the Wilson-Hilferty transformation [see Additional file 1]. These infit/outfit t-statistics have an expected value of 0 and a standard deviation of 1. They are evaluated against ± 2, where values greater than +2 are interpreted as demonstrating more variation than predicted (underfit) and values less than -2 as less variation than predicted (overfit).
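The transformation itself can be sketched as follows (an illustrative Python version; q, the model standard deviation of the mean square, is taken as given here, whereas in practice it is estimated from the item's variance terms):

```python
def wilson_hilferty_t(mnsq: float, q: float) -> float:
    """Cube-root (Wilson-Hilferty) transformation of a mean square into an
    approximately unit-normal t-statistic (ZSTD in Winsteps).

    q is the model standard deviation of the mean square; smaller samples
    give a larger q, so the same mean square yields a smaller |t|.
    """
    return (mnsq ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

# The same mean square (1.3) looks far more "significant" when q is small
# (large sample) than when q is large (small sample).
print(wilson_hilferty_t(1.3, 0.1) > wilson_hilferty_t(1.3, 0.5))  # True
```

This dependence of the t-statistic on q is the mechanism behind the sample size sensitivity examined in this study: as the sample grows, q shrinks and a fixed mean square is pushed further from zero on the t scale.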
Analysis
The relationship between sample size and fit statistics was explored using two Rasch models, the Rating Scale Model [26] and the Partial Credit Model [27]. The analysis was performed using Winsteps version 3.64 [14]. Eight sample sizes were used for each questionnaire: 25, 50, 100, 200, 400, 800, 1600, and 3200. Ten samples were drawn with replacement at each sample size for each of the two instruments. Therefore, for the HADS there were 1120 data points (14 items × 8 sample sizes × 10 samples), and for the PHQ-9 720 data points (9 items × 8 sample sizes × 10 samples). The ten samples collated at each sample size (25 to 3200) for each questionnaire were used to produce an average for each of the four fit statistics (infit/outfit mean square (MNSQ) and t-statistic (ZSTD in Winsteps)) for each item. This process was completed for both Rasch models.
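The resampling design can be sketched in Python (a hypothetical reimplementation of the sampling step only; the fit statistics themselves were computed in Winsteps, and the record identifiers below are placeholders):

```python
import random

def draw_resamples(records, sample_sizes, n_reps=10, seed=42):
    """Draw n_reps samples with replacement at each sample size,
    mirroring the study's 8 sizes x 10 replications design."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {n: [rng.choices(records, k=n) for _ in range(n_reps)]
            for n in sample_sizes}

sizes = [25, 50, 100, 200, 400, 800, 1600, 3200]
# e.g. 4072 placeholder patient record identifiers for the HADS sample
resamples = draw_resamples(list(range(4072)), sizes)
print(len(resamples[3200]), len(resamples[3200][0]))  # 10 3200
```

Each inner list would then be exported to Winsteps to obtain the four fit statistics, which are finally averaged across the ten replications at each size.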
Results
1. Fit Statistics – Type I error rate
Tables 1, 2, 3, 4 show the fit statistics for each item averaged across sample size and provide an indication of the Type I error rates. For both the HADS subscales and the PHQ-9, a Type I error rate of 5% would translate into approximately 1 misfitting item identified by chance alone.
Table 1. Fit statistics for the HADS subscale items (collapsed across sample size) for the Rating Scale Model
Table 2. Fit statistics for the HADS subscale items (collapsed across sample size) for the Partial Credit Model
Table 3. Fit statistics for the PHQ-9 items (collapsed across sample size) for the Rating Scale Model
Table 4. Fit statistics for the PHQ-9 items (collapsed across sample size) for the Partial Credit Model
Tables 1 and 2 demonstrate that for both HADS subscales there was a broad agreement between the infit and outfit statistics. In other words, the numbers of items identified as misfitting were relatively consistent for the infit and outfit versions of the same type of statistic irrespective of the Rasch model applied.
However, for the PHQ-9 (Tables 3 and 4) consistently more items were identified as misfitting by the t-statistics (infit/outfit) than by the equivalent mean square statistics. In terms of underfit to the model, the Type I error rate for the t-statistics was at least double that of the corresponding mean square: e.g. for the Rating Scale Model, the total number of items exceeding the threshold was 3 for the infit t-statistic against 1 for the infit mean square, and for the Partial Credit Model it was 2 for the infit t-statistic against 0 for the infit mean square. A similar pattern of results was also found for items overfitting the models.
Finally, the standard errors were uniformly smaller for the mean square statistics than for the t-statistics, indicating greater stability in the parameter estimates. This was particularly noticeable for the HADS, but also applied to some extent to the PHQ-9.
2. Fit Statistics – Sample Size
The relationship between sample size and fit statistics is shown in Tables 5, 6, 7, 8. This analysis has been broken down into overfitting items (MNSQ < 0.7 / t < -2) and underfitting/misfitting items (MNSQ > 1.3 / t > +2).
Table 5. HADS – Rating Scale Model Error rates by sample size (collapsing across items)
Table 6. PHQ-9 – Rating Scale Model Error rates by sample size (collapsing across items)
Table 7. HADS – Partial Credit Model Error rates by sample size (collapsing across items)
Table 8. PHQ-9 – Partial Credit Model Error rates by sample size (collapsing across items)
Overfitting items
It can be seen that for both the infit and outfit mean squares few items were identified with fit statistics < 0.7 for the HADS subscales (Tables 5 and 7) or the PHQ-9 (Tables 6 and 8). In contrast, the corresponding t-statistics (< -2) showed that as sample size increased, the number of items identified as overfitting rapidly increased. For instance, for the Rating Scale Model, the infit mean square for the HADS-D (Table 5) did not identify a single instance of overfit at any sample size, whereas the corresponding t-statistic identified no overfit between sample sizes 25 and 200, 1 instance at sample sizes of 400 and 800, and 2 instances at sample sizes of 1600 and beyond. This pattern was particularly evident for the HADS-A Partial Credit Model (Table 7). A similar pattern was also observed for the PHQ-9 (Table 8).
Underfitting items
A clear link between sample size and fit statistic was observed when comparing infit and outfit mean squares above 1.2 with t-statistics > +2. Once again, the number of items flagged by the t-statistics increased in proportion to sample size, whereas the mean square equivalents remained approximately invariant to changes in sample size (compare, for instance, the infit statistics on the Rating Scale Model for the HADS-D in Table 5, as well as for the PHQ-9 in Table 6). Additionally, more instances of misfit were generally identified when a mean square criterion of > 1.2 was used rather than 1.3, although this was not always consistently the case.
3. Fit Statistics, Sample size and individual items
Items not demonstrating misfit
In terms of agreement between the four statistics for individual items not exhibiting misfit, it can be seen from Table 1 that, for the HADS-A for instance, the infit and outfit mean squares agreed with their equivalent t-statistics on 5 items for the Rating Scale Model; similarly, there was agreement between the fit statistics for 4 items from the HADS-D. Slightly less consistency was observed for both subscales on the Partial Credit Model (Table 2) and for both models using the PHQ-9, although again there was agreement for the majority of items (Tables 3 and 4).
An example of an item (HADS-A1) which demonstrated fit across all four statistics is shown in Figure 1. Although Table 1 demonstrated fit for the t-statistics, it can be seen that whereas the item showed consistent (infit and outfit) mean square statistics (approx. 0.92) across sample sizes, the infit and outfit t-statistics became increasingly negative as sample size increased (> 200), with the t-statistics highlighting significant overfit for this item at sample sizes greater than 1600.
Figure 1. Infit and outfit statistics by sample size for HADS-Anxiety 1.
Overfitting items
For the Rating Scale Model one HADS-A item (7), one HADS-D item (6) and two PHQ-9 items (2 and 4) were identified as overfitting by the t-statistics but not by the mean squares. The Partial Credit Model demonstrated overfit for HADS-D6, as well as HADS-A3 and PHQ-9 item 2. Figure 2 demonstrates once again that whereas the mean squares remained consistent across sample sizes, the t-statistics became increasingly negative (sample size > 200).
Figure 2. Infit and outfit statistics by sample size for HADS-Depression 6.
Underfitting items
HADS-D5 and HADS-D7 were identified as underfitting on the Rating Scale Model (RSM) by both the infit and outfit t-statistics, but not by the mean squares, although neither was identified as misfitting on the Partial Credit Model (PCM); HADS-D4 was identified as misfitting (i.e. underfitting) on the Partial Credit Model by the infit t-statistic alone. PHQ-9 items 3 and 5 were identified as misfitting by both the infit and outfit t-statistics for the RSM and PCM, but not by the mean squares. Finally, HADS-A item 6 was consistently identified as misfitting (underfitting) by all four statistics, yet when the four statistics are plotted against sample size (Figure 3) it is apparent that this item was only identified by the t-statistics as misfitting once the sample size exceeded 200.
Figure 3. Infit and outfit statistics by sample size for HADS-Anxiety 6.
In summary, two patterns of misfit for the t-statistics could be discerned from the data: 1) instances where the mean square statistics fell within the critical range (0.7 – 1.3), i.e. "fit", and 2) instances where the mean square statistics fell outside this range, in particular exceeding 1.3 (misfit).
Items falling within the critical range (0.7 – 1.3) showed consistent mean squares (infit/outfit) as sample size increased; the corresponding t-statistics, on the other hand, increased with sample size (i.e. identified misfit where none was identified by the corresponding mean square). Items falling outside the critical range were consistently identified as misfitting by the mean squares, but only identified as such by the corresponding t-statistics once the sample size exceeded 200; beyond this point the t-statistics increased in proportion to sample size. In other words, the t-statistics only identified items as misfitting at larger sample sizes.
Discussion
The aim of this study was to explore the relationship between sample size and four commonly used fit statistics for two polytomous Rasch models. The results demonstrated that Type I error rates – defined strictly in this study as falsely rejecting an item as not fitting the Rasch model – for the t-statistics were at least twice those of the corresponding mean square statistics for both infit and outfit under both Rasch models. In addition, the analysis of sample size and fit statistic suggested that whereas the mean square fit statistics remained broadly constant in the number of items identified as fitting or misfitting (under and over), the instances of misfit identified by the t-statistics increased proportionally with sample size. Further analysis of individual item fit and sample size suggested that although in the majority of cases there was agreement between the mean square and t-statistics in identifying fit and misfit (> 50% for both models and instruments), there were discrepancies in the Type I error rate as defined in this study and a lack of sample size invariance for the t-statistics.
The results of the study suggest that t-statistics are highly dependent on sample size, which has the effect of inflating putative Type I error rates. Specifically, where the mean square statistics fell within the range 0.7 – 1.3, the t-statistics increased in magnitude as sample size increased; therefore for the t-statistic the Type I error rate was inflated, and the probability of identifying misfit where none was identified by the mean square statistics increased with sample size. Similarly, where the mean square statistics identified misfit outside the 0.7 – 1.3 range, the t-statistics only identified misfit once the sample size increased beyond 200.
In terms of Type I error rates, for the Rating Scale Model the outfit mean square statistics provided the most stable rates, whereas the infit mean squares were more stable for the Partial Credit Model, although there was little difference between the 1.2 and 1.3 criteria for mean squares in identifying misfit.
Taken together, these results suggest that both infit and outfit mean square statistics are relatively insensitive to sample size variation for polytomous data, whereas t-statistics may vary considerably with sample size. The latter finding confirms previously reported results from simulated datasets [15].
The cause of this sample size dependence for the t-statistics may lie with the standard deviations. Previous research has demonstrated that the variability of the mean squares decreases significantly with sample size [19]. As the t-statistics are derived from the mean squares and their standard deviations, it appears that the t-statistics are disproportionately affected by such decreases in variability. That t-statistics are highly dependent on the variance, and thereby on sample size, has also been demonstrated in previous studies with the dichotomous model [15].
Although the results for the t-statistics confirm those of previous studies (e.g. the "Knox Cube Test") [15], they differ markedly from the existing literature on simulated data using the dichotomous model [15,16], which has also suggested a significant sample size dependence for the mean square statistics. For instance, Karabatsos [15] generated datasets with sample sizes of 150, 500 and 1000 and test lengths of 20 and 50 items. Ability, θ, was distributed as N(0, 1) and item difficulty, δ, as U(-2, +2). Type I error rates were evaluated for both infit and outfit at critical values of 1.1, 1.2 and 1.3. The results indicated that both fit statistics were clearly a function of sample size, and of test length to a lesser extent.
This points to a potential discrepancy between the dichotomous and polytomous Rasch models in terms of Type I error rates: a dependence between sample size and fit for both the mean square and t-statistics under the dichotomous model, in contrast to the sample size independence of the mean square fit statistics for the polytomous models demonstrated in this study. Further research will be required to elucidate this issue.
There are a number of limitations to this study. 1) The primary limitation is that "real" data derived directly from patients were used rather than simulated data. Previous work on the HADS in particular had demonstrated the presence of misfitting items in the scale [6]. The aim was to observe how effectively the four fit statistics identified misfit and whether, and to what extent, this was affected by sample size. However, we acknowledge that estimates of true Type I error rates are more optimally derived from simulated data, where fit and misfit may be artificially manipulated. Further limitations reflect the fact that the data were restricted to cancer patients and to mental health questionnaires. Additionally, the relationship between sample size and instrument length was not explored, although there were modest differences in test length between the HADS and the PHQ-9. Finally, any potential interactions with dimensionality and item difficulty [15] were also not explored.
The presence of underfitting items in an instrument may have a potentially significant impact by severely degrading the measures, whereas overfitting items will tend to overestimate differences in raw scores [11]. The former may lead to under-detection of health problems (e.g. low screening efficacy); the latter may interfere with comparisons within and between individuals. Clearly the need to accurately identify misfitting, and particularly underfitting, items is paramount. This study demonstrated that low Type I error rates were evidenced by the mean square fit statistics, which appeared independent of sample size. The clinical impact of erroneously removing misfitting items has not been directly investigated; however, research suggests that the converse problem of retaining misfitting items (Type II errors) has little or no impact on the efficacy of, for instance, instruments used to screen for psychological distress [6]. Research on both the HADS [6] and the Geriatric Depression Scale [28] suggests that misfitting items may be removed from these instruments whilst maintaining, if not improving, screening efficacy (in terms of diagnosing cases of anxiety or depression) when compared with a gold standard psychiatric interview. Although the clinical implications of Type I and II errors need to be explored further, the results suggest that correctly identifying misfit has a direct benefit to patients by reducing the burden of the number of questions needing to be answered (whilst maintaining the efficacy of the instrument).
Conclusion
In summary, the study suggests that for polytomous Rasch models, when evaluating against accepted threshold criteria, the t-statistics are sample size dependent. In contrast, sample size invariance appears to hold for the mean square fit statistics. It may therefore be recommended that t-statistics should be adjusted, or interpreted with caution, when judging item fit or attempting to identify misfit, particularly for large samples and polytomous data.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LJF, GV and MS contributed the data for this study. The data analysis was performed by ABS and RR. The manuscript was drafted by ABS with critical comments provided by RR, LJF, GV and MS. All authors read and approved the final manuscript.
Acknowledgements
The authors wish to thank the patients who completed the questionnaires, as well as the research assistants responsible for the data collection. We are grateful to the reviewers for providing thoughtful comments and suggestions for the initial manuscript. The research was funded by Cancer Research – UK.
References

Rasch G: Probabilistic models for some intelligence and attainment tests. The University of Chicago Press: Chicago; 1980.

Bode RK, Cella D, Lai JS, Heinemann AW: Developing an initial physical function item bank from existing sources. Journal of Applied Measurement 2003, 4:124-136.

Lai JS, Cella D, Chang CH, Bode RK, Heinemann AW: Item banking to improve, shorten and computerize self-reported fatigue: an illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Quality of Life Research 2003, 12:485-501.

Smith AB, Rush R, Velikova G, Wall L, Wright EP, Stark D, Selby P, Sharpe M: The initial development of an item bank to assess and screen for psychological distress in cancer patients. Psycho-Oncology 2007, 16:724-732.

Pallant JF, Miller RL, Tennant A: Evaluation of the Edinburgh Post Natal Depression Scale using Rasch analysis. BMC Psychiatry 2006, 6:28.

Smith AB, Wright EP, Rush R, Stark DP, Velikova G, Selby PJ: Rasch analysis of the dimensional structure of the Hospital Anxiety and Depression Scale. Psycho-Oncology 2006, 15:817-827.

Smith AB, Wright P, Selby PJ, Velikova G: A Rasch and factor analysis of the Functional Assessment of Cancer Therapy-General (FACT-G). Health and Quality of Life Outcomes 2007, 5:19.

Smith AB, Wright P, Selby P, Velikova G: Measuring social difficulties in routine patient-centred assessment: a Rasch analysis of the social difficulties inventory. Quality of Life Research 2007, 16:823-831.

Holzner B, Bode RK, Hahn EA, Cella D, Kopp M, Sperner-Unterweger B, Kemmler G: Equating EORTC QLQ-C30 and FACT-G scores and its use in oncological research. European Journal of Cancer 2006, 42:3169-3177.

Petersen MA, Groenvold M, Bjorner JB, Aaronson N, Conroy T, Cull A, Fayers P, Hjermstad M, Sprangers M, Sullivan M: Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Quality of Life Research 2003, 12:373-385.

Bond TG, Fox CM: Applying the Rasch Model: Fundamental Measurement in the Human Sciences. Lawrence Erlbaum Associates: Hillsdale, New Jersey; 2001.

Suen HK: Principles of test theories. Lawrence Erlbaum Associates: Hillsdale, New Jersey; 1990.

Stucki G, Daltroy L, Katz JN, Johannesson M, Liang MH: Interpretation of change scores in ordinal clinical scales and health status measures: the whole may not equal the sum of the parts. Journal of Clinical Epidemiology 1996, 49:711-717.

Linacre JM: A User's Guide to WINSTEPS/MINISTEPS Rasch-Model Computer Programs. 2007.

Karabatsos G: A critique of Rasch residual fit statistics. Journal of Applied Measurement 2000, 1:152-176.

Smith RM, Schumacker RE, Bush MJ: Using item mean squares to evaluate fit to the Rasch Model. Journal of Outcome Measurement 1998, 2:66-78.

Smith RM, Suh KK: Rasch fit statistics as a test of the invariance of item parameter estimates. Journal of Applied Measurement 2003, 4:153-163.

Wang W-C, Chen C-T: Item parameter recovery, standard error estimates, and fit statistics of the Winsteps program for the family of Rasch models. Educational and Psychological Measurement 2005, 65:376-404.

Linacre JM: Size vs. Significance: Infit and Outfit Mean-Square and Standardized Chi-Square Fit Statistics.

Sharpe M, Strong V, Allen K, Rush R, Postma K, Tulloh A, Maguire P, House A, Ramirez A, Cull A: Major depression in outpatients attending a regional cancer centre: screening and unmet treatment needs. Br J Cancer 2004, 90:314-320.

Zigmond AS, Snaith RP: The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica 1983, 67:361-370.

Razavi D, Delvaux N, Farvacques C, Robaye E: Screening for adjustment disorders and major depressive disorders in cancer inpatients. Br J Psychiatry 1990, 156:79-83.

Kroenke K, Spitzer RL, Williams JB: The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine 2001, 16:606-613.

American Psychiatric Association: Diagnostic and Statistical Manual of Mental Disorders. 4th Edition, Text Revision (DSM-IV-TR). American Psychiatric Association: Washington, DC; 2000.

Andrich DA: A rating formulation for ordered response categories. Psychometrika 1978, 43:357-374.

Masters GN: A Rasch model for partial credit scoring. Psychometrika 1982, 47:149-174.

Tang WK, Wong E, Chiu HFK, Lum CM, Ungvari GS: The Geriatric Depression Scale should be shortened: results of Rasch analysis. International Journal of Geriatric Psychiatry 2005, 20:783-789.