Open Access Research article

Methodological issues regarding power of classical test theory (CTT) and item response theory (IRT)-based approaches for the comparison of patient-reported outcomes in two groups of patients - a simulation study

Véronique Sébille1,2*, Jean-Benoit Hardouin1, Tanguy Le Néel1, Gildas Kubis1, François Boyer3,4, Francis Guillemin5 and Bruno Falissard6,7

Author Affiliations

1 EA 4275 "Biostatistique, recherche clinique et mesures subjectives en santé", Faculté de Pharmacie, Université de Nantes, 1 rue Gaston Veil, 44035 Nantes Cedex 1, France

2 Plateforme de Biométrie, Cellule de promotion de la recherche clinique, CHU de Nantes, France

3 University of Reims Champagne-Ardenne, Faculty of medicine - EA 3797, 35 rue Cognacq Jay, F-51095, Reims, France

4 Department of Physical Medicine and Rehabilitation, Sebastopol Hospital, 48 rue de Sébastopol, F-51092, Reims, France

5 Nancy-Université, Paul Verlaine Metz University, Paris Descartes University, EA 4360 Apemac, Nancy, France

6 INSERM 669, Université Paris-Sud, and Université Paris Descartes, Paris, France

7 AP-HP, Hôpital Paul Brousse, Département de santé publique, Villejuif, France

BMC Medical Research Methodology 2010, 10:24  doi:10.1186/1471-2288-10-24

Published: 25 March 2010



Patient-Reported Outcomes (PRO) are increasingly used in clinical and epidemiological research. Two main types of analytical strategies are used for these data: classical test theory (CTT), based on the observed scores, and models derived from Item Response Theory (IRT). However, whether IRT or CTT is the more appropriate method for analysing PRO data remains unknown. The statistical properties of CTT and IRT, regarding power and corresponding effect sizes, were compared.


Two-group cross-sectional studies were simulated to compare PRO data using IRT-based or CTT-based analysis. For IRT, different scenarios were investigated according to whether item or person parameters were assumed to be known or unknown. Known item parameters were considered at levels of precision ranging from good to poor, and unknown parameters had to be estimated. The powers obtained with IRT and CTT were compared, and the parameters with the strongest impact on power were identified.
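As an illustration of the kind of simulation described above, the following is a minimal sketch, not the authors' actual procedure. It assumes a Rasch model for dichotomous items, a standard normal latent trait in the reference group, and a CTT-style analysis that compares mean sum scores with a large-sample z-test. The function names, the item difficulties and all numerical settings are hypothetical choices made for the example.

```python
import math
import random
from statistics import NormalDist, mean, variance


def simulate_scores(n, group_mean, item_difficulties, rng):
    """Return the sum scores of n simulated patients.

    Assumption: responses follow a Rasch model and the latent trait
    is normal with unit variance around group_mean.
    """
    scores = []
    for _ in range(n):
        theta = rng.gauss(group_mean, 1.0)  # latent trait of one patient
        score = 0
        for b in item_difficulties:
            # Rasch probability of a positive response to an item of difficulty b
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            score += 1 if rng.random() < p else 0
        scores.append(score)
    return scores


def ctt_power(n_per_group, effect, item_difficulties,
              n_sim=500, alpha=0.05, seed=1):
    """Estimate the power of a score-based (CTT) two-group comparison.

    Power is the proportion of simulated studies in which a two-sided
    z-test on the mean sum scores rejects the null hypothesis.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        g0 = simulate_scores(n_per_group, 0.0, item_difficulties, rng)
        g1 = simulate_scores(n_per_group, effect, item_difficulties, rng)
        se = math.sqrt(variance(g0) / n_per_group + variance(g1) / n_per_group)
        if se > 0 and abs(mean(g1) - mean(g0)) / se > z_crit:
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    items = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item difficulties
    print("estimated power:", ctt_power(50, 0.8, items))
    print("estimated type I error:", ctt_power(50, 0.0, items))
```

Changing the length of the hypothetical `items` list is one simple way to explore how the number of items affects the estimated power in such a sketch.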


When person parameters were assumed to be unknown and item parameters either known or unknown, the powers achieved using IRT and CTT were similar and always lower than the power expected from the well-known sample size formula for normally distributed endpoints. The number of items had a substantial impact on power for both methods.
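For reference, the sample size formula mentioned above is presumably the standard one for comparing two means of a normally distributed endpoint, n = 2(z_{1-α/2} + z_{1-β})² σ² / δ² per group. The sketch below implements that formula and its inverse under this assumption. The function names and default values are hypothetical and are not taken from the article.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Patients needed per group to detect a mean difference delta.

    Assumption: two-sided test, equal group sizes, common standard
    deviation sigma, normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


def expected_power(n, delta, sigma, alpha=0.05):
    """Power expected from the same formula for n patients per group."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta / sigma * math.sqrt(n / 2) - z_alpha)


if __name__ == "__main__":
    n = n_per_group(delta=0.5, sigma=1.0)
    print(n)  # 63 per group for a standardized effect of 0.5
    print(round(expected_power(n, 0.5, 1.0), 3))
```

The result reported above means that simulated powers fell below the value returned by a calculation of this kind.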


In the absence of missing data, IRT and CTT seem to provide comparable power. The classical sample size formula seems to be adequate for CTT under some conditions but is not appropriate for IRT. In IRT, it seems important to account for the number of items to obtain an accurate formula.