Pfizer, Inc, 235 E. 42nd Street, New York, NY 10017, USA

University of Maryland School of Pharmacy, 220 Arch Street, 12th Floor, Baltimore, MD, 21201, USA

Pfizer, Inc, 500 Arcola Road, Collegeville, PA, 19426-3982, USA

Abstract

Implicit in the growing interest in patient-centered outcomes research is a growing need for better evidence regarding how responses to a given intervention or treatment may vary across patients, referred to as heterogeneity of treatment effect (HTE). A variety of methods are available for exploring HTE, each associated with unique strengths and limitations. This paper reviews a selected set of methodological approaches to understanding HTE, focusing largely but not exclusively on their uses with randomized trial data. It is oriented for the “intermediate” outcomes researcher, who may already be familiar with some methods, but would value a systematic overview of both more and less familiar methods with attention to when and why they may be used. Drawing from the biomedical, statistical, epidemiological and econometrics literature, we describe the steps involved in choosing an HTE approach, focusing on whether the intent of the analysis is for exploratory, initial testing, or confirmatory testing purposes. We also map HTE methodological approaches to data considerations as well as the strengths and limitations of each approach. Methods reviewed include formal subgroup analysis, meta-analysis and meta-regression, various types of predictive risk modeling including classification and regression tree analysis, series of n-of-1 trials, latent growth and growth mixture models, quantile regression, and selected non-parametric methods. In addition to an overview of each HTE method, examples and references are provided for further reading.

By guiding the selection of the methods and analysis, this review is meant to better enable outcomes researchers to understand and explore aspects of HTE in the context of patient-centered outcomes research.

Review

Background

Recent interest in ‘patient-centered’ outcomes research (PCOR) stems from a growing need for valid and reliable evidence that can be used by stakeholders to make individualized treatment decisions. Currently, patients, physicians and payers often make treatment or payment decisions based upon data representing the average effect of an intervention, as observed from selected pools of patients in a clinical trial setting. Reliance on trial outcomes data for real-world decision making requires an assumption that the study population from which the average was generated accurately represents the individual patient. However, a ‘real’ patient is likely different from the average trial patient in important ways, such as demographic, disease severity, or health behavior characteristics.

In terms of intervention outcomes, these differences could mean that the average effect observed in the study population may bear little resemblance to the real effect observed in the individual patient

Figure

Random Variation in Treatment Effect vs. Heterogeneity of Treatment Effect.

**Random Variation in Treatment Effect vs. Heterogeneity of Treatment Effect.**

Assessment of HTE is becoming more common in the medical literature, often in the form of subgroup analysis within RCTs

This article is intended to serve as a thorough discussion about how to explore, evaluate and evaluate HTE evidence, a primer of sorts, for researchers in all sectors but particularly those involved in product development requiring RCT’s, who are interested in incorporating HTE considerations into their RCT studies. Because, for a given study, aims and circumstances can vary widely, we have sought to avoid a prescriptive approach to the process of methods selection. Instead, we seek to provide the reader with a general framework, supplemented with sufficient background material that, when combined with examples and references, enable the researcher who is interested in developing their own HTE study with the tools needed to do so.

The HTE methodological approaches discussed below were informed by a literature review of HTE methods and selected to provide a review of a variety of techniques in the medical, statistics, and economics literature. In order to focus on methods useful in product development, they were also selected for their apparent relevance to RCT conduct and analysis. Non-randomized data analyses can inform trial design and real-world comparative effectiveness research (CER) studies, but are also subject to treatment selection biases which can significantly complicate HTE analysis. We considered such issues beyond the scope of this paper and so non-randomized data applications are included to a much lesser degree. Also beyond this paper’s scope is the background science needed to analyze genetically-driven differences, although the HTE methods discussed below may be used in an exploratory manner in that area.

We begin with relatively established approaches such as formal subgroup analysis of clinical trial data, and heterogeneity in meta-analysis of trials. We then discuss more exploratory approaches for HTE, particularly the family of predictive risk modeling approaches, with some detail on classification and regression tree (CART) analysis. This is followed by several approaches more explicitly accounting for intra-individual effects, albeit in quite different ways – latent growth and growth mixture models, and series of “n-of-1” trials - with the latter being an example of how an alternative trial design can be combined with specific methodological approaches to model HTE. Finally, we discuss some HTE methods receiving relatively more attention in the econometrics literature: quantile-based treatment heterogeneity and non-parametric methods testing for HTE. The overview of methods is followed by a discussion of how those with an interest in PCOR and HTE during product development can develop an appropriate research agenda by appropriately matching methods with study questions.

Subgroup analysis of clinical trial data

Widespread in use but contentious in value, subgroup analysis can assess whether observed clinical trial treatment effects are consistent across all patients, or whether HTE exists across the patient population’s demographic, biologic, or disease characteristics. Subgroup analyses may be pre-specified as part of a trial’s analysis plan; however, reviews of clinical trial reports suggest that subgroup analyses are often employed when a statistically significant effect is not detected in the overall population in an effort to identify a statistical effect of interest

Subgroup analysis should be both undertaken and interpreted with caution, especially when not pre-specified

Sample size and power are additional concerns, as most trials are powered to only detect treatment differences on the overall population level. Even in cases where a significant effect within a subgroup is detected, multiplicity is also a concern, as the likelihood of a false-positive response increases significantly with each additional analysis conducted

Several studies have demonstrated how spurious results from subgroup analyses can be easily, if inadvertently, generated

Meta-analysis

Meta-analysis is a technique that can be used to combine treatment effects across trials and their variations into an aggregated treatment effect with higher statistical power than observed in the individual trials. It can also provide an opportunity to detect HTE by testing for differences in treatment effects across similar RCTs. It is important that the individual treatment effects are similar enough for the pooling to be meaningful. If there are large clinical or methodological differences between the trials, an appropriate judgment may be to not perform a meta-analysis at all.

HTE across studies included in a meta-analysis may exist because of differences in the design or execution of the individual trials (such as randomization methods, patient selection criteria, handling of intermediate outcomes, differences in treatment modalities, etc). Cochran's Q, a technique to detect this type of statistical heterogeneity, is computed as the weighted sum of squared differences between each study's treatment effect and the pooled effects across the studies, and provides an indication of whether inter-trial differences are impacting the observed study result ^{2}, H^{2}, and R^{2} indices developed by Higgins and Thompson are closely related and measure the degree of statistical heterogeneity

A possible source of error in a meta-analysis is publication bias, i.e., the likelihood that a particular trial result is reported depends on the statistical significance and the direction of the result. Trial size might also result in publication bias since larger trials would be less likely to escape publication than smaller, less-known ones. Language and accessibility might be other factors. There are methods of identifying and adjusting for publication bias, e.g. the funnel plot which plots the effect size against the sample size and if no bias is present is shaped as a funnel

HTE can sometimes be minimized by choosing the measure with the smallest variance across trials. Common measures of treatment effect when comparing proportions are: the additive Risk Difference (RD), and the multiplicative Relative Risk (RR) and Odds Ratio (OR). The RR and the OR both have good statistical properties but are less intuitive to interpret than the less statistically efficient RD. It may be the case that one measure, such as the RR, does not vary across subgroups, while the RD does vary, and it likely will if a common RR is applied to different baseline risks

Meta-regression is a variant of meta-analytic technique that allows for a more in-depth understanding of the pooled clinical trial data, by “exploring whether a linear association exists between variables and a comparative treatment effect, along with the direction of that association”

Predictive risk modeling

A rapidly growing method for identifying potential for HTE is predictive risk modeling, whereby individual patient risk for disease-related events at baseline is differentiated based on observed factors. Most common measures are disease staging criteria, such as those used in COPD or heart failure, as well as more continuous algorithmic measures such as the Framingham risk scores for cardiovascular event risk

Initial predictive risk modeling, also known as risk function estimation, is often but not always performed prior to including treatment effects, and can employ a variety of methods. This is an area of extensive current research for more efficient algorithms, given the proliferating sources of individual patient data with large sample sizes, better data on outcomes, and large numbers of potential predictors. Traditional least squares or Cox proportional hazards regression methods are still quite appropriate in many cases and provide relatively more interpretable risk functions, but are typically based on linearity assumptions and may not give the highest scores for predictive metrics. Partial least squares is an extension of least squares methods that can reduce the dimensionality of the predictor space by interposing latent variables, predicted by linear combinations of observable characteristics, as the intermediate predictors of one or more outcomes

Given a risk function that generates pre-treatment event risk predictions for individual patients, one must choose how to use it with RCT data to evaluate HTE. For categorical risk predictions, methods such as subgroup interactions or stratified treatment analyses can be used, subject to the considerations discussed above. Continuous risk predictions can be interacted with the treatment response in a regression format, but questions about the nature of the interaction – linear, quadratic, logistic, etc. – must be managed. Some other techniques have been proposed as well. For example, Ioannidis and Lau propose dividing patients into quartiles based on predicted risks and analyzing accordingly

Classification and regression tree (CART) analysis

A decision-tree based technique, the Classification and Regression Tree (CART) approach considers how variation observed in a given response variable (continuous or categorical) can be understood through a systematic deconstruction of the overall study population into subgroups, using explanatory variables of interest

The key to CART is its ‘systematic’ approach to the development of the subgroups, which are constructed sequentially through repeated, binary splits of the population of interest, one explanatory variable at a time. In other words, each ‘parent’ group is divided into two ‘child’ groups, with the objective of creating increasingly homogeneous subgroups. The process is repeated and the subgroups are then further split, until no additional variables are available for further subgroup development. The resulting tree structure is oftentimes overgrown, but additional techniques are used to ‘trim’ the tree to a point at which its predictive power is balanced against issues of over-fitting. Because the CART approach does not make assumptions regarding the distribution of the dependent variable, it can be used in situations where other multivariate modeling techniques often used for exploratory predictive risk modeling would not be appropriate – namely in situations where data are not normally distributed.

CART analyses are useful in situations where there is some evidence to suggest that HTE exists, but the subgroups defining the heterogeneous response are not well understood

Latent growth and growth mixture modeling

Latent growth modeling (LGM) is a structural equation modeling technique that captures inter-individual differences in longitudinal change corresponding to a particular treatment. In LGM, patients’ different timing patterns of the treatment effects are the underlying sources of HTE. Not only does LGM distinguish patients who do or do not respond, it also examines whether the patient responds quickly or slowly, and if they have temporary or durable responses. The heterogeneous individual growth trajectories are estimated from intra-individual changes over time by examining common population parameters, i.e., slopes, intercepts, and error variances. For example, each individual has unique initial status (intercept) and response rate (slope) during a specific time interval. The variances of all individuals’ baseline measures (intercepts) and changes (slopes) in health outcomes represent the degree of HTE. The HTE of individual growth curves identified in LGM can also be attributed to observed predictors, including both fixed and time varying covariates. Duncan & Duncan provide a non-technical introduction to LGM

However, the assumption that all individuals are from the same population in LGM is too restrictive in some research scenarios. If the HTE is due to observed demographic variables, such as age, gender, and marital status, one may utilize multiple-group LGM. Despite its successful applications for modeling longitudinal change, there may be multiple subpopulations with unobserved heterogeneities. Growth mixture modeling (GMM), built upon LGM, allows the identification and prediction of unobserved subpopulations in longitudinal data analysis. Each unobserved subpopulation may constitute its own latent class and behave differently than individuals in other latent classes. Within each latent class, there are also different trajectories across individuals; however, different latent classes don’t share common population parameters. For example, Wang and Bodner use a simulated dataset to study retirees’ psychological well-being change trajectory when multiple unknown subpopulations exist

Wang and Bodner and Jung & Wickrama provide intuitive introductions to GMM

Series of n of 1 trials

Combined (aka, “series of”) n-of-1 trial data provide a unique way to identify HTE. An n-of-1 trial is a repeated crossover trial for a single patient, which randomly assigns the patient to one treatment vs. another for a given time period, after which the patient is re-randomized to treatment for the next time period, usually repeated for 4-6 time periods. Such trials are most feasibly done in chronic conditions, where little or no washout period is needed between treatments and treatment effects are identifiable in the short-term, such as pain or reliable surrogate markers

Combining data from identical n-of-1 trials across a set of patients allows for a powerful statistical analysis which can control for patient fixed or random effects as well as covariate, center, or sequence effects. These combined trials are often analyzed within a Bayesian context using shrinkage estimators that combine individual and group mean treatment effects to create a “posterior” individual mean treatment effect estimate which is a form of inverse variance-weighted average of the individual and group effects. These trial conduct and analysis steps are illustrated in Figure

Series of N-of-1 Trials, Conduct and Analysis Steps.

**Series of N-of-1 Trials, Conduct and Analysis Steps.**

Zucker et al furnish a good example of the information provided by analysis of combined n-of-1 trials

Combined n-of-1 trials could be used for early exploratory testing for HTE, or for later-phase, more focused testing of comparative effectiveness or cost-effectiveness

Quantile regression

Quantile regression provides additional distributional information about the central tendency and statistical dispersion of the treatment effect in a population, which is not normally revealed by the conventional mean estimation in RCTs. For example, patients with different comorbidity scores may respond differently to a treatment. Quantile regression has the ability to reveal HTE according to the ranking of patients’ comorbidity scores or some other relevant covariate by which patients may be ranked. Therefore, in an attempt to inform patient-centered care, quantile regression provides more information on the distribution of the treatment effect than typical conditional mean treatment effect estimation. The quantile treatment effect (QTE) characterizes the heterogeneous treatment effect on individuals and groups across various positions in the distributions of different outcomes of interest. This unique feature has given quantile regression analysis substantial attention and has been employed across a wide range of applications, particularly when evaluating the economic effects of welfare reform

One caveat of applying quantile regression in clinical trials for examining HTE is that QTE doesn’t demonstrate the treatment effect for a given patient. Instead, quantile regression focuses on the treatment effect among subjects within the ^{th} percent in terms of blood pressure or a depression score for some covariate of interest, for example, comorbidity score. It is not uncommon for the

Quantile Regression Estimation of Treatment Effects.

**Quantile Regression Estimation of Treatment Effects. **In the above Figure, quantile regression is used to estimate treatment effects across a range of survival time quantiles (

The literature on HTE of the impact of welfare reform has focused on mean treatment effects across demographic subgroups. This leads us to assume that the potential HTE results from observed differences in some demographic characteristics. It is possible that the statistical significance of observed HTE from subgroup analysis is due to some outliers in the dataset, especially when the number of patients in a subgroup is relatively small. Quantile regression has the advantage of being robust to outliers. In a RCT where outliers are a potential issue, QTE will certainly have the superior performance compared with subgroup analysis and provide more convincing evidence for HTE.

Nonparametric methods

Nonparametric methods have a variety of approaches and advantages for dealing with HTE in RCTs. Different nonparametric methods, such as kernel smoothing methods and series methods, can be used to generate test statistics for examining the presence of HTE. A kernel method is a weighting scheme based on a kernel function (e.g. uniform, Epanechnikov, and Gaussian). When evaluating the treatment effect of a patient in RCTs, the kernel method assigns larger weights to those observations with similar covariates. This is done because it is assumed that patients with similar covariates provide more relevant data on predicted treatment response. For patients who have significantly different backgrounds (demographically, clinically, and contextually speaking), kernel smoothing methods still utilize information from highly divergent patients when estimating a particular patient’s treatment effect; however, lower weights are given to patients who are very different. Additionally, kernel methods require choosing a set of smoothing parameters to group patients according to their relative degree of similarities. Figure

Nonparametric Regression Estimation of Treatment Effects.

**Nonparametric Regression Estimation of Treatment Effects. **Blood-sugar measurements are a common tool in diabetes testing. In a glucose-tolerance test, the glucose level in blood is measured after a period of fasting (fasting-glucose measurement) and again 1 h after giving the subject a defined dose of glucose (postprandial glucose measurement). Pregnant women are prone to develop subclinical or manifest diabetes, and establishing the distribution of blood-glucose levels after a period of fasting and after a dose of glucose is therefore of interest.The above Figure shows that bivariate nonparametric regression to the mean for glucose measurements for 52 women, with repeated measurements over three pregnancies. Circles are observed sample means obtained from the three repetitions of the standardized values of (fasting glucose, postprandial glucose). Arrows point from observed to predicted values. It is clear that different women have different treatment effects represented by the directions of the arrows.Reference: Müller et al (2003) [58]. Reproduced with permission.

Nonparametric test statistics have been proposed in the literature as powerful tools to identify potential HTE. Crump et al

The significance of utilizing nonparametric models lies in the less restrictive assumptions (i.e., differentiability and moment conditions) imposed in comparison with the functional form assumptions of their parametric counterparts. More often than not, the structure in parametric models implicitly assumes a homogeneous treatment effect. Therefore, some nonparametric regression frameworks are flexible in their designs so that they permit HTE across individual patients. Rather than providing a p-value for the existence of HTE, the nonparametric regression frameworks may present treatment effects that vary among patients, from which the distribution of the response to a treatment is observable. The underlying hypothesis is that differential treatment response can be explained by differences in patients’ demographic characteristics, clinical variables, and contextual variables. Frolich considers a local likelihood logit model for binary dependent variables

Discussion and conclusion

Two factors motivated the generation of this primer. First was a growing recognition of the interest in and need for more granular and patient-centric data with which individualized treatment decisions could potentially be made. Second was a realization that there were no unifying guiding principles for those researchers who might be interested in exploring HTE as part of a PCOR agenda.

Figure

Decision Process for Choosing a Method for HTE.

**Decision Process for Choosing a Method for HTE.**

Once the level of prior information is established, a second consideration relates to the development of HTE study objectives. Key elements include the treatments to be studied as well as the prior evidence about the nature and sources of HTE, which may be population-based or treatment-based or both. These will inform the intent of the HTE investigation, whether it is to be largely exploratory or testing specific hypotheses. Early stage studies with little prior evidence may need to be more exploratory in nature. Subsequent phase 3 studies may need to determine the most appropriate doses and populations for initial labeling and may be designed to test very specific hypotheses already formed by such studies.

Another key element is the nature of the data available for the study. Existing trial or real-world data may be sufficient for the intent of the study, or may be the only study options; otherwise, new data collection may be part of the study design. Finally, the researcher must choose the specific method(s) to test for HTE in the study data. Table

**
Meta-analysis
**

**
CART
**

**
N of 1 trials
**

**
LGM/GMM*
**

**
QTE**
**

**
Nonparametric
**

**
Predictive risk models
**

* LGM/GMM: Latent growth modeling/Growth mixture modeling.

**QTE: Quantile treatment effect.

The importance of ‘context’ for HTE studies goes beyond just methodological concerns. The recent and growing interest in patient-centered treatment means that HTE studies are increasingly likely to be used in clinical decision-making. However, the hope that HTE evidence can serve to significantly improve patient outcomes needs to be balanced against questions regarding the reliability of the scientific methodology used to identify the HTE in question, the weight of what is already understood about conditions in which HTE studies are developed, and new concerns that may arise as additional HTE evidence is generated.

An example of such a concern is whether it is sufficient to be able to detect that a given population responds differentially to a treatment. What if this differential response is novel, was not previously detected in prior attempts at HTE investigation, but was only discovered using a relatively new statistical technique? Should those patients who reflect the differential response population be treated differently – and if so, how?

The key question facing researchers and policy makers is as follows: what level of evidence is required before treatment paradigms may change on the basis of HTE data, and how do we understand and accept the validity of this evidence in a landscape where new tools for detection are consistently being developed? This question is only likely to become more urgent as increased availability of electronic data sources yields more and more research that could profoundly impact clinical treatment paradigms. While there is great hope that recent efforts like those being undertaken by the Patient Centered Outcomes Research Institute will result in the development of better evidence and improved decision-making ability for all stakeholders, significant work is needed to standardize and build consensus around which methods are most appropriate to be used to generate this evidence.

Despite the high levels of enthusiasm and funding directed towards evidence generation, key questions regarding dissemination and assimilation of evidence into clinical practice remain. In the case and context of HTE studies, it will be crucial to further understand what types of analyses are most likely to impact clinical decision-making behavior. By contextualizing various existing HTE methods that could potentially be used against a novel framework, which highlights both prior evidence as well as other considerations related to elements of study design, this primer sought to fill what seems to be an increasingly important gap in the outcomes research literature.

Appendix – notes on estimation routines

Standard meta-analysis like fixed and random effect models, and tests of heterogeneity, together with various plots and summaries, can be found in the R-package rmeta (

Abbreviations

HTE: Heterogeneity of treatment effect; PCOR: Patient-centered outcomes research; RCT: Randomized clinical trial; CER: Comparative effectiveness research; CART: Classification and regression tree; RD: Risk difference; RR: Risk ratio; OR: Odds ratio; LGM: Latent growth mixture; GMM: Growth mixture modeling; QTE: Quantile treatment effect.

Competing interests

C. Daniel Mullins, PhD has served as a consultant and scientific advisor to Pfizer Inc and has received consulting income and research grants from Pfizer Inc. Zhiyuan Zheng, PhD served as a consultant to Pfizer Inc in development of this manuscript. Richard Willke, Prasun Subedi and Rikard Althin are employees of Pfizer Inc. These affiliations are not felt to have created any relevant competing interests given the nature of this manuscript.

Authors’ contributions

All authors made substantial contributions to the conceptualization and writing of this manuscript. The HTE topics included and their appropriate uses were decided on during joint discussions among the co-authors, and each co-author was responsible for writing at least one major section of the manuscript. All authors read and approved the final manuscript.

Authors’ information

The authors are health economists by training who work in the field of health economics and outcomes research with a focus on pharmaceuticals. The authors have observed the variety of HTE study methods becoming popular in different disciplines, and have taken a particular interest in how different methods can best contribute to the evidence regarding the optimal use of new treatments among different individuals.

Acknowledgements

We would like to acknowledge this journal’s referees and editor for their very helpful comments, as well as Lisa Blatt and Joanna Monday for their assistance in the preparation of this manuscript.

Pre-publication history

The pre-publication history for this paper can be accessed here: