Inserm UMR-S 707, Paris, France

EHESP School of Public Health, Rennes-Sorbonne Paris Cité, Paris, France

UPMC-Sorbonne Université, Paris, France

Registre de Dialyse Péritonéale de Langue Française, Pontoise, France

Nephrology Department, CHU Clemenceau, Caen, France

Abstract

Background

Directed acyclic graphs (DAGs) are an effective means of presenting expert-knowledge assumptions when selecting adjustment variables in epidemiology, whereas the change-in-estimate procedure is a common statistics-based approach. As DAGs imply specific empirical relationships which can be explored by the change-in-estimate procedure, it should be possible to combine the two approaches. This paper proposes such an approach which aims to produce well-adjusted estimates for a given research question, based on plausible DAGs consistent with the data at hand, combining prior knowledge and standard regression methods.

Methods

Based on the relationships laid out in a DAG, researchers can predict how a collapsible estimator (e.g. risk ratio or risk difference) for an effect of interest should change when adjusted on different variable sets. Implied and observed patterns can then be compared to detect inconsistencies and so guide adjustment-variable selection.

Results

The proposed approach involves (i) drawing up a set of plausible background-knowledge DAGs; (ii) starting with one of these DAGs as a working DAG, identifying a minimal variable set, S, sufficient to control for bias on the effect of interest; (iii) calculating a collapsible effect estimate adjusted on S, then adjusted on S plus each variable not in S in turn (“add-one pattern”), and then adjusted on S minus each variable in S in turn (“minus-one pattern”); (iv) checking the observed add-one and minus-one patterns against the patterns implied by the working DAG and the other prior DAGs; (v) reviewing the DAGs, if needed; and (vi) presenting the initial and all final DAGs with estimates.

Conclusion

This approach to adjustment-variable selection combines background-knowledge and statistics-based approaches using methods already common in epidemiology and communicates assumptions and uncertainties in a standardized graphical format. It is probably best suited to areas where there is considerable background knowledge about plausible variable relationships. Researchers may use this approach as an additional tool for selecting adjustment variables when analyzing epidemiological data.

Background

Adjustment-variable selection in epidemiology can be broadly grouped into background knowledge-based and statistics-based approaches. Directed acyclic graphs (DAGs) have come to be a core tool in the background-knowledge approach as they allow researchers to present assumed relationships between variables graphically and, based on these assumptions, to identify variables to adjust for confounding and other biases

To our knowledge, only one methodological article in epidemiology to date has explicitly looked at combining background knowledge in DAGs with a statistical selection procedure for variable selection

In this article, we propose an approach to adjustment-variable selection which aims to produce well-adjusted estimates for a given research question based on plausible DAGs which are also consistent with the data at hand, and to clearly communicate assumptions and uncertainties underlying the estimates in DAG format. It asks researchers to lay out prior assumptions about variable relationships in one or more prior DAGs, uses the change-in-estimate patterns in the data to refine and revise these DAGs, and presents the prior and final DAGs with corresponding estimates. The approach is based on recent theoretical results regarding confounding equivalence (c-equivalence)

Methods

DAGs and minimally sufficient adjustment variable sets

In this article, we assume that the reader is familiar with the terminology of and rules for reading DAGs. There are now many introductions to DAGs for epidemiologists.

DAGs allow the identification of the variable set or sets sufficient to adjust for confounding and other biases, based on the variable relationships shown. Greenland et al.

Drawing up prior DAGs

The first step is preparing a set of DAGs which encode prior, expert knowledge about variable relationships and show the major prior uncertainties. These DAGs should include

1. all measured variables considered relevant, including those routinely used for adjustment in the research area (e.g. sex) even if not thought

2. plausible proxy and measurement error relations;

3. plausible unmeasured parents with two or more children in the DAG; and

4. participation or selection variables conditioned upon during data-collection, including voluntary participation by subjects and restriction of the study to particular groups, such as hospitalized patients.

In most cases, more than one prior DAG will be needed to show the main uncertainties in variable relationships, including the presence or absence of arrows between variables, arrow direction, and the presence of unmeasured variables.

It is important to consider the source population of the data in preparing the prior DAG or DAGs. As much prior knowledge will come from research in other contexts, there will be cases when a researcher judges that an association between variables found in other studies does not apply in his or her dataset. For example, socioeconomic status may have an association with access to healthcare in systems with large out-of-pocket payments but not in well-functioning nationalized systems. In this case, the researcher needs to explain, based on knowledge about the source populations, why he or she has chosen not to connect two variables which other researchers would connect. Possible differences in source populations should also be borne in mind when revising the DAG, as discussed below.

Using minimally sufficient adjustment sets to compare a DAG with data

For any given DAG, a researcher can identify the minimally sufficient adjustment set or sets for the effect of interest. Once done, he or she can identify the changes expected in this estimate when adjusting on different variable sets according to the DAG. To do this, we need to assume compatibility, faithfulness

Given the above, a collapsible effect estimate conditional on a minimally sufficient adjustment set will not change when estimated on this set plus the variables excluded from the set, provided that the excluded variables are not mediators (or ancestors or descendants of mediators) lying on an open path or colliders (or descendants of colliders) which, if conditioned upon, would open the path on which they lie. Conversely, a collapsible effect estimate conditional on a minimally sufficient adjustment set should change when estimated on this set minus any variable in the set. This allows a researcher to identify the change-in-estimate pattern implied by the DAG and so compare it with the observed pattern from the data.
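The rule above can be illustrated with a small sketch that computes a standardized risk difference (RD) exactly, without sampling, for a hypothetical DAG that is not one of the figures in this article: C1 confounds A→Y, C2 affects Y only, and M mediates part of the effect. The variable names, the probabilities, and the `standardized_rd` helper are all assumptions made for this example.

```python
from itertools import product

VARS = ("C1", "C2", "A", "M", "Y")

def bern(p, x):
    """P(X = x) for a binary X with P(X = 1) = p."""
    return p if x == 1 else 1.0 - p

def joint(c1, c2, a, m, y):
    """Joint probability implied by the hypothetical DAG (all numbers assumed)."""
    return (bern(0.5, c1)                 # C1: confounder of A -> Y
            * bern(0.3, c2)               # C2: affects Y only
            * bern(0.2 + 0.6 * c1, a)     # C1 -> A
            * bern(0.2 + 0.6 * a, m)      # A -> M (mediator)
            * bern(0.1 + 0.3 * c1 + 0.2 * c2 + 0.3 * m + 0.05 * a, y))

# Full joint distribution over the five binary variables.
TABLE = [(dict(zip(VARS, v)), joint(*v)) for v in product((0, 1), repeat=5)]

def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(p for row, p in TABLE
               if all(row[k] == val for k, val in fixed.items()))

def standardized_rd(adjust):
    """Risk difference for A on Y, standardized over the adjustment set."""
    rd = 0.0
    for values in product((0, 1), repeat=len(adjust)):
        s = dict(zip(adjust, values))
        risk1 = prob(Y=1, A=1, **s) / prob(A=1, **s)
        risk0 = prob(Y=1, A=0, **s) / prob(A=0, **s)
        rd += (risk1 - risk0) * prob(**s)
    return rd

rd_S = standardized_rd(["C1"])            # minimally sufficient set
rd_add_C2 = standardized_rd(["C1", "C2"]) # add a non-mediator, non-collider
rd_add_M = standardized_rd(["C1", "M"])   # add a mediator
rd_minus_C1 = standardized_rd([])         # remove the confounder
print(round(rd_S, 4), round(rd_add_C2, 4), round(rd_add_M, 4), round(rd_minus_C1, 4))
```

Under these assumed probabilities the estimate adjusted on {C1} is 0.23; adding C2 leaves it unchanged, adding the mediator M moves it to 0.05, and removing C1 moves it to 0.41.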

Practically, we propose the following steps for this. Sample R-code is in Additional file

1. Draw up the DAGs encoding prior, expert knowledge and the main prior uncertainties as described above and select an initial working DAG from this set (the most plausible DAG);

2. From the working DAG, identify a minimally sufficient adjustment set, S, for the effect of interest (A→Y);

3. Using a collapsible estimator, estimate A→Y conditional on S;

4. Re-estimate A→Y conditional on S plus each of the variables not included in S in turn (“add-one pattern”);

5. Plot each estimate on a single graph, thereby showing differences in the estimates between the models;

6. Repeat steps 4 and 5 but deleting each variable in turn from S (“minus-one pattern”);

7. Determine whether the add-one and minus-one patterns found are consistent with the working DAG;

8. If the patterns are consistent with the working DAG, check to see if any of the other prior DAGs give the same expected patterns. Take all prior DAGs with consistent patterns as the revised working DAGs and move to step 11;

9. If the patterns are not consistent with the working DAG, check to see if any of the other prior DAGs imply the patterns as observed. Take all such consistent prior DAGs as the revised working DAGs and move to step 11;

10. If the patterns are not consistent with the working DAG or with any of the other prior DAGs, undertake an

11. Repeat steps 2 to 11 for each revised working DAG, moving to step 12 when there are no inconsistent add-one and minus-one patterns;

12. Present the prior and all final DAGs with corresponding effect estimates.
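Steps 3 to 6 might be organized as in the sketch below. It is not the sample R code from the additional file: the `change_patterns` and `report` helpers are hypothetical, the estimates are made-up placeholder numbers standing in for a fitted collapsible estimator, and the 10% relative threshold is only one possible definition of a meaningful change.

```python
def change_patterns(S, measured, estimate):
    """Estimate on S, on S plus each other variable, and on S minus each member."""
    S = list(S)
    base = estimate(S)
    add_one = {v: estimate(S + [v]) for v in measured if v not in S}
    minus_one = {v: estimate([w for w in S if w != v]) for v in S}
    return base, add_one, minus_one

# Placeholder estimates keyed by adjustment set (made-up numbers, not results).
PLACEHOLDER = {
    frozenset(): 0.25,
    frozenset({"C1"}): 0.10,
    frozenset({"C1", "C2"}): 0.10,
    frozenset({"C1", "C3"}): 0.10,
    frozenset({"C1", "C4"}): 0.04,
    frozenset({"C1", "C5"}): 0.16,
}

def estimate(adjust):
    """Stand-in for fitting a model; looks up the placeholder estimate."""
    return PLACEHOLDER[frozenset(adjust)]

def report(label, estimates, base, rel=0.10):
    """Flag estimates differing from base by more than rel * |base|."""
    lines = []
    for v, est in estimates.items():
        flag = "change" if abs(est - base) > rel * abs(base) else "no change"
        lines.append(f"{label} {v}: {est:+.2f} ({flag})")
    return lines

base, add_one, minus_one = change_patterns(
    ["C1"], ["C1", "C2", "C3", "C4", "C5"], estimate)
for line in report("add", add_one, base) + report("minus", minus_one, base):
    print(line)
```

In practice `estimate` would fit the chosen regression or standardization model on the data, and the reported values would be plotted on a single graph as in step 5.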

**(Reviewing a DAG when implied and observed patterns are incompatible; Additional information on the empirical example; Sample R code for the add-one and minus-one graphs).**

The key to step 7 is recognizing when the observed patterns are consistent with the patterns implied by the DAG. If S is minimally sufficient, the add-one pattern is consistent if the only meaningful changes arise when conditioning on mediators lying on open paths from A to Y or when conditioning on colliders which open a path from A to Y. All variables in S should show meaningful minus-one changes, but this may not always be the case in practice because of incidental cancellations (see Discussion). Once familiar with the rules of DAGs, it is straightforward for a researcher to identify the expected changes for any adjustment set for a given DAG: for example, if adjusting on {C_{1},C_{3}} in Figure
_{2} and a change for C_{4} and C_{5}. The implied minus-one pattern is a change for C_{1} and C_{3}.
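The comparison in steps 7 to 9 amounts to matching an observed pattern against the pattern each prior DAG implies for the same starting set. The sketch below assumes the starting set {C1, C3} and reads the example above as implying no add-one change for C2; the DAG labels, the second implied pattern, and the observed pattern are hypothetical.

```python
def consistent_dags(implied_by_dag, observed):
    """Names of the DAGs whose implied patterns equal the observed patterns."""
    return [name for name, implied in implied_by_dag.items()
            if implied == observed]

def mismatches(implied, observed):
    """(step, variable) pairs where implied and observed patterns disagree."""
    return [(step, v)
            for step in ("add", "minus")
            for v in implied[step]
            if implied[step][v] != observed[step][v]]

# True means a meaningful change is expected (or observed) for that variable.
implied_by_dag = {
    "working_dag": {
        "add": {"C2": False, "C4": True, "C5": True},
        "minus": {"C1": True, "C3": True},
    },
    "alternative_dag": {  # hypothetical prior DAG implying a change for C2
        "add": {"C2": True, "C4": True, "C5": True},
        "minus": {"C1": True, "C3": True},
    },
}

# Hypothetical observed pattern: adding C2 also changes the estimate.
observed = {
    "add": {"C2": True, "C4": True, "C5": True},
    "minus": {"C1": True, "C3": True},
}

print(consistent_dags(implied_by_dag, observed))
print(mismatches(implied_by_dag["working_dag"], observed))
```

A strict equality check like this ignores incidental cancellations in the minus-one pattern, so a mismatch there should prompt review rather than automatic rejection of a DAG.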

**Directed acyclic graph showing putative relationships between variables A, Y, C1, C2, C3, C4, and C5.**

Importantly, DAGs will commonly have more than one minimally sufficient adjustment set. In this case, the researcher should also compare the effects estimated on each minimally sufficient set in steps 8 and 9 above. These adjusted effect estimates should not differ, meaning that any observed differences can help distinguish between the different working DAGs in these steps.

Defining a meaningful change

A key decision is defining the change in the estimate sufficient to warrant reviewing the DAG. The first issue here is the size of the change. For this, a researcher could choose to follow (and defend) the commonly used threshold of a 10% relative difference in the starting estimate

The second issue here is variability in the change in estimate because of sampling error or other problems such as unstable models. In this case, a researcher may inappropriately revise (or not revise) a prior DAG because the observed patterns have failed to align with the patterns in the source population by chance. We note, however, that this is also true of the change-in-estimate procedure as currently practised, since it uses only the change in the point estimate to guide covariable selection.

To incorporate variability into the proposed approach, we suggest estimating the expected proportion of times the add-one and minus-one patterns would lead to a revision of the DAG under resampling and using this information in a sensitivity analysis. This can be done by bootstrap, calculating the proportion of resampled estimates lying beyond the meaningful change threshold for each variable during the add-one and minus-one steps. The researcher should report these proportions for the prior working and final DAGs. We also suggest undertaking a sensitivity analysis by revising the prior working DAG considering only variables with >50% of resampled add-one changes outside the meaningful threshold as showing meaningful changes. Although this will mean presenting several final DAGs, it has the merit of communicating uncertainty in the assumptions used for the final models. In contrast, for the minus-one step we suggest only reporting the proportion of resampled estimates without undertaking the sensitivity analysis for the reasons outlined in the Discussion.
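A minimal sketch of the bootstrap proportion follows, using simulated data in which a single variable C confounds A→Y. It assumes one reading of the procedure: the meaningful-change band is fixed at ±10% around the original-sample estimate adjusted on S, and each resample contributes the estimate from the modified adjustment set. The data-generating values and the helper names `simulate`, `standardized_rd`, and `proportion_beyond` are hypothetical.

```python
import random

def simulate(n, rng):
    """Simulate a hypothetical cohort in which C confounds A -> Y."""
    rows = []
    for _ in range(n):
        c = int(rng.random() < 0.5)
        a = int(rng.random() < 0.1 + 0.8 * c)
        y = int(rng.random() < 0.1 + 0.2 * a + 0.5 * c)
        rows.append({"C": c, "A": a, "Y": y})
    return rows

def standardized_rd(rows, adjust):
    """Risk difference for A on Y, standardized over strata of `adjust`."""
    strata = {}
    for r in rows:
        key = tuple(r[v] for v in adjust)
        cell = strata.setdefault(key, {0: [0, 0], 1: [0, 0]})  # a -> [n, events]
        cell[r["A"]][0] += 1
        cell[r["A"]][1] += r["Y"]
    rd = 0.0
    for cell in strata.values():
        (n0, y0), (n1, y1) = cell[0], cell[1]
        if n0 == 0 or n1 == 0:
            continue  # simplification: skip strata lacking one exposure group
        rd += (y1 / n1 - y0 / n0) * (n0 + n1) / len(rows)
    return rd

def proportion_beyond(rows, base_set, alt_set, n_boot=200, rel=0.10, seed=1):
    """Share of resampled alt_set estimates outside the band around base."""
    rng = random.Random(seed)
    base = standardized_rd(rows, base_set)
    lo, hi = base - rel * abs(base), base + rel * abs(base)
    outside = 0
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        est = standardized_rd(sample, alt_set)
        outside += not (lo <= est <= hi)
    return outside / n_boot

rows = simulate(2000, random.Random(0))
p_minus_C = proportion_beyond(rows, ["C"], [])  # minus-one step for C
print(p_minus_C)
```

With a fixed band, sampling variability alone can push resampled estimates outside the threshold, which is one reason the choice of cut-off deserves the scrutiny discussed next.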

There are two important caveats here. First, the proposed 50% cut-off for the add-one changes is arbitrary and further studies should explore the performance of different cut-off values. Second, inflated variance estimates because of unstable regression models (e.g. small sample size, collinearity) would also lead to a high estimated variability of the changes, highlighting the importance of routine model checking in the approach.

Reviewing the DAG

An important issue in reviewing the working DAG (steps 7 to 10 above) is that, as numerous DAGs can be constructed around the same variables, there is a risk of revision

Results

We now run through a theoretical example to illustrate the approach before presenting an empirical example from clinical epidemiology.

Confounding, mediation, collision

Take the (as yet unknown) best-working DAG in Figure
_{1}}. The implied add-one pattern for Figure
_{1}} is a change for C_{4} and C_{5} and no change for C_{2} or C_{3}; the implied minus-one pattern is a change for C_{1}. He or she estimates the A→Y effect adjusted on {C_{1}} and the add-one and minus-one patterns. Graphing this (step 5 above) gives a pattern as in Figure
_{4} and C_{5} and for removing C_{1} are consistent with Figure
_{2} and C_{3} are not consistent with Figure

**(Figures containing DAGs as Powerpoint slides).**

**Directed acyclic graph showing alternative putative relationships between variables A, Y, C1, C2, C3, C4, and C5.**

**Directed acyclic graph showing one set of alternative putative relationships between variables A, Y, C1, C2, C3, C4, and C5.**

**Directed acyclic graph showing another set of alternative putative relationships between variables A, Y, C1, C2, C3, C4, and C5.**

During preparation of the prior DAGs, our researcher flagged the possible confounding pathways in Figures
_{2} as a collider in Figure
_{1} only, namely add-one changes for C_{2}, C_{3}, C_{4}, and C_{5} and minus-one changes for C_{1}. These are consistent with Figure
_{1} only are add-one changes for C_{2}, C_{4}, and C_{5}; no add-one change for C_{3}; and a minus-one change for C_{1}. These do not correspond to those observed in Figure
_{3}). Consequently, the researcher can discount the DAG in Figure

**Add-one and minus-one patterns for a starting adjustment-variable set of {C_{1}} based on DAG in Figure taking the associations in the DAG in Figure as the unknown best working DAG.** The solid horizontal line is the RD estimate adjusted on the putative minimally sufficient set {C_{1}}. The dashed horizontal lines are the pre-defined meaningful change thresholds in the RD estimate. The add-one section shows the RD upon adding each variable listed to the adjustment-variable set in turn. The minus-one section shows the RD upon removing each variable listed from the adjustment-variable set in turn.

The researcher should reapply the above steps to each of Figures
_{1},C_{2},C_{3}}. The implied pattern when adjusting on this set is an add-one change for C_{4} and C_{5} and a minus-one change for C_{1}, C_{2}, and C_{3}. As Figure
_{2} and C_{3}. In contrast, re-running the steps on Figure
_{1},C_{2}} and {C_{1},C_{3}} are minimally sufficient adjustment sets in Figure

Alternatively, the researcher may have pre-identified uncertain mediation paths involving C_{2} and C_{3}, for example a single mediating path (A→C_{2}→C_{3}→Y) or two separate mediating paths (A→C_{2}→Y and A→C_{3}→Y) (not shown but easily constructed by replacing A←C_{2} with A→C_{2} in Figures
_{3} by A→C_{3} in Figure

Measurement error

Measurement error can also cause an estimate to change when adding or deleting variables to or from the adjustment set, even though this would not be the case had the variables been measured perfectly. To see why, consider Figure
_{2} and C_{3}. Following
_{C} as representing all factors affecting measurement of C. Adjusting on C_{2}* only partially blocks A←C_{2}→C_{3}→Y at C_{2}; similarly, adjusting on C_{3}* only partially blocks this pathway at C_{3}; consequently the estimate adjusted on {C_{1},C_{2}*} will not equal that adjusted on {C_{1},C_{2}*,C_{3}*} even though they would have been the same if we could have adjusted on {C_{1},C_{2}} and {C_{1},C_{2},C_{3}}.
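This effect of measurement error can be checked on an exact toy distribution. The sketch below is a simplified, hypothetical version of the pathway A←C2→C3→Y with C1 omitted; `C2s` and `C3s` stand for the mismeasured C2* and C3*, and the 10% misclassification and all other probabilities are assumptions.

```python
from itertools import product

VARS = ("C2", "C3", "A", "Y", "C2s", "C3s")  # "s" marks a mismeasured version

def bern(p, x):
    """P(X = x) for a binary X with P(X = 1) = p."""
    return p if x == 1 else 1.0 - p

def joint(c2, c3, a, y, c2s, c3s):
    """Joint probability under the hypothetical DAG (all numbers assumed)."""
    return (bern(0.5, c2)
            * bern(0.2 + 0.6 * c2, c3)            # C2 -> C3
            * bern(0.2 + 0.6 * c2, a)             # C2 -> A
            * bern(0.1 + 0.2 * a + 0.5 * c3, y)   # A -> Y and C3 -> Y
            * bern(0.1 + 0.8 * c2, c2s)           # C2 -> C2*, 10% misclassified
            * bern(0.1 + 0.8 * c3, c3s))          # C3 -> C3*, 10% misclassified

TABLE = [(dict(zip(VARS, v)), joint(*v)) for v in product((0, 1), repeat=6)]

def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(p for row, p in TABLE
               if all(row[k] == val for k, val in fixed.items()))

def standardized_rd(adjust):
    """Risk difference for A on Y, standardized over the adjustment set."""
    rd = 0.0
    for values in product((0, 1), repeat=len(adjust)):
        s = dict(zip(adjust, values))
        risk1 = prob(Y=1, A=1, **s) / prob(A=1, **s)
        risk0 = prob(Y=1, A=0, **s) / prob(A=0, **s)
        rd += (risk1 - risk0) * prob(**s)
    return rd

rd_true_c2 = standardized_rd(["C2"])
rd_true_both = standardized_rd(["C2", "C3"])
rd_star_c2 = standardized_rd(["C2s"])
rd_star_both = standardized_rd(["C2s", "C3s"])
print(rd_true_c2, rd_true_both, rd_star_c2, rd_star_both)
```

With perfectly measured variables the two adjusted estimates coincide at the assumed effect of 0.2, whereas with the mismeasured versions both are biased and adding C3* to C2* changes the estimate.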

**Directed acyclic graph showing alternative putative relationships between variables A, Y, C1, C2, C3, C4, and C5 in which C2 and C3 are measured with error (measured variables are C2* and C3* and variables affecting their measurement are UC2 and UC3).**

To see how measurement error fits into the proposed approach, consider the case of Figure
_{2} and C_{3} in Figure
_{1},C_{2}} will give add-one and minus-one patterns as in Figure
_{3} in Figure
_{3} to the {C_{1},C_{2}} adjustment set should not change the estimate. In contrast, this pattern is consistent with the measurement error in Figure
_{1},C_{2}*,C_{3}*}, adjusting on a mismeasured confounder may increase bias under certain conditions
_{1},C_{2}*,C_{3}*} will be bias reducing, arguably common in epidemiological research
_{2}→C_{3}→Y pathway. Regardless of the direction of the bias, the proposed change-in-estimate approach should flag the need to review the associations involving the mismeasured variables in the DAG.

**Add-one and minus-one patterns for a starting adjustment-variable set of {C_{1}, C_{2}} based on DAG in Figure 1, taking the associations in the DAG in Figure 6 as the unknown best working DAG.** Note that the variables listed as C_{2} and C_{3} are actually these variables measured with error, i.e. C_{2}* and C_{3}* in Figure
_{1}}. The dashed horizontal lines are the pre-defined meaningful change thresholds in the RD estimate. The add-one section shows the RD upon adding each variable listed to the adjustment-variable set in turn. The minus-one section shows the RD upon removing each variable listed from the adjustment-variable set in turn.

Bias amplification

Recent work has shown that residual bias can be amplified by adjustment on instrument-like variables

Consider Figure
_{U}→Y in Figure
_{1},C_{2}}, {C_{1},C_{3}}, and {C_{1},C_{2},C_{3}} should not differ. However, with residual confounding (Figure
_{2} and C_{3} have different “instrument strengths” (i.e. relative to C_{3}, C_{2} is more strongly associated with the exposure A) and so amplify the residual bias differently
_{1},C_{2}} (based on Figure
_{1},C_{3}}, as C_{3} should be a weaker instrument than C_{2}, but also to present the estimate adjusted on {C_{1},C_{2}} and {C_{1},C_{2},C_{3}}.
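Bias amplification itself can be seen in an exact toy calculation. The sketch below uses a deliberately reduced, hypothetical DAG with one unmeasured confounder U and one instrument-like variable Z (Z→A only); it does not reproduce the figure, and all probabilities are assumptions.

```python
from itertools import product

VARS = ("U", "Z", "A", "Y")

def bern(p, x):
    """P(X = x) for a binary X with P(X = 1) = p."""
    return p if x == 1 else 1.0 - p

def joint(u, z, a, y):
    """Joint probability under the hypothetical DAG (all numbers assumed)."""
    return (bern(0.5, u)                          # U: unmeasured confounder
            * bern(0.5, z)                        # Z: instrument-like variable
            * bern(0.1 + 0.2 * u + 0.6 * z, a)    # U -> A and Z -> A
            * bern(0.1 + 0.2 * a + 0.5 * u, y))   # A -> Y and U -> Y

TABLE = [(dict(zip(VARS, v)), joint(*v)) for v in product((0, 1), repeat=4)]

def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(p for row, p in TABLE
               if all(row[k] == val for k, val in fixed.items()))

def standardized_rd(adjust):
    """Risk difference for A on Y, standardized over the adjustment set."""
    rd = 0.0
    for values in product((0, 1), repeat=len(adjust)):
        s = dict(zip(adjust, values))
        risk1 = prob(Y=1, A=1, **s) / prob(A=1, **s)
        risk0 = prob(Y=1, A=0, **s) / prob(A=0, **s)
        rd += (risk1 - risk0) * prob(**s)
    return rd

TRUE_RD = 0.2                        # effect of A on Y built into joint()
rd_unadjusted = standardized_rd([])  # U cannot be adjusted for
rd_adjusted_z = standardized_rd(["Z"])
print(rd_unadjusted - TRUE_RD, rd_adjusted_z - TRUE_RD)
```

Here the residual bias from U is 0.10 without adjustment and rises to about 0.156 after adjusting on Z, so an add-one change appears even though Z is not a confounder.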

**Directed acyclic graph showing alternative putative relationships between variables A, Y, C1, C2, C3, C4, C5, and an unmeasured variable Z_{U}.**

Presenting more than one final DAG

In many instances, the researcher will need to present more than one final DAG with implied add-one and minus-one patterns consistent with the patterns observed. Sometimes the adjusted estimate will be the same because the DAGs imply the same minimally sufficient adjustment set. An example is removing the C_{5}→Y arrow and adding a C_{5}←C_{3} arrow in Figure
_{1}} and so the adjusted effect estimate will be the same. However, in some cases the minimally sufficient adjustment sets will be different, so that an estimate for each DAG will need to be presented. One example of this involves the confounding vs. mediating pathways mentioned above, if both types of relationship were identified as plausible during the preparation of the prior DAGs (e.g. the DAG in Figure
_{2}→Y with A→C_{2}→Y in Figure

Empirical example

We now consider an empirical example to illustrate the approach. We compare mortality 5 years after peritoneal-dialysis (PD) initiation amongst patients with polycystic kidney disease (PKD) versus other nephropathies, using data from the French Language Peritoneal Dialysis Registry (RDPLF) (details in Additional file

The DAG in Figure

**Directed acyclic graph showing prior assumptions about relationships between variables in the empirical example.**

**Directed acyclic graph showing prior uncertainty about variable relationships in the empirical example (absence of Type of Assistance -> Death arrow).**

**Directed acyclic graph showing prior uncertainty about variable relationships in the empirical example (absence of Sex -> Type of Assistance).**

**Directed acyclic graph showing prior uncertainty about variable relationships in the empirical example (showing Comorbidity index and Type of assistance as proxy variables for Major concurrent illnesses and Frailty, respectively).**

There is only one minimally sufficient adjustment set in the prior DAG (Figure

**Add-one and minus-one patterns for an adjustment-variable set of {, } based on DAG in Figure**

For Figure

| Add-one variables | % | Minus-one variables | % |
| --- | --- | --- | --- |
| Sex | 28.4% | Age | 95.3% |
| Type of peritoneal dialysis | 37.1% | Comorbidity index | 98.6% |
| | | Type of assistance | 99.6% |

For Figure

| Add-one variables | % | Minus-one variables | % |
| --- | --- | --- | --- |
| Type of peritoneal dialysis | 15.2% | Age | 38.3% |
| Comorbidity index | 58.8% | Sex | 75.9% |
| | | Type of assistance | 100.0% |

We therefore need to review the DAG, focusing on

Now using Figure

**Add-one and minus-one patterns for an adjustment-variable set of {Age**,

As an aside, Figures

Discussion

We have presented an approach to selecting adjustment variables which combines prior knowledge expressed in a DAG with results from analysis of the data. The approach is pragmatic in that it focuses only on the effect of interest (also emphasized by others

The approach depends on recent theoretical work on c- (confounding-) equivalence

To our knowledge, only one other article in the epidemiology literature to date has looked at adjustment variable selection by explicitly combining DAGs and a statistical selection procedure

The proposed approach has some potential advantages over other variable-selection methods. It can reduce the “black-box” nature of using the p-value or the change-in-estimate alone to select variables, as it lays out the rationale for adjustment-variable choice graphically. It will also frequently lead to a more parsimonious model than selection based on p-values since it chooses variables by relevance to the exposure-outcome association, rather than the association with the outcome alone. The approach also extends background-knowledge methods by checking starting assumptions against the data and requiring researchers to justify mismatches or adapt assumptions appropriately. The approach complements the recently proposed method of adjusting on all assumed parents of exposure and outcome

An important point concerns the possibility of incidental cancellations and small effects. Finding a meaningful difference in the add-one pattern for a variable

A potential criticism of the approach is that it does not eliminate background knowledge from adjustment-variable selection. Indeed, the examples include instances of needing background knowledge to distinguish between DAGs giving the same add-one and minus-one patterns (e.g. confounding- vs. mediating-pathway examples, measurement-error vs. bias-amplification examples). It is well known that different DAGs can imply the same statistical relationships

Another potential criticism is that the approach only addresses variable relationships relevant to the effect of interest, remaining agnostic about other regions of the DAG. This aims to focus on the research question at hand and to minimize the risk of “getting lost” in trying to explore all possible associations in the DAG, many of which do not directly impact on the selected exposure-outcome estimate. A researcher wishing to explore the full DAG could apply a DAG-discovery algorithm (e.g. the PC, GES, or FCI algorithms; see the TETRAD project’s website and

We wish to highlight several additional limitations of the proposed approach. Like the change-in-estimate procedure, the approach is

Several extensions to the approach are possible, should it appeal to epidemiologists working on applied questions. One is how best to address sampling variability in the patterns, for example by comparing the performance of different rules based on the proportion of bootstrap samples falling outside the meaningful-change threshold. Another potential extension concerns precision in choosing the adjustment set. We note that a researcher may wish to adjust on additional variables to improve precision

Conclusions

In summary, we have proposed a novel approach to adjustment-variable selection in epidemiology which combines existing knowledge-based and statistics-based methods. It requires a researcher to present background-knowledge assumptions in a DAG, to compare these against patterns in the data, and to review assumptions accordingly. It also ensures clear communication of assumptions and uncertainties to other researchers and readers in a standardized graphical format. As the approach requires background knowledge, it is probably best suited to areas such as clinical epidemiology where researchers know quite a lot about

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

DE, BC, and AF conceived the idea through their interests in confounder selection and directed acyclic graphs. CV and TL were responsible for the peritoneal dialysis data and contributed to the development and interpretation of the empirical example. DE did the analyses and drafted the manuscript. All authors critically reviewed the drafts and approved the final version.
