Abstract
Background
Largescale public health interventions with rapid scaleup are increasingly being implemented worldwide. Such implementation allows for a large target population to be reached in a short period of time. But when the time comes to investigate the effectiveness of these interventions, the rapid scaleup creates several methodological challenges, such as the lack of baseline data and the absence of control groups. One example of such an intervention is Avahan, the India HIV/AIDS initiative of the Bill & Melinda Gates Foundation. One question of interest is the effect of Avahan on condom use by female sex workers with their clients. By retrospectively reconstructing condom use and sex work history from survey data, it is possible to estimate how condom use rates evolve over time. However formal inference about how this rate changes at a given point in calendar time remains challenging.
Methods
We propose a new statistical procedure based on a mixture of binomial regression and Cox regression. We compare this new method to an existing approach based on generalized estimating equations through simulations and application to Indian data.
Results
Both methods are unbiased, but the proposed method is more powerful than the existing method, especially when initial condom use is high. When applied to the Indian data, the new method mostly agrees with the existing method, but seems to have corrected some implausible results of the latter in a few districts. We also show how the new method can be used to analyze the data of all districts combined.
Conclusions
The use of both methods can be recommended for exploratory data analysis. However for formal statistical inference, the new method has better power.
Keywords:
Avahan; Censoring; Changepoint; Consistent condom use; Female sex workers; Generalized estimating equations; Length bias; Logistic regression; Mixture model; Timedependent covariateBackground
Assessing the impact of largescale interventions is essential in the context of scarce resources [1]. For instance, Ng et al. [2] investigate impact of Avahan, the HIV/AIDS initiative of the Bill & Melinda Gates Foundation in India [3,4], on HIV prevalence in pregnant women and they cite examples of assessments of other largescale interventions [57]. Due to ethical concerns and other constraints, scaleup is often rapid and simultaneous, so there is no randomization or control group [8]. Furthermore, in populations targeted by such largescale programs it is often the case that no extensive or quality baseline data on population health prior to the start of the intervention are available. To add to the difficulty, it is often not possible to have longitudinal or prospective followup for a period that is long enough to provide sufficient data in these target populations, especially in the field of HIV: either the incidence of disease is too low to allow sufficient statistical power if the general population is studied or the level of loss to followup is high when studying highrisk populations.
Nevertheless, the impact of these programs must be assessed and the assessment of intermediate outcomes is essential as a prerequisite for understanding the mechanism through which the interventions may have an impact. One strategy proposed by Lowndes and collaborators [9] consists of a retrospective construction of the baseline information using survey data obtained during the course of the Avahan intervention. Specifically they asked female sex workers (FSWs) if they consistently use condoms with all of their occasional clients and if so, since when. They also asked the FSWs when they started their career as sex worker. They showed how this approach enabled them to simply, yet accurately, reconstruct time trends of consistent condom use (CCU) in the target population. Testing for a difference in CCU pre and postintervention from such reconstructed data, however, proved to be more difficult. An approach based on binomial regression with a linear link function and generalized estimating equations (GEE) yielded results that were somewhat questionable for districts where CCU was already high in the preintervention period, even suggesting that prevention may have had a significant negative impact on CCU in these districts.
Our objective in this paper is to propose an alternative to the GEEbased method to make inferences about pre and postintervention differences on the basis of such reconstructed data. Though we apply the method to the analysis of condom use in FSWs, the proposed method is a combination of standard analysis tools and is applicable to any context where one seeks to make inferences about the effect of an intervention on the rate of occurrence of a given outcome when baseline data are not available but can be assessed retrospectively using a questionnaire.
In terms of broad statistical strategy, the idea is to use the fact that the data observed are timestoevents and that the goal of the analysis is to make inferences about a changepoint in the rate of occurrence of one of these events. Recently, Minard and collaborators [10] compared an approach based on logistic regression with GEE to a Markov modeling approach to test whether an intervention program had a significant effect on smoking cessation behavior. They observed that the GEE approach lacked power to detect timecovariate interactions. As has been noticed in other contexts [11,12], this suggests that methods based on direct timetoevent modeling tend to make more efficient use of the information contained in this type of dataset and should be at the core of any efficient methodological approach. The Markov model proposed by Minard et al. cannot, however, be used in our context. Indeed, their model was built under the assumption of stationarity, i.e., that the intensities of transition from one state (e.g., smoker) to another (e.g., nonsmoker) are constant in time. Because our primary interest is to make inferences about a change in a rate of occurrence after a specified date, stationarity is obviously not an option that we can consider. An additional difficulty that we will see is that in the FSW data, an important proportion of the individuals have an event time exactly equal to zero (i.e., they consistently used the condom at the beginning of their sex work career). This implies that the model to be used must have a probability mass at time zero, along the lines of the zeroinflated Cox model used by Grouwels and Braekers [13] in their study of ethanolinduced anesthesia. Our context is simpler than theirs, however, because we do not have to deal with leftcensored observations, which enables us to fit their model with standard binomial and Cox regression software.
The paper is organized as follows. In the methods section we describe the type of datasets under consideration, quickly review the GEE method, describe our proposed Coxbinomial approach and give the details of the simulation study and real data application through which the methods will be compared. The results of these comparisons are summarized in the results section. A discussion of the strengths and weaknesses of the proposal along with ideas for further research are outlined in the discussion section. An appendix with technical details as well as a file with the R code required to implement the new method are also provided.
Methods
Study design
The method that we propose can be applied to crosssectional studies where participants are randomly recruited in a specified target population at a specific point τ in calendar time (or over a relatively short period of time about τ). For instance in the condom use study, women involved in sex work at welldefined locations were sampled in 2006 (with the exception of a few districts where the sampling was done in 2007 or 2008). Even though we do not consider it in this paper, survey weights can presumably be added to the methods to handle data obtained with sampling schemes other than simple random sampling. Each participant recruited is asked to provide two dates: (i) the date at which she became part of the target population and (ii) the date at which some permanent status change took place. In the condom use study, each woman was asked (i) at what date she started her career as a sex worker and (ii) if she consistently uses the condom with her occasional clients and if so, since when. The objective is to make inferences about differences in how permanent status change takes place before and after a given calendar time, say t_{0}, on the basis of such data. In the condom use study, we are interested in the detection of changes in how FSWs start CCU before and after 112004, date at which the Avahan program started its prevention intervention [3]. Because of the high level of heterogeneity of the HIV epidemic in India [14] the nature of the intervention varies from district to district and the evaluation was planned at the districtlevel [4,15]. Consequently, separate inferences for each district are desired. But separate inferences are not usually the norm; we also perform a unified analysis. To avoid repetition, in the sequel we present the methods in terms of the condom use study, even though they are applicable in a broader context.
Inference based on generalized estimating equations
Getting crude estimates of the rate of CCU at any given calendar time t, say R_{t} = U_{t}/N_{t}, is fairly straightforward with the data that are available. As illustrated in Table one of Lowndes et al., the denominator N_{t} of the rate is the number of women in the survey who begun sex work before time t while the numerator U_{t} is the number, among the N_{t} women who begun sex work before t, who started CCU before t. As long as the intervention does not have an impact on the length of the career as FSW (more on this in the Discussion), then these crude estimates of the rate of condom use should be unbiased.
Under the assumption that R_{t} is steadily increasing with t, the question of interest is whether this rate of increase changes after calendar time t_{0}. A first visual assessment can be done by plotting R_{t} as a function of t for several values of t and seeing whether the slope of the relationship changes after t_{0}. Lowndes et al. formally tested whether the difference in the average slope was different before and after 112004 by fitting a binomial regression model with linear link, with a subject being a FSW, the repeated response being whether the FSW consistently uses the condom in a given calendar year and the covariates being indicators corresponding to the calendar years and testing the appropriate contrast for significance.
There are two main drawbacks with this GEEbased method. First, each FSW is not observed at every given time t at which the rate R_{t} is calculated. Hence treating the condom use statuses at each t as a cluster of longitudinal observations may not be appropriate. But most importantly, the method is very likely to conclude that the slope after time t_{0} is not different from, or maybe even significantly lower than, the slope before time t_{0} when the rates before t_{0} are already high. This is easily seen with an example: if the condom use rate goes from 70% to 90% from 2001 to 2003, then it would have to go from 90% to more than 110% from 2004 to 2006 in order for the average slope to be steeper after 112004 than before 112004, which is of course impossible. Hence this test of change in average slope cannot possibly assess whether CCU increases after a given calendar time if the rate of CCU is already high before that calendar time. Moreover, as discussed previously, the literature suggests that an approach based on timetoevent modeling is likely to be more powerful.
Inference based on a mixture Cox model with timedependent covariate
Our objective is to propose a method based on timetoevent modeling that, unlike the GEE approach, (i) does not lose its efficiency when condom use is already high before t_{0} and that (ii) only uses the two dates provided by the FSWs (time at start of career as FSW and time at start of CCU) as data inputs.
Assumption on career as female sex worker
Our strategy is based on the following assumption in terms of the chronology of CCU. We suppose that when they begin their career, FSWs either immediately start CCU, either they do not. If they do not, then they might eventually start CCU at some time before the end of their career, or they might not. Mathematically, let D_{i} denote the calendar time at which woman i begins her career as FSW. Let Y_{i} = 1 if she consistently uses the condom when she starts working as a FSW and let Y_{i} = 0 otherwise. Let C_{i} denote the amount of time during which woman i works as a FSW before starting CCU. Note that when Y_{i} = 1 woman i consistently uses the condom at the start of her career and thus C_{i} = 0. On the other hand, when Y_{i} = 0 woman i does not use the condom as soon as she starts her career and therefore C_{i} > 0. If we denote by F_{i} the length of the career of woman i as a FSW, then if C_{i} < F_{i}, woman i will start to consistently use the condom before the end of her career. Conversely, if C_{i} > F_{i}, then woman i will not use the condom consistently during her career. This notation is illustrated in Figure 1.
Figure 1. Illustration of the notation. Thick segments: four careers as FSW, with the portion of career without condom use in red and the portion of career with condom use in black. The lengths of the double arrows are the values of the variables , , , , when applicable. Vertical dashed lines are drawn at the time of sampling (τ) and at the time of the intervention ().
Observed values of the variables
Given the study design (see Figure 1), woman i may only be included in the study if she started her career before τ (D_{i} < τ) and if her career does not end before τ (D_{i} + F_{i} > τ); this is the case for women B, C and D in Figure 1. Clearly, we can deduce the values of D_{i} and Y_{i} from the information provided by the FSWs taking part in the study. However we only get a lower bound on F_{i} (all we know is that F_{i} > τ – D_{i}) and therefore its value is rightcensored for every FSW in the study, i.e., 100% of the observations of F_{i} are rightcensored. As for C_{i}, we know its exact value if FSW i has started CCU before τ (women B and C in Figure 1), but we only know that it is greater than τ – D_{i} (rightcensored) if she does not use the condom consistently at the time of data collection τ (woman D in Figure 1).
Model and method
Our main interest is in the distribution of C_{i}. We will suppose that the C_{i} of different FSWs are independent. Because of the probability mass at C_{i} = 0 (FSWs who started CCU at the beginning of their career), we use a mixture model [13]. We suppose that C_{i} = 0 (i.e., that Y_{i} = 1) with probability π_{i} and that C_{i} > 0 (i.e., that Y_{i} = 0) with probability 1π_{i}. Then for the conditional distribution of C_{i} given C_{i} > 0, we can use any standard survival model.
Because the objective of the study is to assess changes in CCU after a given calendar time t_{0}, we must model π_{i} and the distribution of C_{i} when C_{i} > 0 as functions of t_{0}; of course, other covariates can be included in the model and we give the details of the general model in Appendix A. There are several modeling options for π_{i}. In our case, because we are mainly interested in comparing how FSWs start CCU before and after the intervention, we use a simple binomial regression model. Let x_{i} = 0 if FSW i started her career before t_{0} (e.g., woman C in Figure 1) and x_{i} = 1 if FSW i started her career after t_{0} (e.g., women B and D in Figure 1). Then we can make inferences about changes in π_{i} after t_{0} by performing the binomial regression with Y_{i} as outcome and x_{i} as covariate:
where g() is a known link function. The most common link is the logit function, but if the event to be modeled has high prevalence, then using a log link in Eq. 1 will yield regression coefficients that can be interpreted as log prevalence ratios, which is usually preferable to log odds ratios. Other links such as the probit or complementary loglog can also be used instead of the logit link. If desired, the effect of time can be modeled in a more complex fashion by changing the righthand side of Eq. 1, for example by changing the step function with a cut point at t_{0} to a broken line with a cut point at t_{0}. If other covariates are available, they can also be included on the RHS of Eq. 1.
As for the distribution of C_{i} when C_{i} > 0, define the timedependent covariate z_{i}(t) that takes on value 0 when t < t_{0} and value 1 when t ≥ t_{0}; once again, other covariates (or stratification variables) can be used and a more general formulation of the model is given in the Appendix. Then a change in the risk of starting to consistently use the condom at time t_{0} can be estimated by fitting a Cox model to the dataset comprised of FSWs for whom C_{i} > 0, with C_{i} as response and z_{i}(t) as timedependent covariate:
Under the assumption (which we discuss below) that C_{i} and F_{i} are independent, we show in Appendix A that the two models can be fitted independently of each other with standard binomial and Cox regression software. Effect of t_{0} on each model can be assessed separately, or the likelihood ratio (or Wald) statistics to test whether x_{i} and z_{i}(t) are significant in their respective models (i.e., β_{1} = 0 in Eq. 1 and γ = 0 in Eq. 2) can be added together. Under the null hypothesis of no change in CCU at t_{0}, this sum of two likelihood ratio (or Wald) statistics approximately follows a chisquared distribution with two degreesoffreedom.
The Coxbinomial method described above can also be used on clustered data. As we show in the Appendix, all one has to do is fit the two models (Cox and binomial) with methods that will yield robust “sandwich” variance estimates of the parameter estimators and then use a Wald test on two degrees of freedom instead of a likelihood ratio test.
As a referee suggested, this Coxbinomial model can be approximated by a single Cox model by imputing a very small positive time to the observations that have the event at time zero. This modeling approach is easier to implement and can be more easily generalized, as is the case when one wishes to include random effects, for instance. However we do not advocate the use of this approximation when the individuals who have an event time equal to zero represent a nonnegligible proportion of the sample, as will be the case in our data application.
Comparison of the methods by simulation
We ran some simulations to compare the properties of the two methods. The simulation algorithm and the parameter values were chosen so that the simulated samples were roughly similar to the samples in the real dataset to be discussed below. In all runs, we simulated a population of 800 women. Their dates of birth were randomly generated according to a normal distribution with mean 1973 and standard deviation 7. The age at which they began sex work was simulated from a normal distribution with mean 22 and standard deviation 4. The duration of their career as FSWs (F_{i}) followed a Weibull distribution with shape 3 and scale 40. Condom use was simulated according to the Coxbinomial model, whereby each woman started to consistently use the condom at the beginning of her career with probability π_{1} if she started sex work before 112004 (time t_{0}) and with probability π_{2} if she started sex work after 112004. For women who did not consistently use the condom at the beginning of their career, we simulated the duration of sex work until CCU (C_{i}) from a piecewise exponential distribution with rate h_{1} before 112004 and rate h_{2} after 112004. The specific sets of values of π_{1}, π_{2}, h_{1} and h_{2} used in the simulations are given in Tables 1 and 2. We set 112009 as the time of data collection (time τ), which generated samples of between 350 and 450 FSWs contributing to the estimation of the binomial regression parameters and to the GEE method, and between 70 and 290 FSWs contributing to the estimation of the Cox model parameter, which is quite comparable to the sample sizes observed in the real data analyses reported in Table 3. Tests of the hypothesis of no change in CCU after 112004 were carried at the 5% significance level.
Table 1. Values of the parameters associated with condom use in the simulation study
Table 2. Proportion of the 1,000 simulated samples for which the null hypothesis of no change in consistent condom use was rejected at the 0.05 level
Table 3. Application of the Coxbinomial and GEE methods to the data on condom use by FSWs with their occasional clients for 21 districts in India
Results
Simulation study
As can be seen from Table 2, when condom use does not change in time, both methods yield fairly unbiased inferences, with rejection levels that are never significantly different from 0.05 with the Coxbinomial approach, and only once barely superior to the rejection limit of 0.064 for each of GEE with linear and logit links. Unfortunately, when initial condom use was high, the GEE method with logit link failed to converge. The power of both methods is very good when an increase in CCU after 112004 follows a low CCU before 112004. However when CCU before 112004 is medium, the GEE method loses some of its power to detect an increase in CCU after 112004, most especially with the linear link. On the other hand, the Coxbinomial approach is able to maintain its very high power even when initial CCU is medium.
Application to the Indian data
GEE method with linear link
We reanalyzed without sampling weights (for comparability with the Coxbinomial method; results remain very similar to those published in [9]) the data on occasional clients collected from integrated behavioural and biological assessment (IBBA) [16,17] considered by Lowndes et al. and give the results in Table 3. The overall sample size for the 21 districts was 7140 FSWs who all contributed to the binomial regression model and the GEE analysis. Of these FSWs, 4668 did not start CCU at the beginning of their career and contributed to the Cox model analysis. A gross summary of the results of Lowndes et al. is that the estimated proportion of CCU in FSWs increased from 2001 to 2006. However the estimated average yearly rate of increase was lower after 112004 than before that date for five of the 21 districts considered, (significantly so for two of these districts, Pune brothelbased and Thane brothelbased), suggesting a negative effect of the intervention in these districts where the proportion of CCU was already high; this is barely plausible.
Reanalysis with the Coxbinomial method
Results of the analysis of the same data with the Coxbinomial approach are summarized in Table 3. Although we fitted the semiparametric Cox model described in Eq. 2 that did not assume piecewise constant hazards, we reported the mean condom acquisition rate pre and postintervention in each district as and , respectively, in Table 3 for descriptive purposes. Though the findings with the Coxbinomial by and large agree with those obtained with the GEE method, there are a few interesting differences. First, all estimates obtained with the Coxbinomial approach suggest an increase in CCU (be it CCU at the start of the career as FSW or acquisition of CCU during the course of the career as FSW for those who did not consistently use the condom at the beginning of their career) after 112004. Moreover, except for Mumbai brothelbased and Thane brothelbased, all of these increases are statistically significant at the 5% level. In all instances, the pvalue of the combined and binomial models lead to the same conclusion, while for a few districts the pvalue of the Cox model is not significant when the pvalue of the combined model is. The sample sizes in the two districts in which the binomial regressions are not significant are not the ones with smaller samples. There are 6 districts for which the pvalue associated to the coefficient of the Cox model is not significant, and for 4 of these 6 districts, the sample size is small. As could be expected from theoretical arguments and the results of the simulation study, differences between the results obtained with the two approaches happen in districts where CCU is already high before 112004 (i.e., and high).
We also tried the single Cox model approximation by fitting the Cox model above to the data from all FSWs where the event time for FSWs starting CCU at career start set equal to 1 day. Since we have many districts where initial CCU is high, the method leads to questionable results in some districts. The full results are reported in Additional file 1 available on the journal’s website.
Additional file 1. Table with additional analysis results. This is a PDF file that contains a table that summarizes the results of the analysis of the Indian data with a single Cox model and where FSWs who start CCU at the beginning of their career are given an event time equal to 1 day.
Format: PDF Size: 343KB Download file
This file can be viewed with: Adobe Acrobat Reader
Analysis with the Coxbinomial model combining all districts and with additional covariates
To have a populationaveraged effect of the intervention, we combined the data from all the districts and redid the Coxbinomial analysis. For the binomial part of the model, we used GEE with an independence working assumption and a logit link; attempts with a log link failed to converge. For the Cox part, we used the same model as for the districtwise analyses, but fitted the model using the marginal approach (see [18], chapter 8). We first performed this combined analysis with the same covariates as the districtwise analyses, then repeated it with the addition of the year at which the FSW started sex work in the binomial model and calendar year as a timevarying covariate in the Cox model. The results obtained are summarized in Table 4. For these data, 7,140 FSWs contributed to the estimation of the prevalence of condom use at the beginning of sex work, with estimates of 0.279 before the intervention and 0.586 after. Not surprisingly, in the Coxbinomial model with no other covariate this difference is highly significant (pvalue < 0.0001). However when the year of beginning of sex work is entered in the model, the effect of the intervention ceases to be significant. As for incidence of CCU acquisition during sex work, 4,668 contributed to the inferences and we observed a rate of 0.049 before the intervention and 0.236 after. Again, this difference is highly significant in the Cox part of the Coxbinomial analysis without any other covariate (pvalue < 0.0001). With calendar year as a timevarying covariate in the model, the effect of the intervention remains positive and significant with a pvalue of 0.0026. When we compute the Wald statistic described in the appendix we obtain a value of 11.41, which yields a pvalue of 0.0033, and thus we reject the null hypothesis that the intervention has no effect.
Table 4. Application of the Coxbinomial approach for clustered data described in the appendix to the Indian data combining all 21 districts
Discussion
Summary
We proposed a new method to test whether a significant change occurred at a given point in calendar time t_{0} in how individuals in a target population experience a permanent status change, motivated by a study of how the acquisition of CCU by FSWs in India changed after the beginning of an intervention that began on 112004. The new method is based on a Cox model with a timedependent covariate to which a probability mass at time zero is added; standard binomial regression is used to model this probability mass. We showed how this new approach can easily be implemented with standard software that fit the binomial regression and Cox models and how it can be extended to deal with clustered data. We contrasted the new method with an existing approach that compares the average slope of the status change rate before and after t_{0} using GEE. We found that inferences based on the two methods were quite comparable when the prevalence of status change was low before t_{0}, but that the new Coxbinomial approach was more powerful to detect changes when prevalence of status change before t_{0} increased. When we applied the new method to the data on CCU in FSWs in India analyzed by Lowndes et al., we obtained that CCU after 112004 increased in all 21 districts considered, significantly so in all but two of them. In contrast, with the GEE approach only 10 significant increases were found and five decreases were observed, two of the latter being significant decreases. When applied simultaneously to the data from all 21 districts and when correcting for calendar time, the method suggests that increase in CCU at career start is mostly associated with calendar time while acquisition of CCU in the course of the career as FSW is positively associated to both the intervention and calendar time.
Strengths
The method proposed in this paper has several strengths. First and foremost, it remains powerful at detecting change even when the prevalence of permanent status change is high before t_{0}, which is the main drawback of the GEE method that motivated this work. Not only did this gain in power stand out in the simulation study, but also in the application of the method to the CCU in FSWs study. Second, its application only requires the two dates measured in the data (date at which the individual enters the target population and date at which status change takes place) as data inputs, unlike the GEE method which creates a cluster of observations of arbitrary size for each individual. Third, its implementation in practice is quite simple and can be done using standard software for Cox and binomial regression. In this paper, all analyses were performed with R [19] (and the R code to implement the method is available in “Additional file 2), but they could as easily have been performed with SAS [20] or Stata [21]. Finally, the method is flexible. In this paper we only considered a binomial regression and a Cox model. But the basic strategy remains valid if one uses different models for the probability masses at zero and the survival times. Furthermore, the method can easily be adapted to accommodate clustered data.
Additional file 2. R code for the Coxbinomial method. This is a PDF file from which the R code to implement the Coxbinomial analysis method described in this paper can be copied and pasted into R or into an R program editor.
Format: PDF Size: 113KB Download file
This file can be viewed with: Adobe Acrobat Reader
Limitations
There are some limitations to the method of which the users should be aware. An important technical issue is that all inferences in the GEE method or in the method proposed in this paper rely on an assumption that cannot be tested with the type of dataset described in this study. Indeed, the validity of the inference relies on the independence between the amount of time spent in the target population (random variable F_{i}) and the amount of time spent in the target population before status change takes place (random variable C_{i}). If there is no such independence, then inferences on the distribution of C_{i} cannot be made independently of F_{i} and both the proposed Coxbinomial method and the GEE approach break down. This may be further compounded by the fact that the distribution of F_{i} observed in the sample is not the same as the distribution of F_{i} in the target population due to the length bias phenomenon inherent to any prevalent cohort sampling scheme [22], whereby individuals with longer prevalent periods are more likely to be part of a crosssectional sample.
Another weakness of the method proposed (that is also a weakness of the GEE approach) is that it assumes that changes in the probability of status change is instantaneous at time t_{0}, while in reality the probability may change smoothly between t_{0} and some other time t_{1} > t_{0}, or may be instantaneous but delayed at some other time t_{1} > t_{0}. Furthermore t_{1} may be unknown.
It can also be argued that the scope of the simulation study presented here is somewhat limited, as we did not fit models where the model assumptions were false. But because our objective was to determine whether the Coxbinomial approach overcomes the loss of power of the GEE method when the model assumptions are true, we feel that additional simulations under more general conditions are not necessary at this stage and should be considered in future research.
A key limitation that arises when we apply our method to the study of CCU in FSW in India is that the method that we propose here does not address the potential bias due to the “beforeafter” design of this kind of study. In this particular case, it is very difficult to control for many sources of bias that could cause CCU to evolve in time because no district that was not part of the intervention could be added to the analysis.
Future research
The problem of inference from length biased data has been studied in the statistical literature, for example using nonparametric methods [22] or inverse weighting [23]. However the existing methods mainly consider univariate distributions and data arising from prevalent cohort studies with followup. It would be interesting to see what minimal additional assumptions and/or data would be required to derive estimation methods for the bivariate distribution of C_{i} and F_{i}, or at the very least to derive a test of independence between these two variables. The fact that the distribution of C_{i} has a point mass at zero and a hazard rate that may both have a change point at calendar time t_{0} represents an additional modeling challenge.
As a matter of fact it should be possible to model the manner in which the point mass or the hazard change following the intervention at time t_{0} in a more flexible way. For instance in the case where we suppose the change to take place at an unknown time t_{0}, fitting the proposed Coxbinomial model with several values of t_{0} and keeping the value of t_{0} yielding the largest likelihood ratio test statistic is a sensible avenue. However in this case the pvalue of the tests cannot be approximated by simple chisquared distributions anymore. Determination of such pvalues has been studied for an ordinary Cox model [24], but further work is needed for the Coxbinomial mixture.
Conclusion
Our aim in this paper was to improve the efficiency of the methods used to assess intervention outcomes on the basis of reconstructed data. We proposed a method based on a mixture of a Cox proportional hazards model and a binomial regression model. When applied to an analysis of condom use by female sex workers in India, the method yielded results quite similar to those obtained with a GEEbased approach in districts where initial condom use was low, but yielded results more plausible than those of the GEEapproach in districts where initial condom use was high.
Appendix A
Proof that the model can be fitted using standard binomial and Cox regression software
Independent data
Suppose that the data are collected at calendar time τ. At that time, we observe n independent FSWs. Let the observed data be , i = 1, …, n, where D_{i} is the calendar time at which FSW i started her career, F_{i} is the total duration of the career of woman i as a FSW, C_{i} is the amount of time during which woman i worked as a FSW before starting to consistently use the condom, , , and δ_{i} = 1 if and 0 otherwise. Note that to be included in the study, woman i must be a FSW at time τ, i.e., D_{i} < τ and D_{i} + F_{i} > τ. Let Y_{i} = 1 if and Y_{i} = 0 otherwise. Then for inference about the distribution of C_{i} and F_{i}, the likelihood function will be proportional to (with the slight abuse of notation whereby “Pr” may denote a probability or a density).
If we assume that D_{i}, C_{i} and F_{i} are independent, we can factor the above probabilities as:
where f_{C}_{,}S_{C}_{,} and S_{F} respectively denote the density of C_{i}, the survival function of C_{i} and the survival function of F_{i} . We see that (*) is the likelihood for inference on the distribution of C_{i} based on a rightcensored sample and that (**) is the likelihood for inference on the distribution of F_{i} on the basis of a sample of length biased data. Most importantly, we notice that (**) does not depend on the distribution of C_{i} and therefore inference about this latter distribution using only (*) will be valid and efficient.
Let us now rewrite (*) under the binomial regression with link function g() and Cox regression models given by Eq. 1 and Eq. 2, where this time x_{i} and z_{i}(t) can be covariate vectors:
Term (a) is the contribution to the likelihood of FSWs who consistently use the condom at the start of their career, term (b) is the contribution of FSWs who are observed to start CCU after the beginning of their career and term (c) is the contribution of FSWs who have not yet begun CCU at the time of data collection. By regrouping the terms, we can rewrite L(β, γ) as
We recognize (#) as the likelihood for ordinary logistic regression based on the entire sample. As for (##), it is the full likelihood for inference on the parameters of the survival model for C_{i} with rightcensored data and a timevarying covariate based on the sample comprised of FSWs who do not use the condom at the start of their career. Efficient semiparametric inference about γ can be obtained by replacing (##) with the Cox partial likelihood, which is what most standard survival analysis software do. We therefore get that the overall likelihood is the product of the likelihood for ordinary binomial regression and the likelihood for Cox regression.
Clustered data
Let us assume that the data come from K clusters (e.g., districts). Standard software that implement binomial and Cox regression for clustered data will output consistent estimators of β and γ, along with their respective robust “sandwich” variance estimators. Typically, β is obtained by solving for β an estimating equation of the form and γ is obtained by solving for γ an estimating equation of the form . As for the variance estimators, they are usually given by with and an empirical estimate of , and mutatismutandis for the variance of . Now if we put θ = (β ', γ ') ', then estimating β and γ separately amounts to solving . It is easy to see that in this case, and . But as a referee pointed out to us in a review of a previous version of this manuscript, the terms and are equal to zero. Thus a robust variance estimator of is given by . The practical implication of this result in our case is that the test on two degrees of freedom described earlier can be carried out by computing , which should follow a chisquared distribution with two degrees of freedom under the null hypothesis of no intervention effect.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
TD and BA surveyed the statistical literature and investigated the theoretical properties of the Coxbinomial method. TD ran the simulation study and reanalyzed the Indian data. TD wrote the manuscript. CML designed the estimation of retrospective data in absence of baseline and participated in the interpretation of the results. MA participated in the literature review, the development of the Coxbinomial method and the interpretation of the results. All authors read, critically reviewed and approved the final manuscript.
Acknowledgements
We are indebted to the agencies that collected the IBBA data: the Institutes of the Indian Council of Medical Research and FHI 360 for the data collected in the states of Andhra Pradesh, Maharashtra and Tamil Nadu, as well as the University of Manitoba and the Karnataka Health Promotion Trust (KHPT) for data collected in Karnataka state. In particular, we thank Ramesh Paranjape and Mandar K, Maindar, from the National AIDS Research Institute, Rajatashuvra Adhikary and Lakshmi Ramakrishnan, from FHI 360, Stephen Moses, from the University of Manitoba, and B.M. Ramesh, from KHPT, for providing access to the databases. We thank Eric Demers for his help in formatting the IBBA data and his rerunning the analysis of these data with the GEE method without sampling weights. We also thank the Natural Sciences and Engineering Research Council of Canada and Bill & Melinda Gates Foundation for their financial support of this research. The views expressed herein are those of the authors and do not necessarily reflect the official policy or position of the Bill & Melinda Gates Foundation. A Web site is available to apply for access to the IBBA data [http://www.nariicmr.res.in/IBBAdataaccess.php]. Finally we thank the two referees for their detailed comments that have led to significant improvement of the manuscript.
References

Boerma T, de Soyza I: Beyond accountability: learning from largescale evaluations.
Lancet 2011, 378:16101612. PubMed Abstract  Publisher Full Text

Ng M, Gakidou E, LevinRector A, Khera A, Murray C, Dandona L: Assessment of populationlevel effect of Avahan, an HIVprevention initiative in India.
Lancet 2011, 378:16431652. PubMed Abstract  Publisher Full Text

Avahan: The India AIDS Initiative: The business of HIV prevention at scale. New Delhi, India: Bill & Melinda Gates Foundation; 2008.

Chandrasekaran P, Dallabetta G, Loo V, Mills S, Saidel T, Adhikary R, Alary M, Lowndes CM, Boily MC, Moore J, et al.: Evaluation design for largescale HIV prevention programmes: the case of Avahan, the India AIDS initiative.

Jalan J, Ravallion M: Does piped water reduce diarrhea for children in rural India?
J Econom 2003, 112(1):153173. Publisher Full Text

Galiani S, Gertler P, Schargrodsky E: Water for life: the impact of the privatization of water services on child mortality.
J Polit Econ 2005, 113(1):83120. Publisher Full Text

Lim SS, Dandona L, Hoisington JA, James SL, Hogan MC, Gakidou E: India’s Janani Suraksha Yojana, a conditional cash transfer programme to increase births in health facilities: an impact evaluation.
Lancet 2010, 375(9730):20092023. PubMed Abstract  Publisher Full Text

Lowndes CM, Alary M, Verma S, Demers E, Bradley J, Jayachandran A, Ramesh B, Moses S, Adhikary R, Mainkar M: Assessment of intervention outcome in the absence of baseline data: ‘reconstruction’ of condom use time trends using retrospective analysis of survey data.
Sex Transm Infect 2010, 86(Suppl. 1):i49i55. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Minard C, Chan W, Wetter D, Etzel C: Trends in smoking cessation: a Markov model approach.
J Appl Stat 2012, 39:113127. Publisher Full Text

Abbott R: Logistic regression in survival analysis.
Am J Epidemiol 1985, 121:465471. PubMed Abstract  Publisher Full Text

Mancl L, Hujoel P, DeRouen T: Efficiency issues among statistical methods for demonstrating efficacy of caries prevention.
J Dent Res 2004, 83(Suppl.1):C95C98. PubMed Abstract  Publisher Full Text

Zeroinflated semiparametric Cox’s regression model for leftcensored survival data [http://homes.stat.unipd.it/mgri/SCo2011/Papers/CS/CS3/grouwels_braekers.pdf webcite]

Banandur P, Rajaram SP, Mahagaonkar SB, Bradley J, Ramesh BM, Washington RG, Blanchard JF, Moses S, Lowndes CM, Alary M: Heterogeneity of the HIV epidemic in the general population of Karnataka state, south India.
BMC Public Health 2011, 11(Suppl 6):S13. BioMed Central Full Text

Boily MC, Lowndes CM, Vickerman P, Kumaranayake L, Blanchard JF, Moses S, Ramesh BM, Pickles M, Watts C, Washington RG, et al.: Evaluating largescale HIV prevention interventions: study design for an integrated mathematical modelling approach.

Saidel T, Adhikary R, Mainkar M, Dale J, Loo V, Rahman M, Ramesh B, Paranjape R: Baseline integrated behavioural and biological assessment among most atrisk populations in six highprevalence states of India: design and implementation challenges.

Details Regarding IBBA RoundI & II Data [http://www.nariicmr.res.in/IBBAdataaccess.php webcite]

Therneau TM, Grambsch PA: Modeling Survival Data: Extending the Cox Model. New York: SpringerVerlag; 2000.

R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011.

SAS Institute: SAS/STAT 9.1, User’s Guide. Cary, NC: SAS Institute Inc.; 2004.

StataCorp: Stata Statistical Software: Release 12. College Station: TX: StataCorp LP; 2011.

Asgharian M, M’Lan CE, Wolfson DB: Lengthbiased sampling with right censoring.
J Am Stat Assoc 2002, 97:201209. Publisher Full Text

Mandel M, Ritov Y: The accelerated failure time model under biased sampling.
Biometrics 2010, 66:13061308. PubMed Abstract  Publisher Full Text

Liu M, Lu W, Shao Y: A Monte Carlo approach for changepoint detection in the Cox proportional hazards model.
Stat Med 2008, 27:38943909. PubMed Abstract  Publisher Full Text
Prepublication history
The prepublication history for this paper can be accessed here: