STATS NOTES: handling missing data in a survival model

What’s the issue?

In the project I am currently working on, I have data about whether, and when, study participants received secondary care for psychiatric reasons, from the Hospital Episode Statistics (HES) dataset linked to the Millennium Cohort Study. Because this is time-to-event data, my collaborators and I decided that a Cox proportional hazards model would be the most parsimonious way to address the research question. But first, I needed to deal with missing data.

Regardless of the modelling strategy, it is important to follow a principled method for dealing with missingness, of which multiple imputation is one. In the R programming language, the mice package is most commonly used for multiple imputation, but in my case there is a problem: a Cox proportional hazards model is non-linear, which makes the imputation models in the mice package incompatible with the substantive model I want to use to analyse my data.

In 2015, Bartlett et al. published their solution to this problem – a new multiple imputation algorithm that enforces the compatibility of the imputation model with the substantive model. The result is multiple imputation by substantive-model-compatible fully conditional specification (or SMCFCS, to its friends). “Amazing,” I thought, “I can just use that.” But of course, it would not be so simple.
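
To give a flavour of what “compatible” means here, below is a minimal sketch in Python. It is not the smcfcs implementation. It assumes a deliberately simple substantive model (an exponential proportional hazards model with hazard base_rate × exp(beta × x)), a single missing binary covariate x, and known parameter values; the function names and all numbers are made up for illustration. The missing covariate is drawn with probability proportional to the likelihood of the observed survival outcome under the substantive model, multiplied by the prior probability of x. As I understand it, that product is the core idea behind SMCFCS; the real algorithm also has to draw the model parameters and cycle through every incomplete variable.

```python
import math
import random


def survival_likelihood(time, event, x, base_rate, beta):
    """Likelihood of one (time, event) pair under an exponential
    proportional hazards model with hazard base_rate * exp(beta * x)."""
    hazard = base_rate * math.exp(beta * x)
    # An observed event contributes the hazard; a censored time does not.
    hazard_term = hazard if event else 1.0
    return hazard_term * math.exp(-hazard * time)


def prob_x_is_one(time, event, prior_p, base_rate, beta):
    """P(x = 1 | time, event): substantive-model likelihood times
    the prior probability of x, normalised over x = 0 and x = 1."""
    w1 = survival_likelihood(time, event, 1, base_rate, beta) * prior_p
    w0 = survival_likelihood(time, event, 0, base_rate, beta) * (1 - prior_p)
    return w1 / (w1 + w0)


def impute_x(time, event, prior_p, base_rate, beta, rng):
    """Draw one imputed value for the missing binary covariate."""
    p = prob_x_is_one(time, event, prior_p, base_rate, beta)
    return 1 if rng.random() < p else 0


# Hypothetical parameter values, chosen only for illustration.
PRIOR_P, BASE_RATE, BETA = 0.5, 0.1, 1.0

early_event = prob_x_is_one(1.0, True, PRIOR_P, BASE_RATE, BETA)
late_censored = prob_x_is_one(20.0, False, PRIOR_P, BASE_RATE, BETA)
print(early_event, late_censored)

rng = random.Random(2015)
print([impute_x(1.0, True, PRIOR_P, BASE_RATE, BETA, rng) for _ in range(5)])
```

With a positive beta, a participant with an early event is more likely to be imputed x = 1 than the prior alone would suggest, while a participant who is censored after a long follow-up is pushed towards x = 0. An imputation model that ignores the form of the survival model has no way of reproducing that relationship exactly.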

Cox proportional hazards models are frequently used in clinical trials, where it is uncommon to have missing data in the outcome. For example, if the outcome is mortality, it would be unusual to not know whether a participant in your clinical trial had died. As a result, discussions about missing data in these models normally revolve around missingness in the independent variables. But the linkage data that I am using does have missingness in the outcome, because only a minority of the study’s participants have consented to the HES linkage.

Because I need to keep my results representative of the population of England born in 2000–2, I can’t simply remove participants who didn’t consent to the HES linkage. But when I went to apply SMCFCS to my data (using the excellent smcfcs package), I found that imputing missing data in the dependent and independent variables simultaneously was not yet supported.

So, what to do?

There are two options that I know of:

1. Calculate non-response weights equal to the inverse of each participant’s modelled probability of consenting to the HES linkage. Then remove those who didn’t consent from the dataset, so that there is no longer any missing data in the outcome. I can then run SMCFCS as normal, and afterwards apply the non-response weights in my analysis to arrive at estimates that are unbiased, provided the consent model is correctly specified.
2. Notice how I wrote “was not yet supported” in the past tense: in record time, Jonathan Bartlett has devised and implemented a new function in the smcfcs R package that allows imputation of both missing outcomes and missing independent variable data. Because it uses a flexible, spline-based parametric survival model instead of the semi-parametric Cox model, it is much easier to impute the missing outcome values. This is particularly useful when you have auxiliary variables that you can include in your imputation model to inform the imputation of the outcome. You can read more about this new approach on Jonathan Bartlett’s blog.
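
The first option above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the weights used in my analysis: the participants and the single categorical variable (“stratum”) are made up, and the consent probability is simply the consent rate within each stratum. In practice the probability would more likely come from a regression model (for example, logistic regression) on several covariates.

```python
from collections import Counter

# Hypothetical participants: (stratum, consented_to_linkage).
participants = (
    [("A", True)] * 40 + [("A", False)] * 10
    + [("B", True)] * 20 + [("B", False)] * 30
)


def nonresponse_weights(records):
    """Return {stratum: weight}, where weight = 1 / consent rate.

    Assumes every stratum contains at least one consenter; otherwise
    the consent rate is zero and the weight is undefined.
    """
    totals = Counter(stratum for stratum, _ in records)
    consenters = Counter(s for s, consented in records if consented)
    return {s: totals[s] / consenters[s] for s in totals}


weights = nonresponse_weights(participants)

# Drop non-consenters; each consenter carries their stratum's weight.
analysis_sample = [(s, weights[s]) for s, consented in participants if consented]

weighted_total = sum(w for _, w in analysis_sample)
weighted_share_b = (
    sum(w for s, w in analysis_sample if s == "B") / weighted_total
)
unweighted_share_b = (
    sum(1 for s, _ in analysis_sample if s == "B") / len(analysis_sample)
)
print(weights)
print(weighted_share_b, unweighted_share_b)
```

In this toy example, stratum B is under-represented among consenters, and the weights restore its share of the sample to what it was before anyone was removed. The weighted analysis then stands in for the full cohort, as long as consent depends only on the variables used to build the weights.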

What can we do with this?

Linkages between established longitudinal studies and routine healthcare data are a relatively new phenomenon. They unlock all kinds of opportunities to investigate the relationship between the socioeconomic and lifestyle factors that are measured in the longitudinal studies and clinical outcomes that are recorded in healthcare data. However, they also present new statistical challenges, because attrition and non-response are pervasive in voluntary studies like the Millennium Cohort Study.

With this methodological gap closed, it becomes easier to use survival models to address questions via linked healthcare data without subjecting the results to the biasing effect of missing data. This helps us to further our understanding of the health impacts of a variety of behaviours and socioeconomic influences (in my case, social media use) across the life course.

STATS NOTES: handling missing data in a survival model by Tom Metherell is licensed under CC BY 4.0