STATS NOTES: auxiliary variables and why they matter

28 November 2024

What is an auxiliary variable?

Missing data are a major source of potential bias in statistical analyses. They are commonly handled using multiple imputation, in which each missing value is imputed several times and the results are combined, yielding unbiased estimates when the method's assumptions hold. The key assumption is that the data are missing at random (MAR). This means that we assume the probability of a value being missing depends only on the data we have observed, and not on the missing values themselves or on other factors for which data are not available.

Here at the Centre for Longitudinal Studies, a lot of work has gone into demonstrating that findings from cohort studies can be rendered representative of the overall target population, provided that appropriate auxiliary variables are included when undertaking multiple imputation. Auxiliary variables are included in the imputation model, but not in the models used for data analysis. The key is to include enough good predictors of non-response, and of the missing values themselves, that we can plausibly assume the data are MAR. If these aren’t included, the data risk being missing not at random (MNAR), meaning that unobserved factors influence the probability of missingness. Data that are MNAR can lead to biased results even with multiple imputation.
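As a rough illustration of why this matters, the short Python sketch below simulates data that are MAR given a single fully observed variable. Everything in it is invented for the illustration: x stands in for an auxiliary variable that predicts both non-response and the value of y, and the probabilities and effect sizes are arbitrary assumptions. It compares the mean of y among the observed cases with an estimate that uses x.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 20000
rows = []
for _ in range(n):
    x = 1 if rng.random() < 0.5 else 0   # fully observed (auxiliary) variable
    y = 2.0 * x + rng.gauss(0, 1)        # variable that will have missing values
    p_missing = 0.6 if x == 1 else 0.1   # MAR: missingness depends on x only
    observed = rng.random() >= p_missing
    rows.append((x, y, observed))

# Mean of y if nothing were missing (known only because this is a simulation).
full_mean = statistics.fmean(y for _, y, _ in rows)

# Ignoring x: the mean of the observed values alone.
complete_case_mean = statistics.fmean(y for _, y, obs in rows if obs)

# Using x: within each level of x the observed values are representative,
# so weight each stratum mean by that level's share of the full sample.
adjusted_mean = 0.0
for level in (0, 1):
    share = sum(1 for x, _, _ in rows if x == level) / n
    stratum_mean = statistics.fmean(y for x, y, obs in rows if obs and x == level)
    adjusted_mean += share * stratum_mean

print(round(full_mean, 2), round(complete_case_mean, 2), round(adjusted_mean, 2))
```

In this setup the complete-case mean falls well below the full-sample mean, because cases with x = 1 have higher values of y and are more likely to be missing, while the estimate that uses x lands close to it. Multiple imputation with x in the imputation model exploits the same information in a more general way.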


How to select auxiliary variables

The process of choosing which auxiliary variables to include depends on the data that you’re using. For example, most of the CLS cohort studies now have standard sets of predictors of non-response that should form part of the set of auxiliary variables in all studies using the data. These can be found in the handling missing data user guide. If there isn’t already a list available for your dataset, you might be able to follow a similar approach to identify the predictors of missingness that should be included. Failing that, you can rely on a theoretical understanding of what factors are most likely to drive missingness in the data.
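If you need to screen candidate predictors of non-response yourself, one simple starting point is to compare each candidate between respondents and non-respondents. The Python sketch below does this on a small made-up dataset. The variable names (outcome, parent_education, birth_weight) and all values are hypothetical, and this is only one possible screening approach, not the procedure used to derive the CLS lists.

```python
import statistics

# Hypothetical records; None marks a missing outcome (non-response).
records = [
    {"outcome": 3.1, "parent_education": 4, "birth_weight": 3.4},
    {"outcome": 2.8, "parent_education": 5, "birth_weight": 3.1},
    {"outcome": None, "parent_education": 1, "birth_weight": 3.3},
    {"outcome": 3.5, "parent_education": 4, "birth_weight": 3.6},
    {"outcome": None, "parent_education": 2, "birth_weight": 3.2},
    {"outcome": 2.9, "parent_education": 3, "birth_weight": 3.0},
    {"outcome": None, "parent_education": 1, "birth_weight": 3.5},
    {"outcome": 3.3, "parent_education": 5, "birth_weight": 3.3},
    {"outcome": None, "parent_education": 2, "birth_weight": 3.1},
    {"outcome": 3.0, "parent_education": 4, "birth_weight": 3.5},
]


def missingness_gap(records, target, candidate):
    """Mean of `candidate` among non-respondents minus the mean among
    respondents, in units of the candidate's overall standard deviation."""
    responders = [r[candidate] for r in records if r[target] is not None]
    non_responders = [r[candidate] for r in records if r[target] is None]
    spread = statistics.pstdev(responders + non_responders)
    return (statistics.fmean(non_responders) - statistics.fmean(responders)) / spread


gaps = {
    name: missingness_gap(records, "outcome", name)
    for name in ("parent_education", "birth_weight")
}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
```

A large gap in either direction suggests the candidate is related to non-response and is worth considering as an auxiliary variable. In this toy dataset parent_education differs sharply between the two groups, whereas birth_weight barely differs.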

It is also important to include some auxiliary variables that can predict the missing values themselves, which you will need to identify using your understanding of the data and the relationships between the different concepts that are measured.
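A simple check that can support this judgement is to look at how strongly each candidate is correlated with the incomplete variable among the cases where both are observed. The sketch below uses made-up data and hypothetical variable names (outcome, earlier_score, household_size). It is a screening aid under those assumptions, not a substitute for subject knowledge.

```python
import statistics

# Hypothetical records; None marks a missing outcome.
records = [
    {"outcome": 10.0, "earlier_score": 9.0, "household_size": 3},
    {"outcome": 12.0, "earlier_score": 11.5, "household_size": 4},
    {"outcome": None, "earlier_score": 8.0, "household_size": 4},
    {"outcome": 15.0, "earlier_score": 14.0, "household_size": 4},
    {"outcome": 9.0, "earlier_score": 9.5, "household_size": 4},
    {"outcome": None, "earlier_score": 13.0, "household_size": 3},
    {"outcome": 13.0, "earlier_score": 12.0, "household_size": 3},
    {"outcome": 11.0, "earlier_score": 10.0, "household_size": 3},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Only cases with an observed outcome can tell us about this relationship.
complete = [r for r in records if r["outcome"] is not None]
correlations = {
    name: pearson([r[name] for r in complete], [r["outcome"] for r in complete])
    for name in ("earlier_score", "household_size")
}
for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
```

Here earlier_score is strongly correlated with the outcome and would be a natural auxiliary variable, while household_size is only weakly related to it.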


Including auxiliary variables in the imputation model

Whether you’re using the R package mice, the Julia package Mice.jl, the Stata mi command or another tool for multiple imputation, the procedure is broadly the same. You can simply include the auxiliary variables in the dataset that will be imputed, as if they were additional independent variables, and then leave them out of the analysis model that is fitted to the imputed datasets.
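To make the idea concrete without tying it to any one package, here is a minimal Python sketch using only the standard library. It is not how mice, Mice.jl or Stata work internally: it uses a simplified bootstrap-based stochastic regression imputation as a stand-in, on simulated data in which a hypothetical auxiliary variable (aux) predicts both the value of y and whether y is missing. The function names, sample size and coefficients are all assumptions made for the illustration.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, residual SD)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, (rss / (len(xs) - 2)) ** 0.5


def pooled_mean(aux, y, use_aux, m=20, seed=0):
    """Impute missing y values m times and average the m estimates of mean(y)."""
    rng = random.Random(seed)
    observed = [(a, v) for a, v in zip(aux, y) if v is not None]
    estimates = []
    for _ in range(m):
        # Resample the observed cases so each imputation reflects
        # uncertainty in the imputation model's parameters.
        boot = [rng.choice(observed) for _ in observed]
        if use_aux:
            b0, b1, sd = fit_line([a for a, _ in boot], [v for _, v in boot])
        else:
            values = [v for _, v in boot]
            b0, b1, sd = statistics.fmean(values), 0.0, statistics.stdev(values)
        completed = [
            v if v is not None else b0 + b1 * a + rng.gauss(0, sd)
            for a, v in zip(aux, y)
        ]
        estimates.append(statistics.fmean(completed))
    return statistics.fmean(estimates)  # pooled point estimate


# Simulated data: higher values of aux raise y and make y more likely to be missing.
rng = random.Random(2024)
n = 5000
aux = [rng.gauss(0, 1) for _ in range(n)]
y_full = [1.0 + a + rng.gauss(0, 1) for a in aux]
y = [
    None if rng.random() < 1 / (1 + math.exp(-1.5 * a)) else v
    for a, v in zip(aux, y_full)
]

full_mean = statistics.fmean(y_full)
with_aux = pooled_mean(aux, y, use_aux=True)
without_aux = pooled_mean(aux, y, use_aux=False)
print(round(full_mean, 2), round(with_aux, 2), round(without_aux, 2))
```

The analysis here is just the mean of y, which does not involve aux at all. The auxiliary variable only enters the imputation step, and the estimate that uses it sits much closer to the full-data mean than the estimate that ignores it. In a real analysis you would also pool the variances using Rubin's rules, which dedicated packages do for you.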

Auxiliary variables can also be useful when using full-information maximum likelihood (FIML) estimation, and both Stata and the R package lavaan have auxiliary variable functionality built in.


Caution required

It is important to note that inclusion of inappropriate auxiliary variables can cause collider bias. To combat this, you should usually avoid including potential colliders as auxiliary variables. In practice, this often means not including variables that are measured after the exposure of interest, among other precautionary measures. However, if a variable is a strong predictor of non-response, but also could act as a collider, you may be in a no-win situation. In this case, there is a choice between excluding it, risking bias from the data becoming MNAR, and including it, risking collider bias. There is no one-size-fits-all solution – the choices you make depend on the research question being investigated and the characteristics of the dataset.
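For readers less familiar with colliders, the Python sketch below illustrates the general phenomenon rather than an imputation run. It simulates an exposure and an outcome that are truly unrelated, plus a third variable caused by both, and then looks at what happens when the analysis conditions on that third variable. The variable names, sample size and selection threshold are arbitrary assumptions.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(7)
data = []
for _ in range(20000):
    exposure = rng.gauss(0, 1)
    outcome = rng.gauss(0, 1)                        # independent of the exposure
    collider = exposure + outcome + rng.gauss(0, 0.5)  # caused by both
    data.append((exposure, outcome, collider))

# Whole sample: exposure and outcome are essentially uncorrelated.
r_all = pearson([e for e, _, _ in data], [o for _, o, _ in data])

# Conditioning on the collider (here, keeping only high values of it)
# creates an association that does not exist in the whole sample.
selected = [(e, o) for e, o, c in data if c > 1.0]
r_selected = pearson([e for e, _ in selected], [o for _, o in selected])

print(round(r_all, 2), round(r_selected, 2))
```

In the whole sample the correlation is close to zero, but among cases with a high value of the collider it is clearly negative. Conditioning on a collider through an imputation model can distort estimates in a comparable way, which is why candidate auxiliary variables need this kind of scrutiny.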


To sum up

Auxiliary variables are an important part of handling missing data. They make the missing at random assumption more plausible, and they can also increase precision by providing additional information to inform the imputation of missing values. Identifying and cleaning auxiliary variables takes extra work, but it is worth it to increase your confidence in your findings!


STATS NOTES: auxiliary variables and why they matter by Tom Metherell is licensed under CC BY 4.0