Dealing with Missing Data in Healthcare: Best Practices for Imputation

Ramanpreet Bhatia
May 23, 2023

Missing data is a common problem in statistical analysis, and it is especially pervasive in healthcare, where gaps arise from incomplete patient records, patients lost to follow-up, or procedural errors. The default approach is often to drop records with missing values, but this can bias results and discards information. Imputation, the process of replacing missing values with substituted ones, avoids that loss but can introduce bias or error of its own if applied carelessly. This article outlines how to navigate that trade-off, with a focus on validating your imputed results and checking how sensitive they are to the imputation method.

Types of Missing Data and Their Imputation Strategies

There are three main types of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

  • MCAR means that the probability of a value being missing is the same for all observations and is unrelated to any observed or unobserved values. In healthcare, for instance, a system error might prevent blood pressure readings from being recorded for every patient seen on a specific day. Simple methods such as mean/mode imputation or random sampling from the observed values may work for MCAR, although mean imputation still understates the variable's variance.
  • MAR means that the missingness depends on observed data but, once that observed data is accounted for, not on the missing values themselves. For example, younger patients might be less likely to respond to follow-up surveys, while their age is still on record. Methods suited to MAR include regression imputation, stochastic regression imputation, and multiple imputation.
  • MNAR means that the missingness is related to the value of the variable that’s missing. For instance, patients with severe illnesses might be less likely to report their symptoms. More advanced methods like pattern-mixture models or selection models are usually applied to deal with MNAR data.
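To make the three mechanisms concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an illustrative assumption: the toy relationship between age and systolic blood pressure, the missingness probabilities, and the function names `simulate` and `apply_missingness`. It simply shows how the mean of the observed values drifts away from the true mean under each mechanism.

```python
import random
import statistics

def simulate(n=20000, seed=0):
    """Generate (age, systolic_bp) pairs; BP rises with age in this toy model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)
        bp = 100 + 0.5 * age + rng.gauss(0, 10)
        rows.append((age, bp))
    return rows

def apply_missingness(rows, mechanism, seed=1):
    """Return the BP values that remain observed under the given mechanism."""
    rng = random.Random(seed)
    observed = []
    for age, bp in rows:
        if mechanism == "MCAR":
            p_missing = 0.3                        # same probability for everyone
        elif mechanism == "MAR":
            p_missing = 0.5 if age < 50 else 0.1   # depends only on observed age
        elif mechanism == "MNAR":
            p_missing = 0.5 if bp > 125 else 0.1   # depends on the BP value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() >= p_missing:
            observed.append(bp)
    return observed

rows = simulate()
true_mean = statistics.fmean(bp for _, bp in rows)
observed_means = {
    m: statistics.fmean(apply_missingness(rows, m))
    for m in ("MCAR", "MAR", "MNAR")
}
for mechanism, value in observed_means.items():
    print(f"{mechanism}: observed mean {value:.1f} vs true mean {true_mean:.1f}")
```

In this toy setup the MCAR mean stays close to the truth, while the MAR and MNAR means are shifted. The difference is that the MAR shift can in principle be corrected using age, which is observed, whereas the MNAR shift depends on values you never see.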

Choosing an Appropriate Imputation Method

The right imputation method depends largely on the type of missing data and the characteristics of your dataset. Simple methods like mean imputation are adequate when data is MCAR and the proportion of missing data is small. For larger proportions of missing data, or for MAR data, multiple imputation or machine learning-based methods such as K-Nearest Neighbors (KNN) or Random Forest may be more suitable. None of these methods corrects for MNAR on its own, because they all rely on the observed data; MNAR calls for approaches that model the missingness mechanism explicitly, such as the pattern-mixture or selection models mentioned above.
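The gap between a simple method and a model-based one can be illustrated with a small sketch. The example below is a toy comparison under stated assumptions, not a production workflow: the data is simulated, blood pressure is made missing more often for younger patients (a MAR mechanism), and the regression is a hand-written one-variable least-squares fit using only the standard library. It performs a single stochastic regression imputation; multiple imputation would repeat such a step several times and pool the results.

```python
import random
import statistics

rng = random.Random(42)

# Toy data: systolic BP depends on age, and BP is missing more often
# for younger patients (MAR). The true value is kept only for checking.
records = []
for _ in range(5000):
    age = rng.uniform(20, 80)
    true_bp = 100 + 0.5 * age + rng.gauss(0, 10)
    is_missing = rng.random() < (0.6 if age < 50 else 0.1)
    records.append((age, None if is_missing else true_bp, true_bp))

true_values = [t for _, _, t in records]
observed = [(a, b) for a, b, _ in records if b is not None]

# 1) Mean imputation: every gap gets the observed mean.
obs_mean = statistics.fmean(b for _, b in observed)
mean_imputed = [obs_mean if b is None else b for _, b, _ in records]

# 2) Stochastic regression imputation: fit BP ~ age on the observed rows,
#    then fill each gap with the prediction plus random residual noise.
mean_age = statistics.fmean(a for a, _ in observed)
slope = (
    sum((a - mean_age) * (b - obs_mean) for a, b in observed)
    / sum((a - mean_age) ** 2 for a, _ in observed)
)
intercept = obs_mean - slope * mean_age
residuals = [b - (intercept + slope * a) for a, b in observed]
resid_sd = statistics.stdev(residuals)
reg_imputed = [
    intercept + slope * a + rng.gauss(0, resid_sd) if b is None else b
    for a, b, _ in records
]

for label, values in [("true", true_values),
                      ("mean-imputed", mean_imputed),
                      ("regression-imputed", reg_imputed)]:
    print(f"{label}: mean {statistics.fmean(values):.1f}, "
          f"sd {statistics.stdev(values):.1f}")
```

In this simulation, mean imputation inherits the bias of the observed values and shrinks the spread, while the regression-based fill uses age to recover a mean and spread closer to the truth. That advantage holds only because the missingness here really does depend on an observed variable.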

Before selecting a method, it’s important to analyze the data thoroughly to understand the reason behind the missing data, its patterns, and its relationship with other variables.
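A first pass at this kind of exploration might look like the sketch below. The patient records, column names, and the helper functions `missing_rate` and `compare_by_missingness` are all hypothetical and exist only for illustration. The idea is to measure how much is missing per variable and then check whether an observed variable differs between rows where another variable is missing and rows where it is present.

```python
import statistics

# Hypothetical patient records; None marks a missing value.
patients = [
    {"age": 34, "systolic_bp": None, "cholesterol": 190},
    {"age": 29, "systolic_bp": None, "cholesterol": 175},
    {"age": 41, "systolic_bp": 118,  "cholesterol": None},
    {"age": 67, "systolic_bp": 142,  "cholesterol": 230},
    {"age": 72, "systolic_bp": 150,  "cholesterol": 245},
    {"age": 38, "systolic_bp": None, "cholesterol": 200},
    {"age": 59, "systolic_bp": 135,  "cholesterol": 215},
    {"age": 63, "systolic_bp": 138,  "cholesterol": None},
]

def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def compare_by_missingness(rows, target, covariate):
    """Mean of `covariate` where `target` is missing vs. where it is observed."""
    when_missing = [r[covariate] for r in rows
                    if r[target] is None and r[covariate] is not None]
    when_observed = [r[covariate] for r in rows
                     if r[target] is not None and r[covariate] is not None]
    return statistics.fmean(when_missing), statistics.fmean(when_observed)

for column in ("age", "systolic_bp", "cholesterol"):
    print(f"{column}: {missing_rate(patients, column):.0%} missing")

age_missing, age_observed = compare_by_missingness(patients, "systolic_bp", "age")
print(f"mean age when BP is missing: {age_missing:.1f}, "
      f"when BP is observed: {age_observed:.1f}")
```

A large gap between the two group means suggests that missingness is related to the covariate, which argues against MCAR. It cannot prove the data is MAR rather than MNAR, because that distinction depends on values you did not observe.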

Validity and Sensitivity Analysis

Once the imputation process is complete, you should carry out a validation and sensitivity analysis. This will allow you to assess how well your imputed dataset represents the original data, and how sensitive your results are to the imputation method used.

One way to conduct a sensitivity analysis is to compare the results obtained from the imputed dataset with those obtained from the complete cases only. If the results differ markedly, investigate why: the imputation method may have introduced bias, or the complete-case analysis may itself be biased because the data is not MCAR.
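A comparison like this can be sketched in a few lines. The example below uses a made-up list of blood pressure readings and hypothetical helper names. Alongside the complete-case estimate, it adds one assumption of its own that goes beyond the comparison described above: the imputed values are shifted by a fixed offset (`delta`) to mimic scenarios in which the missing readings are systematically lower or higher than the observed ones.

```python
import statistics

def complete_case_mean(values):
    """Mean of the observed values only."""
    return statistics.fmean(v for v in values if v is not None)

def imputed_mean(values, fill):
    """Mean after replacing every missing value with `fill`."""
    return statistics.fmean(fill if v is None else v for v in values)

def sensitivity_table(values, deltas=(-10, 0, 10)):
    """Estimate the mean under complete-case analysis and under mean
    imputation shifted by each delta (delta=0 is plain mean imputation)."""
    cc = complete_case_mean(values)
    table = {"complete_case": cc}
    for d in deltas:
        table[f"imputed_delta_{d:+d}"] = imputed_mean(values, cc + d)
    return table

# Hypothetical systolic BP readings; None marks a missing value.
bp = [118, None, 142, 150, None, 135, 138, None, 127, 130]

table = sensitivity_table(bp)
for scenario, estimate in table.items():
    print(f"{scenario}: {estimate:.1f}")
```

Plain mean imputation reproduces the complete-case mean exactly, so that comparison alone says little here. The shifted scenarios show how far the estimate moves if the missing readings differ from the observed ones, and if your conclusion survives the plausible range of shifts, it is reasonably robust to the imputation assumptions.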

Reporting the Imputation Procedure and Results

Transparent and detailed reporting of your imputation procedure is crucial. This includes disclosing the percentage of missing data, the reasons behind the missing data, the chosen imputation method, and the results of your sensitivity analysis.

The choice of imputation method should be justified with clear reasoning. The sensitivity analysis can demonstrate the robustness of your results against changes in the imputation method.

Best Practices for Imputing Missing Data

  1. Understand your data: Before deciding on any imputation method, perform a thorough analysis of your data to understand the patterns and potential reasons for missingness.
  2. Choose the most suitable imputation method: Your choice should be guided by the type of missing data (MCAR, MAR, or MNAR) and the specific characteristics of your dataset.
  3. Perform a sensitivity analysis: This will help you understand how your results might vary with different imputation methods.
  4. Report transparently: Disclose your imputation procedure and results in full so that others can evaluate the robustness and validity of your findings.

Remember, every dataset is unique, and there is no one-size-fits-all solution to imputing missing data. An approach that minimizes bias and error for one dataset may not work as well for another. Therefore, careful consideration and validation are key to handling missing data effectively in healthcare research and beyond.

Ramanpreet Bhatia

A computer scientist who is passionate about making sense of data.