Handling Missing Data

Overview

  • General Considerations
  • Reasons for Missing Data
  • Testing Type of Missingness
  • Strategies to Handle Missing Data

General Considerations

Prevalence of Missing Data

  • Dong & Peng (2013) summarized sources finding that:
    • 48% of exp. psych. & education research articles from 1998 – 2004 had missing data
      • Up to 16% more may have; couldn’t be sure
    • Amount of missingness typically 15 – 20%
    • 97% of studies handled missingness solely through list- or pairwise deletion

Prevalence of Missing Data (cont.)

  • Bell et al. (2014), randomized clinical trials published July – Dec., 2013, in the BMJ, JAMA, Lancet, & New England Journal of Medicine (n = 77):
    • 95% reported missing outcome data
    • Methods used to handle missing data:
      • Listwise deletion: 45%
      • Single imputation: 27%
      • Model-based methods: 19%
      • Multiple imputation: 8%

Common Effects of Missing Data

  • Ignoring missingness can:
    • Reduce statistical power
    • Bias results
    • Distort size & even direction of associations

Acceptable Levels of Missingness

  • < 5%
    • “[O]ften inconsequential” (Schafer, 1999)
    • “Complete case analysis may be used as the primary analysis if the proportions of missing data are below approximately 5%” (Jakobsen et al., 2017)
  • 5 – 10%

Acceptable Levels of Missingness (cont.)

  • 10 – 20%
    • “[C]arefully evaluate the mechanism of missingness” (Enders, 2022)
  • > 20%
    • “[C]an seriously compromise statistical power and validity, even with imputation” (Enders, 2022)
  • > 40%

Reasons for Missing Data

  • Three data-related reasons values may be missing:
    1. Not missing at random (MNAR): I.e., missingness is associated with other values in our data, both known & unknown
      • We cannot safely impute missing values from the known (non-missing) data since missingness is at least in part due to unknown (missing) values
    2. Missing at random (MAR): I.e., due only to known values in our data
      • We could impute missing values, but…
    3. Missing completely at random (MCAR): I.e., due entirely to chance, or at least to factors outside of our data
      • We can likely impute missing values
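
The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under assumed values: the variables (age, income), the missingness rates, and the logistic functions are all hypothetical and chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical variables: age is always observed; income may be missing.
age = rng.normal(40, 10, n)
income = 30 + 0.5 * age + rng.normal(0, 5, n)

def logistic(z):
    return 1 / (1 + np.exp(-z))

# MCAR: every value has the same 20% chance of being missing.
miss_mcar = rng.random(n) < 0.20
# MAR: the chance of being missing depends only on an observed variable (age).
miss_mar = rng.random(n) < logistic((age - 40) / 5 - 1.5)
# MNAR: the chance of being missing depends on the unobserved value itself (income).
miss_mnar = rng.random(n) < logistic((income - 50) / 3 - 1.5)

for name, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    observed_mean = income[~miss].mean()
    print(f"{name}: {miss.mean():.0%} missing, "
          f"observed mean = {observed_mean:.2f} (true mean = {income.mean():.2f})")
```

In this setup the observed mean should stay close to the true mean only under MCAR. Under MAR the shift can in principle be corrected using age. Under MNAR the information needed for the correction is itself missing.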

Why MNAR Is Problematic

  • Compromises external (& perhaps internal) validity
    • We know some population members are poorly sampled
    • But we do not know who they are
  • Since missing values depend (at least in part) on unknowable values,
    • We cannot accurately impute missing values
  • With MNAR, the best we can do is use listwise/pairwise deletion
    • And do our darned best to ensure unknown values aren’t biasing our results (& report that we did)

Data Are Often Not Missing at Random

The Problem of MAR vs. MNAR

  • Theoretically, we could impute a given variable’s missing values if their actual values depended only on known (i.e., non-missing) values in our data
  • However, we don’t know what the missing values are,
    • So we can’t test them to see if they’re only affected by known and not also unknown values
    • I.e., we don’t have the information we need to distinguish between MAR & MNAR
  • But we can tell if missingness is related to the values we do know
    • So, we can test if data are missing completely at random

Testing Type of Missingness

General Strategy

  • Review for patterns
    • E.g., if participants seemed to end participation early
    • If certain items/tasks tended to be missed together
      • E.g., were sensitive issues or certain topics avoided
    • If certain subgroups had more missing values
  • If no patterns jump out, we’re justified to proceed & test formally if data are missing completely at random
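
A first pass at reviewing patterns might look like the following pandas sketch. The data frame and its column names (group, item1 to item3) are hypothetical; the point is only to show the kinds of summaries that can reveal items missed together or subgroups with more missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; NaN marks a missing response.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "item1": [3, 4, np.nan, 2, np.nan, 5],
    "item2": [1, np.nan, np.nan, 2, np.nan, 4],
    "item3": [5, 5, np.nan, 4, 3, 2],
})
items = ["item1", "item2", "item3"]

# Percent missing per variable.
pct_missing = df[items].isna().mean() * 100

# Missing-data patterns: each distinct combination of missing flags and its count.
patterns = df[items].isna().value_counts()

# Do certain items tend to be missed together? Correlate the missingness flags.
co_missing = df[items].isna().astype(int).corr()

# Do certain subgroups have more missing values? Share missing by group.
by_group = df[items].isna().groupby(df["group"]).mean()

print(pct_missing, patterns, co_missing, by_group, sep="\n\n")
```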

Little’s MCAR Test

  • Tests whether data are Missing Completely at Random
  • Uses a χ² test of association
    • Tests whether the values of known data are associated with the prevalence of missing ones
      • I.e., are those with certain known variable values more likely to have missing data than those with other values?

Little’s MCAR Test (cont.)

  • If p < .05, known values significantly predict missingness
    • Missingness is at least “at random” (MAR),
      • Which is not empirically distinguishable from MNAR
  • If p > .05, known values are not significantly related to missingness
    • Consistent with MCAR
  • N.b., like any significance test, Little’s test is sensitive to sample size
    • And hypothesis tests aren’t good at “proving the null”
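
The underlying question (are known values associated with missingness?) can be illustrated with a much simpler check than Little's test. The sketch below is not Little's test itself. It is a plain χ² test of association between one fully observed variable and a missingness indicator, run on simulated data with hypothetical names (sex, score) and assumed missingness rates.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data: "sex" is fully observed; "score" is missing more often for one group.
sex = rng.choice(["f", "m"], size=n)
p_missing = np.where(sex == "m", 0.30, 0.10)
score_missing = rng.random(n) < p_missing

# Cross-tabulate the known variable against the missingness indicator.
table = pd.crosstab(sex, score_missing)

# Chi-square test of association between known values and missingness.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p_value:.4g}")
```

A small p-value here would argue against MCAR for this one pair of variables. Little's test addresses the same question across all variables and missing-data patterns at once.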

Strategies for Handling Missing Data

Removing/Replacing Missing Values

  • Techniques to remove or replace missing data:
    • Listwise deletion (aka complete case analysis)
    • Available case analysis
    • Single imputation
    • Multiple imputation

Listwise Deletion / Complete Case Analysis

  • Using only rows (cases) where there are no missing data
  • Most common approach
  • Defensible with low levels of missingness
    • When missingness is low, non-randomness has minimal effect
    • Imputation can still help, but—well—minimally
  • Note, will reduce power (i.e., increase standard errors)
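
In pandas, listwise deletion is a single call to dropna(). The data below are hypothetical and only serve to show how quickly rows are lost.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with scattered missing values (NaN).
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 3.2, 4.8, 5.1, 6.3],
    "z": [0.5, 1.5, 2.5, np.nan, 4.5, 5.5],
})

# Listwise deletion: keep only rows with no missing values on any variable.
complete_cases = df.dropna()

print(f"{len(df)} rows before, {len(complete_cases)} rows after listwise deletion")
```

Each variable here is missing only one of six values, yet half of the rows are dropped because the gaps fall on different rows.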

Available Case Analysis

  • Using all available data for a given analysis
    • Even if there are data missing for other variables
    • Pairwise deletion is an example of this
  • Has otherwise similar recommendations as listwise
    • While using more of the data
    • And thus retaining more power than listwise
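
As a sketch of the difference, pandas' DataFrame.corr() already works pairwise: each correlation uses every row where both variables are observed. The data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with scattered missing values (NaN).
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 3.2, 4.8, 5.1, 6.3],
    "z": [0.5, 1.5, 2.5, np.nan, 4.5, 5.5],
})

# Available case (pairwise) analysis: corr() skips a row only for the pairs
# in which one of the two variables is missing.
pairwise_corr = df.corr()

# Number of rows actually used for each pair of variables.
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed

# For comparison, listwise deletion uses the same (smaller) set of rows throughout.
listwise_corr = df.dropna().corr()

print(pairwise_n)
```

One consequence worth keeping in mind is that each coefficient can rest on a different subset of cases, so sample sizes should be reported per analysis.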

Single Imputation

  • Replacing each missing value with a single estimated value
  • E.g. replacing missing values with:
    • The mean (or median) of available values for that variable
    • A value estimated through linear regression, etc.
    • The “last observation carried forward”
      • Replacing missing values with that participant’s last known value
    • The “worst observation carried forward”
      • Replacing missing values with that participant’s worst observed value
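
Three of these replacements can be sketched in pandas as follows. The long-format data (id, visit, score) are hypothetical, and the sketch assumes that higher scores are worse; with a scale where lower is worse, "min" would replace "max".

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: one row per participant per visit.
# Assumption: higher scores are worse (e.g., a symptom scale).
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "visit": [1, 2, 3, 1, 2, 3],
    "score": [10.0, 14.0, np.nan, 8.0, np.nan, 6.0],
})

# Mean imputation: replace missing values with the mean of the observed values.
df["mean_imp"] = df["score"].fillna(df["score"].mean())

# Last observation carried forward (LOCF), within each participant.
df["locf"] = df.groupby("id")["score"].ffill()

# Worst observation carried forward (WOCF): the participant's worst
# (here, highest) observed score.
df["wocf"] = df["score"].fillna(df.groupby("id")["score"].transform("max"))

print(df)
```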

Single Imputation (cont.)

  • Single imputation is easy to understand & implement
    • Requires little computing power or statistical prowess
  • But:
    • Under-estimates variance
    • Often over-estimates relationships
  • No longer recommended
    • But perhaps best that can be done without using R or a special SPSS add-on
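
Both drawbacks can be seen in a small simulation. This is a sketch under assumed values (two hypothetical variables x and y, 30% of y missing completely at random), not a general proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical correlated variables; x is fully observed,
# y loses about 30% of its values completely at random.
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 0.30

# Mean imputation: every missing y gets the observed mean.
y_mean = y.copy()
y_mean[missing] = y[~missing].mean()

# Regression imputation: every missing y gets its predicted value from x.
slope, intercept = np.polyfit(x[~missing], y[~missing], 1)
y_reg = y.copy()
y_reg[missing] = intercept + slope * x[missing]

print(f"variance of y: complete {y.var():.2f}, mean-imputed {y_mean.var():.2f}, "
      f"regression-imputed {y_reg.var():.2f}")
print(f"corr(x, y): complete {np.corrcoef(x, y)[0, 1]:.2f}, "
      f"regression-imputed {np.corrcoef(x, y_reg)[0, 1]:.2f}")
```

Imputed values sit exactly on the mean or on the regression line, so they add no spread of their own. That shrinks the variance, and in the regression case it makes x and y look more strongly related than they are.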

Multiple Imputation

  • Replaces each missing value with a unique one estimated from available data
  • Values estimated using (full) maximum likelihood estimation (& Markov chain Monte Carlo)
    • Use most-complete variables to compute least-complete
      • Use all of those values to estimate next least-complete, etc.
    • Usually compute several possible values for each missing value
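
The idea of computing several possible values for each missing value can be sketched for the simplest case: one incomplete variable predicted from one complete variable. This is a simplified illustration, not the maximum likelihood / MCMC procedure named above. It assumes a linear model, uses a bootstrap as a stand-in for drawing model parameters, and all names and settings (x, y, m = 20) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 20  # sample size and number of imputed data sets (both hypothetical)

# Hypothetical data: x fully observed; y missing more often when x is high (MAR).
x = rng.normal(0, 1, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 1 / (1 + np.exp(-(x - 1)))

x_obs, y_obs = x[~missing], y[~missing]
imputed_sets = []
for _ in range(m):
    # Resample the observed cases so each imputation reflects uncertainty
    # about the regression coefficients.
    idx = rng.integers(0, len(x_obs), len(x_obs))
    slope, intercept = np.polyfit(x_obs[idx], y_obs[idx], 1)
    resid_sd = np.std(y_obs[idx] - (intercept + slope * x_obs[idx]), ddof=2)

    # Predicted value plus random noise, so imputed values keep a realistic spread.
    y_imp = y.copy()
    y_imp[missing] = (intercept + slope * x[missing]
                      + rng.normal(0, resid_sd, missing.sum()))
    imputed_sets.append(y_imp)

imputed = np.array(imputed_sets)  # shape: (m, n)
print(imputed.shape)
```

Observed values are identical across the m data sets; only the imputed values differ, and that variation is what carries the uncertainty about the missing data into later analyses.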

Multiple Imputation (cont.)

  • Computationally—& conceptually—complex & intensive
  • But:
    • Produces valid standard errors (& thus confidence intervals)
    • Retains reasonable levels of uncertainty
    • Flexibly accommodates many predictors
  • Currently most recommended in health and social sciences
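
The standard errors come from combining the results of the separate analyses. The usual way to do so is Rubin's rules, which the slides do not name. A minimal sketch follows; the function name pool_rubin and the example estimates are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()          # pooled point estimate
    within = variances.mean()          # average within-imputation variance
    between = estimates.var(ddof=1)    # between-imputation variance
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)

# Hypothetical results from m = 5 imputed data sets: an estimate and its SE from each.
est = [0.52, 0.47, 0.55, 0.50, 0.46]
se = [0.10, 0.11, 0.10, 0.12, 0.11]
pooled_est, pooled_se = pool_rubin(est, np.square(se))
print(f"pooled estimate = {pooled_est:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than the average within-imputation standard error because it adds the disagreement between imputed data sets. Single imputation has no such term.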

Steps to Decide What to Do

Adapted from Jakobsen et al. (2017)

Reporting Recommendations

  • Report:
    • Percent missing by variable
    • Rationale for MCAR/MAR assumptions (and Little’s test if used)
    • MI settings (method, variables included, number of imputations)
    • Diagnostics and any sensitivity checks
      • E.g., results after MI vs. results with raw data:
        “Results were substantively unchanged compared to complete-case analyses of the non-imputed data.”

Additional Resources