Handling Missing Data

Overview

General Considerations
Reasons for Missing Data
Testing Type of Missingness
Strategies for Handling Missing Data

General Considerations

Prevalence of Missing Data

Dong & Peng (2013) summarized sources finding that:
- 48% of exp. psych. & education research articles from 1998 – 2004 had missing data
  - Up to 16% more may have; couldn’t be sure
- Amount of missingness typically 15 – 20%
- 97% of studies handled missingness solely through list- or pairwise deletion

Prevalence of Missing Data (cont.)

Bell et al. (2014), randomized clinical trials published July – Dec., 2013, in the BMJ, JAMA, Lancet, & New England Journal of Medicine (n = 77):
- 95% reported missing outcome data
- Methods used to handle missing data:
  - Listwise deletion: 45%
  - Single imputation: 27%
  - Model-based methods: 19%
  - Multiple imputation: 8%

Common Effects of Missing Data

Ignoring missingness can:
- Reduce statistical power
- Bias results
- Distort size & even direction of associations

Acceptable Levels of Missingness

< 5%
- “[O]ften inconsequential” (Schafer, 1999)
- “Complete case analysis may be used as the primary analysis if the proportions of missing data are below approximately 5%” (Jakobsen et al., 2017)
5 – 10%
- Manageable, esp. through imputation (Graham, 2009)

Acceptable Levels of Missingness (cont.)

10 – 20%
- “[C]arefully evaluate the mechanism of missingness” (Enders, 2022)
> 20%
- “[C]an seriously compromise statistical power and validity, even with imputation” (Enders, 2022)
> 40%
- “[T]rial results may only be considered as hypothesis generating results” (Jakobsen et al., 2017)
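
As a rough illustration of how the bands above might be applied, the sketch below computes the percent missing for each variable and maps it onto that guidance. The toy records, the function names, and the exact handling of the band edges are assumptions made for the example; the cited authors give approximate thresholds, not hard cutoffs.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": 41, "income": None, "score": 5},
    {"age": None, "income": None, "score": 6},
    {"age": 29, "income": 48000, "score": None},
    {"age": 50, "income": None, "score": 8},
]


def percent_missing(rows):
    """Return {variable: percent of rows where the value is None}."""
    variables = rows[0].keys()
    return {
        var: 100 * sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }


def guidance(pct):
    """Map a percent missing onto the rough bands listed above.

    Band edges are an assumption of this sketch, not fixed rules.
    """
    if pct < 5:
        return "often inconsequential"
    if pct <= 10:
        return "manageable, esp. through imputation"
    if pct <= 20:
        return "carefully evaluate the mechanism"
    if pct <= 40:
        return "power and validity may be compromised"
    return "treat results as hypothesis generating"


for var, pct in percent_missing(rows).items():
    print(f"{var}: {pct:.0f}% missing -> {guidance(pct)}")
```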

Reasons for Missing Data

Three data-related reasons values may be missing:
1. Not missing at random (MNAR): I.e., missingness is associated with other values in our data, both known & unknown
   - We cannot safely impute missing values from the known (non-missing) data, since the missingness is at least in part due to unknown (missing) values
2. Missing at random (MAR): I.e., due only to known values in our data
   - We could impute missing values, but…
3. Missing completely at random (MCAR): I.e., due entirely to chance, or at least to factors outside of our data
   - We can likely impute missing values
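
A small simulation can make the three mechanisms concrete. The sketch below is only an illustration under assumed conditions: the variable names (stress, depression), the sample size, and the missingness probabilities are all hypothetical. It deletes depression scores under each mechanism and compares the mean of what remains with the true mean.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: stress scores (always observed) and
# depression scores (sometimes missing), positively related.
n = 20000
stress = [random.gauss(0, 1) for _ in range(n)]
depression = [s + random.gauss(0, 1) for s in stress]


def observed_mean(values, p_missing):
    """Mean of the values that remain, given each case's chance of being missing."""
    kept = [v for v, p in zip(values, p_missing) if random.random() >= p]
    return statistics.mean(kept)


# MCAR: every case has the same 30% chance of being missing.
mcar = observed_mean(depression, [0.3] * n)
# MAR: missingness depends only on the observed stress score.
mar = observed_mean(depression, [0.5 if s > 0 else 0.1 for s in stress])
# MNAR: missingness depends on the unobserved depression score itself.
mnar = observed_mean(depression, [0.5 if d > 0 else 0.1 for d in depression])

true_mean = statistics.mean(depression)
print(f"true mean: {true_mean:.2f}")
print(f"MCAR: {mcar:.2f}  MAR: {mar:.2f}  MNAR: {mnar:.2f}")
```

In this setup the MCAR mean should land close to the true mean, while the MAR and MNAR means are pulled downward. The difference is that under MAR the cause of the distortion (stress) is in the data and can inform imputation; under MNAR it is not.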

Why MNAR Is Problematic

Compromises external (& perhaps internal) validity
- We know some population members are poorly sampled
- But we do not know who they are
Since missing values depend (at least in part) on unknowable values,
- We cannot accurately impute missing values
With MNAR, the best we can do is use list/pairwise deletion
- And do our darned best to ensure unknown values aren’t biasing our results (& report that we did)

Data Are Often Not Missing at Random

E.g., Silverwood et al. (2024) found more missing data among:
- Males
- Those of lower socio-economic statuses
- Those with mental health disorders (also de Graf, 2000; Ahern & le Brocque, 2005)
Also those with:
- Substance use disorders (Gmel, 2000)
- Chronic diseases & functional limitations (Hardy et al., 2009; Goldberg et al., 2006)

The Problem of MAR vs. MNAR

Theoretically, we could impute a given variable’s missing values if their actual values depended only on known (i.e., non-missing) values in our data
However, we don’t know what the missing values are,
- So we can’t test them to see if they’re only affected by known and not also unknown values
- I.e., we don’t have the information we need to distinguish between MAR & MNAR

But we can tell if they’re related to the ones we do know
- So, we can test if data are missing completely at random

Testing Type of Missingness

General Strategy

Review for patterns
- E.g., if participants seemed to end participation early
- If certain items/tasks tended to be missed together
  - E.g., were sensitive issues or certain topics avoided?
- If certain subgroups had more missing values
If no patterns jump out, we’re justified to proceed & test formally if data are missing completely at random
- Via Little’s MCAR test

Little’s MCAR Test

Tests whether data are Missing Completely at Random
Uses a χ² test of association
- Tests whether the values of known data are associated with the prevalence of missing ones
  - I.e., are those with certain known variable values more likely to have missing data than those with other values?

Little’s MCAR Test (cont.)

If p < .05, known values significantly predict missingness
- Missingness is not MCAR; it is at least “at random” (MAR),
  - Which is not empirically distinguishable from MNAR
If p > .05, known values are not sig. related to missingness
- Consistent with MCAR
N.b., like any significance test, Little’s test is sensitive to sample size
- And hypothesis tests aren’t good at “proving the null”
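
The logic of asking whether known values are associated with missingness can be shown with a single known variable. The sketch below is not Little's test itself, which considers all variables and missingness patterns at once; it is a simplified one-variable analogue using a Pearson χ² test on a 2×2 table. The counts and the function name are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the
    p-value equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: rows are a known grouping variable,
# columns are whether the outcome is missing or observed.
#               missing  observed
# group 1          30       70
# group 2          15       85
stat, p = chi_square_2x2(30, 70, 15, 85)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

A small p-value here would suggest that group membership predicts missingness, which is inconsistent with MCAR. A large one would not prove MCAR, for the reasons noted above.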

Strategies for Handling Missing Data

Removing/Replacing Missing Values

Techniques to remove or replace missing data:
- Listwise deletion (aka complete case analysis)
- Available case analysis
- Single imputation
- Multiple imputation

Listwise Deletion / Complete Case Analysis

Using only rows (cases) where there are no missing data
Most common approach
Defensible with low levels of missingness
- When missingness is low, non-randomness has minimal effect
- Imputation can still help, but—well—minimally
Note, will reduce power (i.e., increase standard errors)

Available Case Analysis

Using all available data for a given analysis
- Even if there are data missing for other variables
- Pairwise deletion is an example of this
Has otherwise similar recommendations as listwise
- While using more of the data
- And thus retaining more power than listwise
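
The difference between listwise deletion and available case analysis comes down to which rows each analysis keeps. A minimal sketch with hypothetical variables x, y, and z:

```python
# Hypothetical data; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 4.0, "z": None},
    {"x": 4.0, "y": 5.0, "z": 6.0},
    {"x": None, "y": 7.0, "z": 8.0},
]


def complete_cases(rows, variables):
    """Keep rows with no missing value on any of the given variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# Listwise deletion: one shared set of rows, complete on every variable.
listwise = complete_cases(rows, ["x", "y", "z"])

# Available case (pairwise) analysis: each analysis keeps the rows that
# are complete on just the variables it uses.
pair_xy = complete_cases(rows, ["x", "y"])
pair_yz = complete_cases(rows, ["y", "z"])

print(len(listwise), len(pair_xy), len(pair_yz))
```

Each pairwise analysis keeps more rows than the listwise set, which is where the extra power comes from. The trade-off is that different analyses then rest on different subsets of participants.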

Single Imputation

Replacing each missing value with a single estimated value
E.g., replacing missing values with:
- The mean (or median) of available values for that variable
- A value estimated through linear regression, etc.
- The “last observation carried forward”
  - Replacing missing values with that participant’s last known value
- The “worst observation carried forward”
  - Replacing missing with participant’s worst observed value
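
Two of the single-imputation rules listed above, mean imputation and last observation carried forward, might look like this in code. The function names and toy values are illustrative only.

```python
import statistics


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    mean = statistics.mean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]


def locf(values):
    """Last observation carried forward within one participant's series.

    Leading missing values stay None because there is nothing to carry.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


scores = [4, None, 6, None, 8]     # one variable across participants
visits = [10, 12, None, None, 9]   # one participant across visits

print(mean_impute(scores))
print(locf(visits))
```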

Single Imputation (cont.)

Single imputation is easy to understand & implement
- Requires little computing power or statistical prowess
But:
- Under-estimates variance
- Often over-estimates relationships
No longer recommended
- But perhaps best that can be done without using R or a special SPSS add-on
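
The under-estimation of variance is easy to see with mean imputation: every imputed value sits exactly at the mean, so the spread shrinks. A tiny sketch with made-up numbers:

```python
import statistics

observed = [2, 6, 10]   # values we actually have
n_missing = 2           # hypothetical number of missing values

# Mean imputation: every missing value becomes the observed mean.
imputed = observed + [statistics.mean(observed)] * n_missing

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
print(var_observed, var_imputed)
```

The imputed data set has more rows but a smaller sample variance than the observed values alone, so standard errors computed from it would look more precise than the data justify.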

Multiple Imputation

Replaces each missing value with plausible values estimated from available data
Values often estimated using iterative methods (e.g., Markov chain Monte Carlo or chained regression models)
- Use most-complete variables to compute least-complete
  - Use all of those values to estimate next least-complete, etc.
- Usually compute several possible values for each missing value, giving several completed data sets
  - Analyze each one & use conventions (Rubin’s rules) to pool the results
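
Rubin's rules combine the results from each imputed data set into one estimate and one standard error. The pooled estimate is the mean of the per-data-set estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below applies that to hypothetical regression slopes; the function name and numbers are made up for illustration.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: the parameter estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Hypothetical slopes and squared SEs from m = 5 imputed data sets.
slopes = [0.42, 0.47, 0.39, 0.45, 0.44]
squared_ses = [0.010, 0.011, 0.009, 0.010, 0.012]
estimate, se = pool_rubin(slopes, squared_ses)
print(f"pooled slope = {estimate:.3f}, SE = {se:.3f}")
```

The between-imputation term is what carries the extra uncertainty due to missingness into the final standard error, which single imputation leaves out.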

Multiple Imputation (cont.)

Computationally—& conceptually—complex & intensive
But:
- Produces valid standard errors (& thus confidence intervals)
- Retains reasonable levels of uncertainty
- Flexibly accommodates many predictors
Currently most recommended in health and social sciences

Steps to Decide What to Do

Adapted from Jakobsen et al. (2017)

Reporting Recommendations

Report:
- Percent missing by variable
- Rationale for MCAR/MAR assumptions (and Little’s test if used)
- MI settings (method, variables included, number of imputations)
- Diagnostics and any sensitivity checks
  - E.g., results after MI vs. results with raw data:
    “Results were substantively unchanged compared to complete-case analyses of the non-imputed data.”

Additional Resources

Bennett, D. A. (2001). How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health, 25, 464–469.
Dong, Y., & Peng, C.-Y. J. (2013). Principled missing data methods for researchers. SpringerPlus, 2, 222. https://doi.org/10.1186/2193-1801-2-222
Enders, C. K. (2010). Applied Missing Data Analysis. Guilford Press.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576. https://doi.org/10.1146/annurev.psych.58.110405.085530
Jakobsen, J. C., Gluud, C., Wetterslev, J., et al. (2017). When and how should multiple imputation be used for handling missing data in randomised clinical trials: A practical guide with flowcharts. BMC Medical Research Methodology, 17, 162. https://doi.org/10.1186/s12874-017-0442-1
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3–15. https://doi.org/10.1177/096228029900800102
