misl

Note: This package is currently experimental and under active development. The API may change. Feedback and bug reports are welcome via GitHub Issues.

Overview

misl implements Multiple Imputation by Super Learning (MISL), a flexible approach to handling missing data that uses a stacked ensemble of machine learning algorithms to impute missing values across continuous, binary, and categorical variables.

Rather than relying on a single parametric imputation model, MISL builds a super learner for each incomplete variable using the tidymodels framework, combining learners such as linear/logistic regression, random forests, gradient boosted trees, and MARS to produce well-calibrated imputations.

The method is described in:

Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904–1915. doi: 10.1177/09622802221104238
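
To make the stacking idea in the overview concrete, here is a minimal base-R sketch of a super learner for a single continuous outcome. It is not misl's implementation (which builds on tidymodels) and is only an illustration under stated assumptions: the simulated data, the two candidate learners (`linear`, `cubic`), and the convex-combination weighting are all chosen here for brevity.

```r
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + 0.5 * x + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Two illustrative candidate learners: each fits on `train`, predicts on `test`
learners <- list(
  linear = function(train, test) predict(lm(y ~ x, data = train), newdata = test),
  cubic  = function(train, test) predict(lm(y ~ poly(x, 3), data = train), newdata = test)
)

# Step 1: out-of-fold predictions from every learner (5-fold cross-validation)
folds <- sample(rep(1:5, length.out = n))
cv_pred <- matrix(NA_real_, nrow = n, ncol = length(learners),
                  dimnames = list(NULL, names(learners)))
for (k in 1:5) {
  held_out <- folds == k
  for (j in seq_along(learners)) {
    cv_pred[held_out, j] <- learners[[j]](dat[!held_out, ], dat[held_out, ])
  }
}

# Step 2: pick the convex weight that minimises cross-validated squared error
cv_risk <- function(w) {
  mean((y - (w * cv_pred[, "linear"] + (1 - w) * cv_pred[, "cubic"]))^2)
}
w_linear <- optimize(cv_risk, interval = c(0, 1))$minimum
weights <- c(linear = w_linear, cubic = 1 - w_linear)

# Step 3: refit each learner on all data and combine with the learned weights
full_pred <- sapply(learners, function(f) f(dat, dat))
ensemble_pred <- as.vector(full_pred %*% weights)
```

Because the weights are estimated from out-of-fold predictions, a learner that overfits the training data is not rewarded for doing so. With more than two learners, the one-dimensional `optimize()` step would typically be replaced by a constrained regression over all learner columns.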

Installation

misl is not yet on CRAN. Install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("JustinManjourides/misl")

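The following backend packages are optional but recommended, since they supply the random forest (ranger), gradient boosted tree (xgboost), and MARS (earth) learners: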

install.packages(c("ranger", "xgboost", "earth"))
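
Before fitting, it can be useful to check which of these optional backends are actually present. The snippet below is a small base-R sketch; the variable names are illustrative and nothing here is part of the misl API.

```r
backends <- c("ranger", "xgboost", "earth")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(backends, requireNamespace, logical(1), quietly = TRUE)
missing_backends <- backends[!installed]

if (length(missing_backends) > 0) {
  message("Optional backends not installed: ",
          paste(missing_backends, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(missing_backends)
}
```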

Quick Start

library(misl)

# Simulate a complete dataset, then introduce missingness into each variable
set.seed(42)
n <- 200
demo_data <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),
  weight = rnorm(n, mean = 70, sd = 15),
  smoker = rbinom(n, 1, 0.3),
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
demo_data[sample(n, 20), "age"]    <- NA
demo_data[sample(n, 15), "weight"] <- NA
demo_data[sample(n, 10), "smoker"] <- NA
demo_data[sample(n, 10), "group"]  <- NA

# Run MISL with 5 imputations, 5 iterations, and an explicit set of learners
# for continuous, binary, and categorical variables
misl_imp <- misl(
  demo_data,
  m          = 5,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("rand_forest", "multinom_reg")
)

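# misl_imp holds one element per imputation; the i-th completed dataset is
# misl_imp[[i]]$datasets. For example, the first one:
completed_data <- misl_imp[[1]]$datasets

# Each element also stores trace information for inspecting convergence:
trace <- misl_imp[[1]]$trace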
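
Once the m completed datasets are available, an analysis is usually run on each one and the results are pooled with Rubin's rules. The sketch below shows that pattern in base R. So that it runs on its own, it uses a simulated stand-in (`mock_imp`, a hypothetical name) that mirrors the `[[i]]$datasets` structure shown above; the values are simulated and are not real imputations.

```r
set.seed(42)
m <- 5

# Stand-in for the object returned by misl(): a list of length m whose
# elements each hold a completed data frame in `$datasets` (assumed structure)
mock_imp <- lapply(seq_len(m), function(i) {
  age <- rnorm(200, mean = 50, sd = 10)
  list(datasets = data.frame(
    age    = age,
    weight = 70 + 0.3 * (age - 50) + rnorm(200, sd = 15)
  ))
})

# Fit the same analysis model to every completed dataset
fits <- lapply(mock_imp, function(imp) lm(weight ~ age, data = imp$datasets))

# Point estimate and squared standard error of the age slope in each fit
est <- vapply(fits, function(f) coef(f)[["age"]], numeric(1))
se2 <- vapply(fits, function(f) vcov(f)["age", "age"], numeric(1))

# Rubin's rules: total variance = within + (1 + 1/m) * between
pooled_est  <- mean(est)
within_var  <- mean(se2)
between_var <- var(est)
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)
```

With a real fit, replacing `mock_imp` with `misl_imp` should be the only change needed, assuming the structure shown in the Quick Start.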

Parallelism

Imputation across the m datasets is parallelised via the future framework. To enable parallel execution, set a plan before calling misl():

library(future)
plan(multisession, workers = 4)

misl_imp <- misl(demo_data, m = 5, maxit = 5)

plan(sequential)  # reset when done
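
A slightly more defensive variant of the pattern above picks the number of workers from the machine and restores whatever plan was active before, rather than forcing sequential. This is a sketch using only functions from the future package; `n_workers` and `old_plan` are illustrative names, and the cap at m reflects that parallelism is across the m datasets.

```r
library(future)

m <- 5

# Leave one core free, never request more workers than imputations,
# and always keep at least one worker
n_workers <- as.integer(max(1, min(m, availableCores() - 1)))

# plan() returns the previously active plan invisibly, so it can be restored
old_plan <- plan(multisession, workers = n_workers)
n_active <- nbrOfWorkers()

# ... call misl(demo_data, m = m, maxit = 5) here ...

# Restore the previous plan instead of assuming it was sequential
plan(old_plan)
```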

Available Learners

# View all available learners
list_learners()

# Filter by outcome type
list_learners("continuous")
list_learners("categorical")

# Show only installed learners
list_learners(installed_only = TRUE)

Citation

If you use misl in your research, please cite the original paper:

Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904–1915. doi: 10.1177/09622802221104238

BibTeX:

@article{carpenito2022misl,
  author  = {Carpenito, T and Manjourides, J},
  title   = {{MISL}: Multiple imputation by super learning},
  journal = {Statistical Methods in Medical Research},
  year    = {2022},
  volume  = {31},
  number  = {10},
  pages   = {1904--1915},
  doi     = {10.1177/09622802221104238}
}