An R package for designing and running multiverse analysis • multitool

Plan · Analyze · Explore

Installation

Install from CRAN:

install.packages("multitool")

You can install the development version of multitool from GitHub with:

# install.packages("devtools")
devtools::install_github("ethan-young/multitool")

Motivation

The goal of multitool is to provide a set of tools for designing and running multiverse-style analyses. I designed it to help users create an incremental workflow for slowly building up, keeping track of, and unpacking multiverse analyses and results.

Multiverse Primer

For those unfamiliar with multiverse analysis, here is a short primer:

Beyond Multiverse

I designed multitool to do multiverse analysis, but it’s really just a tool for exploration.

In any new field, area, or project, there is a lot of uncertainty about which data analysis decisions to make. Clear research questions and criteria help reduce that uncertainty, but they never fully eliminate it. multitool helps organize and systematically explore different options. That’s really it.

Design

I designed multitool to help users take a single use case (e.g., a single analysis pipeline) and expand it into a workflow to include alternative versions of the same analysis.

For example, imagine you would like to take some data, remove outliers, transform variables, run a linear model, do a post-hoc analysis, and plot the results. multitool can take these tasks and transform them into a blueprint, which provides instructions for running your analysis pipeline.

The functions were designed to work well with the tidyverse and require the base R pipe |>. This makes it easy to quickly convert a single analysis into a multiverse analysis.
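To make the starting point concrete, here is a minimal sketch of a single, fixed analysis pipeline written with base R only. The data set and the names example_data, include, x_z, and single_fit are made up for illustration and are not part of multitool; the sketch shows one set of choices, which is the kind of pipeline a blueprint would then expand into alternatives.

```r
# Hypothetical data: one predictor, one outcome, one inclusion flag
set.seed(1)
example_data <- data.frame(
  x       = rnorm(100),
  y       = rnorm(100),
  include = rbinom(100, size = 1, prob = 0.9)
)

# One fixed pipeline: filter rows, standardize the predictor, fit a linear model
single_fit <-
  example_data |>
  subset(include == 1) |>
  transform(x_z = as.numeric(scale(x))) |>
  (\(d) lm(y ~ x_z, data = d))()

coef(single_fit)
```

Every step here hard-codes one decision (which rows to keep, how to transform, which variables to model). A multiverse analysis asks what happens when each of those decisions is swapped for a reasonable alternative.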

Basic components

My vision of a multitool workflow contains five steps:

multitool cannot make decisions for you but – once you know your set of data decisions – it can help you create and organize them into the workflow above.

A defining feature of multitool is that it saves your code. This allows the user to grab the code that produces a result and inspect it for accuracy, errors, or simply for peace of mind. By quickly grabbing code, the user can iterate between creating their blueprint and checking that the code works as intended.
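The idea of saving code rather than only results can be illustrated with base R's quote(), deparse(), and eval(). This is only a sketch of the general technique, under the assumption that a step is stored as an unevaluated expression; it is not how multitool is implemented, and filter_step and example_data are hypothetical names.

```r
# Store a filtering step as unevaluated code rather than as a result
filter_step <- quote(subset(example_data, include == 1))

# The stored step can be turned back into text and inspected
step_text <- paste(deparse(filter_step), collapse = " ")
step_text

# The same step can be evaluated later, once the data exist
example_data <- data.frame(id = 1:5, include = c(1, 0, 1, 1, 0))
filtered <- eval(filter_step)
nrow(filtered)
```

Because the code itself is kept, a result can always be traced back to the exact expression that produced it.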

multitool allows the user to model data however they’d like. The user is responsible for loading the relevant modeling packages. Regardless of your model choice, multitool will capture your code and build a blueprint with alternative analysis pipelines.

Finally, multiverse analyses were originally intended to look at how model parameters shift as a function of arbitrary data decisions. However, any computation might change depending on how you slice and dice the data. For this reason, I also built functions for computing descriptive, correlation, and reliability analyses alongside a particular modeling pipeline.

Usage

# load packages
library(tidyverse)
library(multitool)

# create some data
the_data <-
  data.frame(
    id  = 1:100,
    iv1 = rnorm(100),
    iv2 = rnorm(100),
    iv3 = rnorm(100),
    mod = rnorm(100),
    dv1 = rnorm(100),
    dv2 = rnorm(100),
    include1 = rbinom(100, size = 1, prob = .1),
    include2 = sample(1:3, size = 100, replace = TRUE),
    include3 = rnorm(100)
  )

# create a pipeline blueprint
full_pipeline <- 
  the_data |>
  add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> 
  add_variables(var_group = "ivs", iv1, iv2, iv3) |> 
  add_variables(var_group = "dvs", dv1, dv2) |> 
  add_model("linear model", lm({dvs} ~ {ivs} * mod))

full_pipeline
#> # A tibble: 12 × 3
#>    type      group        code                          
#>    <chr>     <chr>        <chr>                         
#>  1 filters   include1     include1 == 0                 
#>  2 filters   include1     include1 %in% unique(include1)
#>  3 filters   include2     include2 != 3                 
#>  4 filters   include2     include2 %in% unique(include2)
#>  5 filters   include3     include3 > -2.5               
#>  6 filters   include3     include3 %in% unique(include3)
#>  7 variables ivs          iv1                           
#>  8 variables ivs          iv2                           
#>  9 variables ivs          iv3                           
#> 10 variables dvs          dv1                           
#> 11 variables dvs          dv2                           
#> 12 models    linear model lm({dvs} ~ {ivs} * mod)
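The {dvs} and {ivs} placeholders in the model row stand for the variable groups defined with add_variables(). One way to picture how such a template might be filled in for a single decision is plain string substitution. The sketch below uses base R only; fill_template() and toy_data are hypothetical names for illustration, and multitool's own mechanism may differ.

```r
# Hypothetical helper: replace each {name} placeholder with its value
fill_template <- function(template, values) {
  for (name in names(values)) {
    template <- gsub(paste0("{", name, "}"), values[[name]], template, fixed = TRUE)
  }
  template
}

# Fill the formula part of the model template for one combination of variables
formula_template <- "{dvs} ~ {ivs} * mod"
filled <- fill_template(formula_template, list(dvs = "dv1", ivs = "iv2"))
filled

# Fit the resulting model on made-up data
set.seed(42)
toy_data <- data.frame(iv2 = rnorm(50), mod = rnorm(50), dv1 = rnorm(50))
fit <- lm(as.formula(filled), data = toy_data)
names(coef(fit))
```

Each combination of one predictor and one outcome yields a different filled-in formula, and therefore a different model.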

# Expand your blueprint into a grid
expanded_pipeline <- expand_decisions(full_pipeline)
expanded_pipeline
#> # A tibble: 48 × 4
#>    decision variables        filters          models          
#>    <chr>    <list>           <list>           <list>          
#>  1 1        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  2 2        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  3 3        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  4 4        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  5 5        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  6 6        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  7 7        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  8 8        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#>  9 9        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#> 10 10       <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 2]>
#> # ℹ 38 more rows
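The 48 rows follow from simple counting. Each of the three filters appears in the blueprint alongside an alternative of the form x %in% unique(x), which keeps every row, so each filter contributes two options. Combined with three predictors, two outcomes, and one model, that gives 2 × 2 × 2 × 3 × 2 × 1 = 48 pipelines. The base R sketch below reproduces the count with expand.grid(); decision_grid is a hypothetical name, and the row order need not match multitool's decision numbering.

```r
# Every combination of filter option, predictor, and outcome
decision_grid <- expand.grid(
  include1 = c("include1 == 0", "include1 %in% unique(include1)"),
  include2 = c("include2 != 3", "include2 %in% unique(include2)"),
  include3 = c("include3 > -2.5", "include3 %in% unique(include3)"),
  ivs      = c("iv1", "iv2", "iv3"),
  dvs      = c("dv1", "dv2"),
  stringsAsFactors = FALSE
)

nrow(decision_grid)
#> [1] 48

# The "do nothing" filter alternative keeps every row, including missing values
x <- c(1, NA, 3)
all(x %in% unique(x))
#> [1] TRUE
```

Each model in this example has four coefficients (intercept, predictor, moderator, and their interaction), which is consistent with the 192 rows (48 × 4) that reveal_model_parameters() returns further down.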

# Run the blueprint
multiverse_results <- run_multiverse(expanded_pipeline)
multiverse_results
#> # A tibble: 48 × 4
#>    decision specifications   model_fitted     pipeline_code   
#>    <chr>    <list>           <list>           <list>          
#>  1 1        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  2 2        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  3 3        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  4 4        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  5 5        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  6 6        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  7 7        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  8 8        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#>  9 9        <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#> 10 10       <tibble [1 × 3]> <tibble [1 × 5]> <tibble [1 × 2]>
#> # ℹ 38 more rows

# Unpack model coefficients
multiverse_results |> 
  reveal_model_parameters()
#> # A tibble: 192 × 20
#>    decision specifications   model_function parameter  unstd_coef    se unstd_ci
#>    <chr>    <list>           <chr>          <chr>           <dbl> <dbl>    <dbl>
#>  1 1        <tibble [1 × 3]> lm             (Intercep…    0.0609  0.114     0.95
#>  2 1        <tibble [1 × 3]> lm             iv1           0.0220  0.118     0.95
#>  3 1        <tibble [1 × 3]> lm             mod          -0.00847 0.138     0.95
#>  4 1        <tibble [1 × 3]> lm             iv1:mod       0.187   0.148     0.95
#>  5 2        <tibble [1 × 3]> lm             (Intercep…   -0.289   0.133     0.95
#>  6 2        <tibble [1 × 3]> lm             iv1           0.106   0.137     0.95
#>  7 2        <tibble [1 × 3]> lm             mod          -0.0277  0.160     0.95
#>  8 2        <tibble [1 × 3]> lm             iv1:mod       0.0471  0.172     0.95
#>  9 3        <tibble [1 × 3]> lm             (Intercep…    0.0839  0.113     0.95
#> 10 3        <tibble [1 × 3]> lm             iv2          -0.0919  0.119     0.95
#> # ℹ 182 more rows
#> # ℹ 13 more variables: unstd_ci_low <dbl>, unstd_ci_high <dbl>, t <dbl>,
#> #   df_error <int>, p <dbl>, std_coef <dbl>, std_ci <dbl>, std_ci_low <dbl>,
#> #   std_ci_high <dbl>, model_performance <list>, model_warnings <list>,
#> #   model_messages <list>, pipeline_code <list>

# Unpack model fit statistics
multiverse_results |> 
  reveal_model_performance()
#> # A tibble: 48 × 14
#>    decision specifications   model_function model_parameters   aic  aicc   bic
#>    <chr>    <list>           <chr>          <list>           <dbl> <dbl> <dbl>
#>  1 1        <tibble [1 × 3]> lm             <prmtrs_m>        182.  183.  193.
#>  2 2        <tibble [1 × 3]> lm             <prmtrs_m>        202.  203.  213.
#>  3 3        <tibble [1 × 3]> lm             <prmtrs_m>        182.  183.  193.
#>  4 4        <tibble [1 × 3]> lm             <prmtrs_m>        202.  203.  213.
#>  5 5        <tibble [1 × 3]> lm             <prmtrs_m>        182.  183.  193.
#>  6 6        <tibble [1 × 3]> lm             <prmtrs_m>        200.  201.  211.
#>  7 7        <tibble [1 × 3]> lm             <prmtrs_m>        182.  183.  193.
#>  8 8        <tibble [1 × 3]> lm             <prmtrs_m>        202.  203.  213.
#>  9 9        <tibble [1 × 3]> lm             <prmtrs_m>        182.  183.  193.
#> 10 10       <tibble [1 × 3]> lm             <prmtrs_m>        202.  203.  213.
#> # ℹ 38 more rows
#> # ℹ 7 more variables: r2 <dbl>, r2_adjusted <dbl>, rmse <dbl>, sigma <dbl>,
#> #   model_warnings <list>, model_messages <list>, pipeline_code <list>

# Summarize model coefficients
multiverse_results |> 
  reveal_model_parameters() |> 
  group_by(parameter) |> 
  condense(unstd_coef, list(mean = mean, median = median, sd = sd))
#> # A tibble: 8 × 5
#>   parameter   unstd_coef_mean unstd_coef_median unstd_coef_sd unstd_coef_list
#>   <chr>                 <dbl>             <dbl>         <dbl> <list>         
#> 1 (Intercept)        -0.0834           -0.0628         0.162  <dbl [48]>     
#> 2 iv1                 0.0617            0.0642         0.0498 <dbl [16]>     
#> 3 iv1:mod             0.0659            0.0590         0.0841 <dbl [16]>     
#> 4 iv2                -0.0238           -0.0243         0.0549 <dbl [16]>     
#> 5 iv2:mod             0.125             0.112          0.0323 <dbl [16]>     
#> 6 iv3                -0.138            -0.166          0.0631 <dbl [16]>     
#> 7 iv3:mod            -0.0116           -0.0127         0.0563 <dbl [16]>     
#> 8 mod                -0.00679          -0.00978        0.0281 <dbl [48]>
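Judging from the column names in the output above, condense() appears to behave much like a grouped dplyr summary that also keeps the raw values in a list column. The sketch below mimics that shape on made-up numbers; toy_parameters is a hypothetical stand-in for the output of reveal_model_parameters(), and this is an approximation of the idea rather than the package's implementation.

```r
library(dplyr)

# Hypothetical stand-in for unpacked model parameters
toy_parameters <- tibble(
  decision   = rep(c("1", "2", "3"), each = 2),
  parameter  = rep(c("(Intercept)", "mod"), times = 3),
  unstd_coef = c(0.10, -0.20, 0.30, 0.00, -0.10, 0.20)
)

# Summarize each parameter across decisions and keep the raw values
toy_summary <-
  toy_parameters |>
  group_by(parameter) |>
  summarize(
    unstd_coef_mean   = mean(unstd_coef),
    unstd_coef_median = median(unstd_coef),
    unstd_coef_sd     = sd(unstd_coef),
    unstd_coef_list   = list(unstd_coef)
  )

toy_summary
```

Keeping the raw values in a list column makes it possible to plot or re-summarize the full distribution later without rerunning the multiverse.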

# Summarize fit statistics
multiverse_results |> 
  reveal_model_performance() |> 
  condense(r2, list(mean = mean, sd = sd))
#> # A tibble: 1 × 3
#>   r2_mean  r2_sd r2_list   
#>     <dbl>  <dbl> <list>    
#> 1  0.0206 0.0140 <dbl [48]>