Defining an Analysis Pipeline Blueprint
Setting up a data analysis pipeline
Imagine we have some data with several predictor variables, moderators, covariates, and dependent measures. We want to know if our predictors (ivs) interact with our moderators (mods) to predict the outcome (dvs).
But we have three versions of our predictor that (supposedly) measure the same thing, albeit in slightly different ways.
In addition, because we collected messy data from the real world (not really, but let’s pretend), we have some idea of which observations to include and which we might exclude (e.g., include1, include2, include3).
Create a blueprint
Say we don’t know much about this new and exciting area of research.
We want to maximize our knowledge but we also want to be systematic. One approach would be to specify a reasonable analysis pipeline. Something that looks like the following:
library(dplyr)

# Filter out exclusions
filtered_data <-
  the_data |>
  filter(
    include1 == 0,   # --
    include2 != 3,   # Exclusion criteria
    include3 > -2.5  # --
  )

# Model the data
my_model <- lm(dv1 ~ iv1 * mod, data = filtered_data)

# Check the results
my_results <- parameters::parameters(my_model)
But what if there are valid alternatives to this pipeline? For example, using iv2 instead of iv1, or only using two exclusion criteria instead of three? A straightforward approach would be to copy the code above, paste it, and edit it with different decisions.
This quickly becomes tedious. It adds many lines of code and many new objects, and it is difficult to keep track of in a systematic way.
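To make the bookkeeping concrete, here is a minimal base R sketch of handling just one decision (which predictor version to use) by hand. The simulated data frame, its column values, and the object names are hypothetical stand-ins for illustration; they are not the vignette's actual data.

```r
# Simulated stand-in data (hypothetical; only the column names follow the text)
set.seed(1)
n <- 200
the_data <- data.frame(
  iv1 = rnorm(n), iv2 = rnorm(n), iv3 = rnorm(n),
  mod = rnorm(n), dv1 = rnorm(n),
  include1 = rbinom(n, 1, 0.2),
  include2 = sample(1:3, n, replace = TRUE),
  include3 = rnorm(n)
)

# Apply all three exclusion criteria once
filtered_data <- subset(
  the_data,
  include1 == 0 & include2 != 3 & include3 > -2.5
)

# Fit one model per predictor version by building each formula from text
ivs <- c("iv1", "iv2", "iv3")
fits <- lapply(ivs, function(iv) {
  model_formula <- as.formula(paste0("dv1 ~ ", iv, " * mod"))
  lm(model_formula, data = filtered_data)
})
names(fits) <- ivs

# Each fit carries its own interaction term, e.g. "iv2:mod"
names(coef(fits[["iv2"]]))
```

Even this small loop covers only one of the decisions; crossing it with outcome versions and every on/off exclusion criterion means nesting further loops and tracking the results yourself.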
Enter multitool.
With multitool, the above analysis pipeline can be transformed into a specification blueprint for exploring all combinations of sensible data decisions in a pipeline. It was designed to leverage already written code (e.g., the filter statement above) to create all possible combinations of data analysis pipelines.
Filtering specifications
Our example above has three exclusion criteria. If we don’t know which are important, for example, because they are based on arbitrary ‘rules of thumb’ (that may or may not have inherent wisdom) or we don’t know if including/excluding these cases is valid, we can generate all combinations:
the_data |>
  add_filters(include1 == 0, include2 != 3, include3 > -2.5)
#> # A tibble: 6 × 3
#>   type    group    code
#>   <chr>   <chr>    <chr>
#> 1 filters include1 include1 == 0
#> 2 filters include1 include1 %in% unique(include1)
#> 3 filters include2 include2 != 3
#> 4 filters include2 include2 %in% unique(include2)
#> 5 filters include3 include3 > -2.5
#> 6 filters include3 include3 %in% unique(include3)
The output above is a simple tibble (i.e., data.frame) containing three columns.
Each row is a possible filter: the type column refers to the type of blueprint specification (see below for types other than filters), the group column refers to the variable in the base data frame (in our case the_data) to which the filter applies, and the code column contains the code needed to execute the filter.
For filtering decisions (e.g., exclusion criteria), a ‘do nothing’ alternative is always generated.
For example, perhaps some observations belong to a subgroup, include1 == 1. We may or may not have good reason to exclude these cases (this depends on the specific situation).
But imagine that we don’t know if we should include them or not. When the filter include1 == 0 (which drops that subgroup) is added to add_filters(), the ‘do nothing’ alternative include1 %in% unique(include1) is automatically generated so you can compare including versus excluding cases based on a criterion.
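The reason an expression of the form x %in% unique(x) works as a 'do nothing' filter can be checked in base R: every value is, by definition, a member of the set of its own unique values, so the condition is TRUE for every row. The toy data frame below is a hypothetical illustration, not the vignette's data.

```r
# Toy data (hypothetical), including a missing value
toy <- data.frame(
  id       = 1:6,
  include1 = c(0, 1, 0, 0, 1, NA)
)

# Exclusion criterion: keep only rows where include1 == 0
# (rows with NA are dropped, mirroring how dplyr::filter() treats NA)
keep_criterion <- !is.na(toy$include1) & toy$include1 == 0
excluded <- toy[keep_criterion, ]

# 'Do nothing' alternative: TRUE for every row, because %in% also
# matches NA against the NA contained in unique(toy$include1)
keep_all <- toy$include1 %in% unique(toy$include1)
untouched <- toy[keep_all, ]

nrow(excluded)   # 3
nrow(untouched)  # 6
```

Because the 'do nothing' version is itself a valid filter expression, both alternatives can be run through the same filtering code.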
Adding alternative analysis variables
Most multiverse-style analyses explore a range of exclusion criteria and their alternatives. However, sometimes alternative versions of a variable are also included.
In the social sciences, it is fairly common to have many measures of roughly the same construct (i.e., measured variable). For example, a happiness researcher might measure positive mood, life satisfaction, and/or a single item measuring happiness (e.g., ‘how happy do you feel?’).
If you want to explore the output of your pipeline with differing versions of a variable, you can use add_variables().
the_data |>
  add_variables(var_group = "ivs", iv1, iv2, iv3)
#> # A tibble: 3 × 3
#>   type      group code
#>   <chr>     <chr> <chr>
#> 1 variables ivs   iv1
#> 2 variables ivs   iv2
#> 3 variables ivs   iv3
The output above has the same tibble structure as the output of add_filters(). Each row is a particular decision to use a particular variable in your pipeline.
In contrast to filters, however, you need to tell add_variables() what to call each set of variables with the var_group argument. This is how multitool knows that each variable name in the code column is a different alternative of a larger set.
Here, var_group = "ivs" indicates that iv1, iv2, iv3 are all different versions of ivs. I used “ivs” as a way of indicating to myself that these are alternative versions of my main independent variable.
You can add as many variable sets as you want. For example, we might also want to analyze our two versions of the outcome, dv1 and dv2.
the_data |>
  add_variables(var_group = "ivs", iv1, iv2, iv3) |>
  add_variables(var_group = "dvs", dv1, dv2)
#> # A tibble: 5 × 3
#>   type      group code
#>   <chr>     <chr> <chr>
#> 1 variables ivs   iv1
#> 2 variables ivs   iv2
#> 3 variables ivs   iv3
#> 4 variables dvs   dv1
#> 5 variables dvs   dv2
Building up the blueprint
You can harness the real power of multitool by piping specification statements.
For example, perhaps we want to explore our exclusion criteria alternatives across different versions of our predictor and outcome variables. We can simply pipe new blueprint specifications into each other like so:
the_data |>
  add_filters(include1 == 0, include2 != 3, include3 > -2.5) |>
  add_variables(var_group = "ivs", iv1, iv2, iv3) |>
  add_variables(var_group = "dvs", dv1, dv2)
#> # A tibble: 11 × 3
#>    type      group    code
#>    <chr>     <chr>    <chr>
#>  1 filters   include1 include1 == 0
#>  2 filters   include1 include1 %in% unique(include1)
#>  3 filters   include2 include2 != 3
#>  4 filters   include2 include2 %in% unique(include2)
#>  5 filters   include3 include3 > -2.5
#>  6 filters   include3 include3 %in% unique(include3)
#>  7 variables ivs      iv1
#>  8 variables ivs      iv2
#>  9 variables ivs      iv3
#> 10 variables dvs      dv1
#> 11 variables dvs      dv2
Notice that we now have a specification blueprint with both exclusion alternatives and variable alternatives.
Adding a model
The whole point of building a specification blueprint is to eventually feed it to a model and examine the results.
You can add a model to your blueprint by using add_model(). I designed add_model() so the user can simply paste a model function. For example, our call to lm() can be simply pasted into add_model().
Make sure to give your model a label with the model_desc argument.
the_data |>
  add_filters(include1 == 0, include2 != 3, include3 > -2.5) |>
  add_variables(var_group = "ivs", iv1, iv2, iv3) |>
  add_variables(var_group = "dvs", dv1, dv2) |>
  add_model("linear model", lm(dv1 ~ iv1 * mod))
#> # A tibble: 12 × 4
#>    type      group        code                           additional_args
#>    <chr>     <chr>        <chr>                          <lgl>
#>  1 filters   include1     include1 == 0                  NA
#>  2 filters   include1     include1 %in% unique(include1) NA
#>  3 filters   include2     include2 != 3                  NA
#>  4 filters   include2     include2 %in% unique(include2) NA
#>  5 filters   include3     include3 > -2.5                NA
#>  6 filters   include3     include3 %in% unique(include3) NA
#>  7 variables ivs          iv1                            NA
#>  8 variables ivs          iv2                            NA
#>  9 variables ivs          iv3                            NA
#> 10 variables dvs          dv1                            NA
#> 11 variables dvs          dv2                            NA
#> 12 models    linear model lm(dv1 ~ iv1 * mod)            NA
Above, the model is completely unquoted. It also has no data argument. This is intentional; multitool is tracking the base dataset along the way (so you don’t have to). Note that you can still quote the model call, if that is more your style.
the_data |>
  add_filters(include1 == 0, include2 != 3, include3 > -2.5) |>
  add_variables(var_group = "ivs", iv1, iv2, iv3) |>
  add_variables(var_group = "dvs", dv1, dv2) |>
  add_model("linear model", "lm(dv1 ~ iv1 * mod)")
#> # A tibble: 12 × 4
#>    type      group        code                           additional_args
#>    <chr>     <chr>        <chr>                          <lgl>
#>  1 filters   include1     include1 == 0                  NA
#>  2 filters   include1     include1 %in% unique(include1) NA
#>  3 filters   include2     include2 != 3                  NA
#>  4 filters   include2     include2 %in% unique(include2) NA
#>  5 filters   include3     include3 > -2.5                NA
#>  6 filters   include3     include3 %in% unique(include3) NA
#>  7 variables ivs          iv1                            NA
#>  8 variables ivs          iv2                            NA
#>  9 variables ivs          iv3                            NA
#> 10 variables dvs          dv1                            NA
#> 11 variables dvs          dv2                            NA
#> 12 models    linear model lm(dv1 ~ iv1 * mod)            NA
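Both forms end up as the same text in the code column. As a rough illustration of how a function can accept either an unquoted call or a string without evaluating it, here is a base R sketch. The helper capture_code() is hypothetical and is not multitool's implementation; it only shows the general non-standard-evaluation idea, and it is not meant to handle the curly-brace templates discussed later.

```r
# Hypothetical helper: return the supplied code as text, whether it was
# passed as an unquoted call or as a character string
capture_code <- function(code) {
  expr <- substitute(code)          # grab the argument without evaluating it
  if (is.character(expr)) {
    expr                            # already a string: return it unchanged
  } else {
    paste(deparse(expr), collapse = " ")  # turn the call back into text
  }
}

unquoted <- capture_code(lm(dv1 ~ iv1 * mod))
quoted   <- capture_code("lm(dv1 ~ iv1 * mod)")

identical(unquoted, quoted)  # TRUE
```

Note that lm() is never run here, which is why the absence of a data argument causes no error at this stage.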
To make sure your add_variables() steps work properly, add_model() was designed to interpret glue::glue() syntax. For example:
the_data |>
  add_filters(include1 == 0, include2 != 3, include3 > -2.5) |>
  add_variables(var_group = "ivs", iv1, iv2, iv3) |>
  add_variables(var_group = "dvs", dv1, dv2) |>
  add_model("linear model", lm({dvs} ~ {ivs} * mod)) # see the {} here
#> # A tibble: 12 × 4
#>    type      group        code                           additional_args
#>    <chr>     <chr>        <chr>                          <lgl>
#>  1 filters   include1     include1 == 0                  NA
#>  2 filters   include1     include1 %in% unique(include1) NA
#>  3 filters   include2     include2 != 3                  NA
#>  4 filters   include2     include2 %in% unique(include2) NA
#>  5 filters   include3     include3 > -2.5                NA
#>  6 filters   include3     include3 %in% unique(include3) NA
#>  7 variables ivs          iv1                            NA
#>  8 variables ivs          iv2                            NA
#>  9 variables ivs          iv3                            NA
#> 10 variables dvs          dv1                            NA
#> 11 variables dvs          dv2                            NA
#> 12 models    linear model lm({dvs} ~ {ivs} * mod)        NA
This allows multitool to insert the correct version of each variable specified in an add_variables() step. Make sure to embrace, in curly braces, the name you gave to the var_group argument of add_variables(), for example add_model("linear model", lm({dvs} ~ {ivs} * mod)).
Here, {dvs} and {ivs} tell multitool to insert the current version of the dvs and ivs into the model.
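The substitution itself can be tried directly with the glue package. The sketch below fills the template for every pairing of outcome and predictor versions; how multitool calls glue internally is not shown in this vignette, so treat the wiring (the grid object and the model_template name) as an assumption made for illustration.

```r
library(glue)

# The model template, with var_group names embraced in curly braces
model_template <- "lm({dvs} ~ {ivs} * mod)"

# All pairings of outcome and predictor versions
grid <- expand.grid(
  dvs = c("dv1", "dv2"),
  ivs = c("iv1", "iv2", "iv3"),
  stringsAsFactors = FALSE
)

# glue() is vectorised over its named inputs, so each row of the grid
# yields one filled-in model string
grid$model <- as.character(
  glue(model_template, dvs = grid$dvs, ivs = grid$ivs)
)

grid$model[1]  # "lm(dv1 ~ iv1 * mod)"
```

A name inside the braces that does not match any supplied variable would make glue() fail, which is why the braces must contain the var_group label exactly as given to add_variables().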
Finalizing the specification blueprint
There are two steps in finalizing your blueprint. The first is to visualize your pipeline with a graph. This is optional, but I think it is helpful.
You can automate making a chart with create_blueprint_graph(). Feed your pipeline to create_blueprint_graph() to see a chart of your multiverse pipeline plan:
full_pipeline <-
  the_data |>
  add_filters(include1 == 0, include2 != 3, include3 > -2.5) |>
  add_variables(var_group = "ivs", iv1, iv2, iv3) |>
  add_variables(var_group = "dvs", dv1, dv2) |>
  add_model("linear model", lm({dvs} ~ {ivs} * mod))

create_blueprint_graph(full_pipeline)
#> No subgroups in your pipeline
#> no descriptives
#> you have no preprocessing steps in your pipeline
#> you have no post processing steps in your pipeline
The final step in making your blueprint is expanding all your specifications into all possible combinations. You can do this by calling expand_decisions() at the end of your blueprint pipeline:
expanded_pipeline <- expand_decisions(full_pipeline)
expanded_pipeline
#> # A tibble: 48 × 4
#>    decision variables        filters          models
#>    <chr>    <list>           <list>           <list>
#>  1 1        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  2 2        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  3 3        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  4 4        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  5 5        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  6 6        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  7 7        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  8 8        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#>  9 9        <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#> 10 10       <tibble [1 × 2]> <tibble [1 × 3]> <tibble [1 × 3]>
#> # ℹ 38 more rows
The result is an expanded tibble with 1 row per unique decision and columns for each major blueprint category. In our example, we have alternative variables (predictors and outcomes), filters (three exclusion alternatives), and a model to run.
Note that we have 3 exclusions (each with two alternatives), 3 versions of our predictor, and 2 versions of our outcome. This means our blueprint should have 2*2*2*3*2 or 48 rows, which corresponds with our expanded pipeline:
2*2*2*3*2 == nrow(expanded_pipeline)
#> [1] TRUE
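The same count can be reproduced without multitool by crossing the alternatives with base R's expand.grid(). This is only a sketch of the combinatorics; the decisions object is hypothetical, and its row order need not match the order produced by expand_decisions().

```r
# Cross every alternative of every decision (filter strings copied from above)
decisions <- expand.grid(
  include1 = c("include1 == 0",   "include1 %in% unique(include1)"),
  include2 = c("include2 != 3",   "include2 %in% unique(include2)"),
  include3 = c("include3 > -2.5", "include3 %in% unique(include3)"),
  ivs      = c("iv1", "iv2", "iv3"),
  dvs      = c("dv1", "dv2"),
  stringsAsFactors = FALSE
)

nrow(decisions)  # 48
```

Adding one more two-level decision would double the row count, which is worth keeping in mind before expanding a large blueprint.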
Our blueprint uses list columns to organize information. You can view each list column by using tidyr::unnest(<column name>). For example, we can look at the filters:
expanded_pipeline |> unnest(filters)
#> # A tibble: 48 × 6
#>    decision variables        include1      include2      include3       models
#>    <chr>    <list>           <chr>         <chr>         <chr>          <list>
#>  1 1        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 > -2… <tibble>
#>  2 2        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 > -2… <tibble>
#>  3 3        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 > -2… <tibble>
#>  4 4        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 > -2… <tibble>
#>  5 5        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 > -2… <tibble>
#>  6 6        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 > -2… <tibble>
#>  7 7        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 %in%… <tibble>
#>  8 8        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 %in%… <tibble>
#>  9 9        <tibble [1 × 2]> include1 == 0 include2 != 3 include3 %in%… <tibble>
#> 10 10       <tibble [1 × 2]> include1 == 0 include2 != 3 include3 %in%… <tibble>
#> # ℹ 38 more rows
Or we could look at the models:
expanded_pipeline |> unnest(models)
#> # A tibble: 48 × 6
#>    decision variables        filters          model        model_meta model_args
#>    <chr>    <list>           <list>           <chr>        <chr>      <chr>
#>  1 1        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv1 ~ iv… linear mo… ""
#>  2 2        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv2 ~ iv… linear mo… ""
#>  3 3        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv1 ~ iv… linear mo… ""
#>  4 4        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv2 ~ iv… linear mo… ""
#>  5 5        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv1 ~ iv… linear mo… ""
#>  6 6        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv2 ~ iv… linear mo… ""
#>  7 7        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv1 ~ iv… linear mo… ""
#>  8 8        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv2 ~ iv… linear mo… ""
#>  9 9        <tibble [1 × 2]> <tibble [1 × 3]> lm(dv1 ~ iv… linear mo… ""
#> 10 10       <tibble [1 × 2]> <tibble [1 × 3]> lm(dv2 ~ iv… linear mo… ""
#> # ℹ 38 more rows
Notice that, with the glue::glue() syntax, different versions of our predictors and outcomes were inserted appropriately. You can check their correspondence by using unnest() on both the models and variables list columns:
expanded_pipeline |> unnest(c(variables, models))
#> # A tibble: 48 × 7
#>    decision ivs   dvs   filters          model             model_meta model_args
#>    <chr>    <chr> <chr> <list>           <chr>             <chr>      <chr>
#>  1 1        iv1   dv1   <tibble [1 × 3]> lm(dv1 ~ iv1 * m… linear mo… ""
#>  2 2        iv1   dv2   <tibble [1 × 3]> lm(dv2 ~ iv1 * m… linear mo… ""
#>  3 3        iv2   dv1   <tibble [1 × 3]> lm(dv1 ~ iv2 * m… linear mo… ""
#>  4 4        iv2   dv2   <tibble [1 × 3]> lm(dv2 ~ iv2 * m… linear mo… ""
#>  5 5        iv3   dv1   <tibble [1 × 3]> lm(dv1 ~ iv3 * m… linear mo… ""
#>  6 6        iv3   dv2   <tibble [1 × 3]> lm(dv2 ~ iv3 * m… linear mo… ""
#>  7 7        iv1   dv1   <tibble [1 × 3]> lm(dv1 ~ iv1 * m… linear mo… ""
#>  8 8        iv1   dv2   <tibble [1 × 3]> lm(dv2 ~ iv1 * m… linear mo… ""
#>  9 9        iv2   dv1   <tibble [1 × 3]> lm(dv1 ~ iv2 * m… linear mo… ""
#> 10 10       iv2   dv2   <tibble [1 × 3]> lm(dv2 ~ iv2 * m… linear mo… ""
#> # ℹ 38 more rows
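If list columns are unfamiliar, the unnesting step can be tried on a small hand-built tibble. The toy object below is a hypothetical two-row imitation of the expanded blueprint's layout (it is not produced by multitool); it only shows how tidyr::unnest() spreads one-row nested tibbles into ordinary columns.

```r
library(tibble)
library(tidyr)

# Hand-built imitation of two blueprint rows with list columns
toy <- tibble(
  decision  = c("1", "2"),
  variables = list(
    tibble(ivs = "iv1", dvs = "dv1"),
    tibble(ivs = "iv1", dvs = "dv2")
  ),
  models = list(
    tibble(model = "lm(dv1 ~ iv1 * mod)"),
    tibble(model = "lm(dv2 ~ iv1 * mod)")
  )
)

# Unnest both list columns at once; each nested tibble has one row,
# so the result keeps one row per decision
flat <- unnest(toy, c(variables, models))

names(flat)  # "decision" "ivs" "dvs" "model"
```

Unnesting several list columns in one call works here because every nested tibble has exactly one row, so the columns line up row by row.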