This vignette introduces the nestedmodels package through its most basic use case. This and all other vignettes assume that you are familiar with the ‘tidymodels’ framework (e.g. from reading Tidy Modeling with R). The vignette does not aim to teach good statistical practice; it demonstrates how to use the package as simply as possible.

A quick example

We’ll work through the most basic example of a nested model. You’ll need the following packages:

library(nestedmodels)
library(tidyr)
library(parsnip)
library(recipes)
library(workflows)
library(rsample)
library(glmnet)

We’re going to use the example data included in the nestedmodels package. The data is very simple, and only serves as an example of data that can be nested, rather than representing anything concrete.

data("example_nested_data")
data <- example_nested_data
data
#> # A tibble: 1,000 × 7
#>       id   id2     x     y     z     a     b
#>    <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1    49  48.5  29.1  44.7 50.0 
#>  2     1     1    50  64.2  29.7  40.2 64.9 
#>  3     1     1    51 -19.4  26.6  43.2 38.0 
#>  4     1     1    52  41.0  28.8  66.4 61.7 
#>  5     1     1    53 -94.2  23.9  18.2 -1.66
#>  6     1     1    54  72.6  30.0  83.8 38.8 
#>  7     1     1    55 -91.5  24.0  91.7 40.7 
#>  8     1     1    56 -50.5  25.5  79.8 55.4 
#>  9     1     1    57  90.3  30.6  50.3 33.8 
#> 10     1     1    58  32.4  28.6  25.4 20.5 
#> # ℹ 990 more rows

The data can be nested in the following way:

nested_data <- nest(data, data = -id)
nested_data
#> # A tibble: 20 × 2
#>       id data             
#>    <int> <list>           
#>  1     1 <tibble [50 × 6]>
#>  2     2 <tibble [50 × 6]>
#>  3     3 <tibble [50 × 6]>
#>  4     4 <tibble [50 × 6]>
#>  5     5 <tibble [50 × 6]>
#>  6     6 <tibble [50 × 6]>
#>  7     7 <tibble [50 × 6]>
#>  8     8 <tibble [50 × 6]>
#>  9     9 <tibble [50 × 6]>
#> 10    10 <tibble [50 × 6]>
#> 11    11 <tibble [50 × 6]>
#> 12    12 <tibble [50 × 6]>
#> 13    13 <tibble [50 × 6]>
#> 14    14 <tibble [50 × 6]>
#> 15    15 <tibble [50 × 6]>
#> 16    16 <tibble [50 × 6]>
#> 17    17 <tibble [50 × 6]>
#> 18    18 <tibble [50 × 6]>
#> 19    19 <tibble [50 × 6]>
#> 20    20 <tibble [50 × 6]>
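
To make the `data = -id` specification concrete, here is a minimal sketch on a made-up tibble (`toy` is hypothetical and not part of the package). It only uses tidyr, and shows that nesting packs every column except ‘id’ into a list-column, and that unnest() reverses the operation.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two ids with three rows each.
toy <- tibble(
  id = rep(1:2, each = 3),
  x = 1:6,
  z = c(2, 4, 6, 1, 0, -1)
)

# `data = -id` packs every column except `id` into a list-column named `data`.
toy_nested <- nest(toy, data = -id)

# One row per id; each element of `data` is a tibble holding that id's rows.
nrow(toy_nested)
dim(toy_nested$data[[1]])

# unnest() expands the list-column back into ordinary rows.
toy_roundtrip <- unnest(toy_nested, data)
all.equal(as.data.frame(toy_roundtrip), as.data.frame(toy))
```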

Let’s split this data into a training set and a testing set using the nested_resamples() function. This ensures that both sets contain data for every ‘id’ value.

split <- nested_resamples(nested_data, rsample::initial_split())
data_tr <- rsample::training(split)
data_tst <- rsample::testing(split)
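
To see why splitting within each nest guarantees that every ‘id’ lands in both sets, the sketch below splits a made-up tibble one nest at a time and binds the pieces back together. This is only a conceptual illustration under our own assumptions (`toy` and `split_one()` are hypothetical); it is not a description of how nested_resamples() is implemented.

```r
library(rsample)
library(tidyr)
library(tibble)

set.seed(123)

# Hypothetical toy data: three ids with 20 rows each.
toy <- tibble(id = rep(1:3, each = 20), x = rnorm(60), z = rnorm(60))
toy_nested <- nest(toy, data = -id)

# Split a single nest and put its id back on both halves.
split_one <- function(id, d) {
  s <- initial_split(d)
  tr <- training(s)
  te <- testing(s)
  tr$id <- id
  te$id <- id
  list(train = tr, test = te)
}

# Split each nest on its own, then bind the per-nest pieces together.
parts <- Map(split_one, toy_nested$id, toy_nested$data)
toy_tr <- do.call(rbind, lapply(parts, `[[`, "train"))
toy_tst <- do.call(rbind, lapply(parts, `[[`, "test"))

# Both sets contain every id.
setequal(toy_tr$id, toy$id)
setequal(toy_tst$id, toy$id)
```

A single initial_split() on the un-nested data gives no such guarantee, since a small group could fall entirely into one set by chance.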

Now let’s define our model:

model <- linear_reg(penalty = 0.1) %>%
  set_engine("glmnet")

Since we’re fitting this model to nested data, we need some way to make the model ‘nested’. This is very simple with the nested() function.

nested_model <- model %>%
  nested()
nested_model
#> Nested Model Specification
#> 
#> Inner model:
#> Linear Regression Model Specification (regression)
#> 
#> Main Arguments:
#>   penalty = 0.1
#> 
#> Computational engine: glmnet

We can then fit this model in the usual way. Note that the data must be nested, and the formula cannot include the ‘id’ column.

nested_tr <- tidyr::nest(data_tr, data = -id)
model_fit <- fit(nested_model, z ~ x + y + a + b, nested_tr)
model_fit
#> Nested model fit, with 20 inner models
#> # A tibble: 20 × 2
#>       id .model_fit
#>    <int> <list>    
#>  1     1 <fit[+]>  
#>  2     2 <fit[+]>  
#>  3     3 <fit[+]>  
#>  4     4 <fit[+]>  
#>  5     5 <fit[+]>  
#>  6     6 <fit[+]>  
#>  7     7 <fit[+]>  
#>  8     8 <fit[+]>  
#>  9     9 <fit[+]>  
#> 10    10 <fit[+]>  
#> 11    11 <fit[+]>  
#> 12    12 <fit[+]>  
#> 13    13 <fit[+]>  
#> 14    14 <fit[+]>  
#> 15    15 <fit[+]>  
#> 16    16 <fit[+]>  
#> 17    17 <fit[+]>  
#> 18    18 <fit[+]>  
#> 19    19 <fit[+]>  
#> 20    20 <fit[+]>
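
The printed fit holds one inner model per nest. As a rough mental model, and not the package’s actual implementation, the same shape can be produced by hand by fitting one model to each element of a nested list-column. The sketch below uses made-up data and plain lm() instead of glmnet to keep it small; `toy` and the coefficients used to generate it are assumptions for illustration.

```r
library(tidyr)
library(tibble)

set.seed(42)

# Hypothetical toy data: two ids whose z ~ x relationship differs.
toy <- tibble(
  id = rep(1:2, each = 30),
  x = rep(1:30, times = 2),
  z = c(2 * (1:30), -3 * (1:30)) + rnorm(60)
)
toy_nested <- nest(toy, data = -id)

# Fit one ordinary lm() per nest and store it next to its id.
toy_nested$.model_fit <- lapply(
  toy_nested$data,
  function(d) lm(z ~ x, data = d)
)

# Each nest gets its own slope estimate.
slopes <- sapply(toy_nested$.model_fit, function(m) coef(m)[["x"]])
slopes
```

This is why the formula leaves out ‘id’: the grouping column decides which model a row belongs to, rather than acting as a predictor inside any one model.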

Predicting can also be done in the usual way, and the data to predict on can be either nested or non-nested. Here we predict on the (non-nested) testing set.

predict(model_fit, data_tst)
#> # A tibble: 260 × 1
#>    .pred
#>    <dbl>
#>  1  31.2
#>  2  27.0
#>  3  25.6
#>  4  41.7
#>  5  28.9
#>  6  27.1
#>  7  17.5
#>  8  27.3
#>  9  27.3
#> 10  26.4
#> # ℹ 250 more rows

This method is fine, but having to nest the data ourselves is a pain. We can solve this by using a workflow.

We first define a recipe and add a step that nests the data. This time, the formula can include the ‘id’ column, since the recipe needs to act on it.

recipe <- recipe(data_tr, z ~ x + y + a + b + id) %>%
  step_nest(id)

This is a little easier than nesting the data manually. Note that the recipe does not actually nest the data, but instead removes the specified columns and adds a new column, ‘.nest_id’, which specifies which nest each row belongs to.

recipe %>%
  prep() %>%
  bake(NULL)
#> # A tibble: 740 × 6
#>        x      y     a      b     z .nest_id
#>    <int>  <dbl> <dbl>  <dbl> <dbl> <fct>   
#>  1    50  64.2   40.2 64.9   29.7  Nest 1  
#>  2    74  75.8   98.7 57.2   38.8  Nest 1  
#>  3    85  -8.74  52.4 43.3   53.3  Nest 1  
#>  4    57  90.3   50.3 33.8   30.6  Nest 1  
#>  5    73 -67.2   31.3  5.80  33.6  Nest 1  
#>  6    92  39.9   77.3 99.6    3.31 Nest 1  
#>  7    52  41.0   66.4 61.7   28.8  Nest 1  
#>  8    65  94.6   54.8 74.7   22.9  Nest 1  
#>  9    77 -18.8   13.8 51.9   52.9  Nest 1  
#> 10    86 104.    63.8 -0.387 57.4  Nest 1  
#> # ℹ 730 more rows
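
To illustrate the idea of replacing a grouping column with a single nest label, here is a small base R sketch. It is an assumption about the general shape of the result only; the way step_nest() actually builds and orders the ‘.nest_id’ labels may differ.

```r
# Hypothetical grouping values; note that they need not be consecutive.
id <- c(1, 1, 2, 2, 2, 5)

# Number the nests in order of first appearance.
nest_index <- match(id, unique(id))

# Turn the index into a factor label, one level per nest.
nest_levels <- paste("Nest", seq_along(unique(id)))
nest_id <- factor(paste("Nest", nest_index), levels = nest_levels)

nest_id
```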

Now we create the workflow, combining the recipe and the model.

wf <- workflow() %>%
  add_model(nested_model) %>%
  add_recipe(recipe)

A workflow can be fitted in the same way as a model, but note that since we used step_nest(), the data does not have to be nested.

wf_fit <- fit(wf, data_tr)

This fit object can then be used to make predictions.

predict(wf_fit, data_tst)
#> # A tibble: 260 × 1
#>    .pred
#>    <dbl>
#>  1  31.2
#>  2  27.0
#>  3  25.6
#>  4  41.7
#>  5  28.9
#>  6  27.1
#>  7  17.5
#>  8  27.3
#>  9  27.3
#> 10  26.4
#> # ℹ 250 more rows

Other common tidymodels functions, such as augment() and tidy(), can also be used on fitted nested models and workflows:

augment(wf_fit, data_tst)
#> # A tibble: 260 × 8
#>       id   id2     x      y     z     a     b .pred
#>    <int> <int> <int>  <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1    51 -19.4   26.6 43.2   38.0  31.2
#>  2     1     1    55 -91.5   24.0 91.7   40.7  27.0
#>  3     1     1    56 -50.5   25.5 79.8   55.4  25.6
#>  4     1     1    62 109.    23.4  5.23  19.8  41.7
#>  5     1     1    63   1.35  19.6 38.2   43.6  28.9
#>  6     1     1    66  46.0   21.2 30.4   60.6  27.1
#>  7     1     2    76 -37.7   52.2 60.8   72.2  17.5
#>  8     1     2    78  32.9   54.7 87.1   61.1  27.3
#>  9     1     2    80 129.    58.2 79.9   87.5  27.3
#> 10     1     2    81  84.9   56.7  2.82  58.2  26.4
#> # ℹ 250 more rows
tidy(wf_fit)
#> # A tibble: 100 × 4
#>    .nest_id term         estimate penalty
#>    <fct>    <chr>           <dbl>   <dbl>
#>  1 Nest 1   (Intercept)  49.0         0.1
#>  2 Nest 1   x            -0.181       0.1
#>  3 Nest 1   y             0.0798      0.1
#>  4 Nest 1   a             0.0621      0.1
#>  5 Nest 1   b            -0.256       0.1
#>  6 Nest 2   (Intercept) -84.2         0.1
#>  7 Nest 2   x             0.701       0.1
#>  8 Nest 2   y            -0.00725     0.1
#>  9 Nest 2   a            -0.0532      0.1
#> 10 Nest 2   b            -0.0261      0.1
#> # ℹ 90 more rows
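
Because tidy() returns one row per nest and term, comparing coefficients across nests is often easier in wide form. The sketch below retypes the ten rows printed above into a tibble by hand (so it runs without the fitted workflow) and reshapes them with tidyr; with the real object, `coefs` would simply be the result of tidy(wf_fit).

```r
library(tidyr)
library(tibble)

# The first ten rows of the tidy() output shown above, retyped by hand.
coefs <- tibble(
  .nest_id = factor(rep(c("Nest 1", "Nest 2"), each = 5)),
  term = rep(c("(Intercept)", "x", "y", "a", "b"), times = 2),
  estimate = c(
    49.0, -0.181, 0.0798, 0.0621, -0.256,
    -84.2, 0.701, -0.00725, -0.0532, -0.0261
  ),
  penalty = 0.1
)

# One row per nest, one column per term.
coefs_wide <- pivot_wider(
  coefs,
  id_cols = .nest_id,
  names_from = term,
  values_from = estimate
)
coefs_wide
```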

This is all you really need to know to use the nestedmodels package. Nested models and workflows can be compared, fitted and tuned in much the same way as normal models and workflows. You can even combine them with normal models using the workflowsets and stacks packages.
