Implementing the model

Rather than trying to include every relevant variable and relationship at once, it is often better to build a very simplified version of the model first and then extend it step by step.

Part 1: Adding vaccination, covid and sickness

The most important variables for us are the vaccination, the Covid-19 infection and the sickness. Each of them has a certain probability of occurring at each point in time. Once an event occurs, it lasts for some duration (e.g. someone being sick for two weeks). After the event is over, there is usually some period in which the person is “immune” to experiencing the event again. This is a perfect case for using a time-dependent node of type "time_to_event".

We start out modeling every one of these variables as completely independent of each other using the following DAG:

library(data.table)
library(ggplot2)
library(simDAG)

dag <- empty_dag() +
  node_td("vaccination", type="time_to_event", prob_fun=0.001,
          event_duration=21, immunity_duration=Inf) +
  node_td("covid", type="time_to_event", prob_fun=0.001, event_duration=30,
          immunity_duration=80) +
  node_td("sickness", type="time_to_event", prob_fun=0.0001,
          event_duration=2, immunity_duration=2)

In the DAG above, we supplied a constant value to each of the prob_fun arguments, indicating that each event has the same probability of occurring on each day, regardless of time and of the other variables. We set the event_duration of vaccination to 21 because we want to model the period after vaccination in which the risk of the adverse side effect (i.e. the sickness) is higher than usual, which we will make use of later on. By setting the immunity_duration of the vaccination to Inf, we currently allow each person to receive only one vaccination over the entire simulation. The sickness is allowed to occur again directly after it is over.
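To get a feel for what a constant daily probability implies, the following base-R sketch computes the chance that an event occurs at least once within a given number of days. It assumes independent daily draws and ignores immunity periods and any influence of other variables. The helper name p_at_least_once is hypothetical and not part of simDAG.

```r
# Hypothetical helper (not part of simDAG): probability that an event with a
# constant daily probability p occurs at least once within n_days, assuming
# independent daily draws and no immunity periods.
p_at_least_once <- function(p, n_days) {
  1 - (1 - p)^n_days
}

# daily probability of the vaccination node, over an 800-day horizon
p_vacc_800 <- p_at_least_once(p=0.001, n_days=800)
round(p_vacc_800, 2)
```

Under these simplifying assumptions, a daily probability of 0.001 means that a bit more than half of all individuals would be vaccinated at some point within 800 days.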

Part 2: Adding adverse effects of vaccination and covid

We can make this data-generation process a little more interesting by making both the vaccination and covid have an effect on the probability of developing the sickness. We will do this by simply raising the probability of occurrence of the sickness by a constant factor whenever either a covid or vaccination event is currently happening. This can be done by formulating an appropriate prob_fun for the sickness node:

prob_sickness <- function(data, rr_covid, rr_vacc, base_p) {

  # multiply base probability by relevant RRs
  p <- base_p * rr_vacc^(data$vaccination_event) * rr_covid^(data$covid_event)

  return(p)
}

This works because any number to an exponent of 1 is itself, while any number to an exponent of 0 is one. The vaccination_event and covid_event columns are always either TRUE (when an event is currently happening) or FALSE (when no event is currently happening), which are interpreted as 1 and 0 by R. Let’s update our DAG:

dag <- empty_dag() +
  node_td("vaccination", type="time_to_event", prob_fun=0.001,
          event_duration=21, immunity_duration=Inf) +
  node_td("covid", type="time_to_event", prob_fun=0.001, event_duration=30,
          immunity_duration=80) +
  node_td("sickness", type="time_to_event", prob_fun=prob_sickness,
          parents=c("vaccination_event", "covid_event"),
          base_p=0.0001, rr_covid=3.5, rr_vacc=3.24,
          event_duration=2, immunity_duration=2)

Instead of passing a constant value to the prob_fun argument, we are now passing it the previously defined function. Because our function has base_p, rr_covid and rr_vacc as arguments without defaults, we have to specify those in the node_td call as well. We keep the original base_p, and set the relative risks to 3.5 and 3.24 respectively. Additionally, we have to set both the vaccination_event and the covid_event columns as parents now, because they are used in the prob_sickness function.
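The exponent trick used in prob_sickness can be checked in isolation, outside of any simulation. The sketch below restates the function so that it runs on its own and applies it to a small hand-made data.frame (the object name toy is only used for illustration) covering all four combinations of the two parent columns.

```r
# Restated from above so that the snippet runs on its own
prob_sickness <- function(data, rr_covid, rr_vacc, base_p) {
  base_p * rr_vacc^(data$vaccination_event) * rr_covid^(data$covid_event)
}

# toy data covering all four combinations of the two parent columns
toy <- data.frame(vaccination_event=c(FALSE, TRUE, FALSE, TRUE),
                  covid_event=c(FALSE, FALSE, TRUE, TRUE))

# one probability per row, using the same values as in the DAG
toy$p <- prob_sickness(toy, rr_covid=3.5, rr_vacc=3.24, base_p=0.0001)
toy
```

The first row keeps the base probability of 0.0001, the rows with a single active event are multiplied by the corresponding relative risk, and the last row is multiplied by both.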

Part 3: Making the vaccine useful

So far we assumed that the covid infection probability is unaffected by whether the person received the vaccine or not. We will now change this by implementing a time-window after receiving the vaccine in which the person cannot develop a covid infection. Again, this can be done by defining an appropriate prob_fun function, this time for the covid node:

prob_covid <- function(data, base_p, vacc_duration) {
  
  p <- fifelse(data$vaccination_time_since_last < vacc_duration,
               0, base_p, na=base_p)
  return(p)
}

In this function we use the column vaccination_time_since_last, which is a column that can optionally be created in time-to-event nodes by setting time_since_last to TRUE. So let’s again update our DAG accordingly:

dag <- empty_dag() +
  node_td("vaccination", type="time_to_event", prob_fun=0.001,
          event_duration=21, immunity_duration=Inf,
          time_since_last=TRUE) +
  node_td("covid", type="time_to_event", prob_fun=prob_covid,
          parents=c("vaccination_time_since_last"),
          base_p=0.001, vacc_duration=80, event_duration=30,
          immunity_duration=80) +
  node_td("sickness", type="time_to_event", prob_fun=prob_sickness,
          parents=c("vaccination_event", "covid_event"),
          base_p=0.0001, rr_covid=3.5, rr_vacc=3.24,
          event_duration=2, immunity_duration=2)

In addition to updating the parents and prob_fun arguments of the covid node, we also had to set the time_since_last argument of the vaccination node to TRUE to obtain the required additional column. Our data-generation algorithm is getting more realistic, but there is still a lot we can do.
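To see how prob_covid treats different values of vaccination_time_since_last, it can be called directly on a small hand-made data.table. The sketch below restates the function so that it runs on its own. The values in toy are made up, and reading NA as "not vaccinated so far" is an assumption based on the na=base_p argument used in the function.

```r
library(data.table)

# Restated from above so that the snippet runs on its own
prob_covid <- function(data, base_p, vacc_duration) {
  fifelse(data$vaccination_time_since_last < vacc_duration,
          0, base_p, na=base_p)
}

# made-up values: NA is assumed to mean "not vaccinated so far",
# the other values are days since the last vaccination
toy <- data.table(vaccination_time_since_last=c(NA, 0, 79, 80, 200))

p <- prob_covid(toy, base_p=0.001, vacc_duration=80)
p
```

Because the comparison is strict, the probability is 0 for values below 80 and returns to base_p from a value of exactly 80 onwards. Missing values also receive base_p.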

Part 4: Sick people don’t get vaccinated

In reality, very few people who were experiencing a Covid-19 infection went and got the vaccine, and doing so is generally discouraged by doctors. To add this circumstance to the model, we once again only have to update the probability of receiving a vaccination by defining an appropriate prob_fun:

prob_vaccination <- function(data, base_p) {
  
  p <- fifelse(data$covid_event, 0, base_p)
  
  return(p)
}

Using this function, the probability of getting vaccinated for any individual that is currently experiencing a covid infection is 0. Let’s update our DAG one more time to include these changes:

dag <- empty_dag() +
  node_td("vaccination", type="time_to_event",
          prob_fun=prob_vaccination,
          parents=c("covid_event"), base_p=0.001,
          event_duration=21, immunity_duration=Inf,
          time_since_last=TRUE) +
  node_td("covid", type="time_to_event", prob_fun=prob_covid,
          parents=c("vaccination_time_since_last"),
          base_p=0.001, vacc_duration=80, event_duration=30,
          immunity_duration=80) +
  node_td("sickness", type="time_to_event", prob_fun=prob_sickness,
          parents=c("vaccination_event", "covid_event"),
          base_p=0.0001, rr_covid=3.5, rr_vacc=3.24,
          event_duration=2, immunity_duration=2)

Again we simply changed the prob_fun argument and added the correct parents to the appropriate node. Our final “DAG” looks like this:

plot(dag, mark_td_nodes=FALSE)

Note that in this plot it doesn’t look like a classic DAG anymore, because it has a bi-directional arrow between covid and vaccination due to the time-dependent nature of their relationship.

Generating data using the final model

Suppose we are now pleased with the complexity of our data-generation algorithm and want to simulate data from it. We can do this by simply calling the sim_discrete_time() function on the specified DAG:

set.seed(42)
sim <- sim_discrete_time(dag, n_sim=1000, max_t=800)
summary(sim)
#> A simDT object with:
#>   -  1000  observations
#>   -  800  distinct points in time
#>   -  3  time-varying variables in total
#>   -  3  time_to_event nodes
#>   -  0  competing_events nodes
#> Only the last state of the simulation was saved.

For illustration, we somewhat arbitrarily used 1000 individuals and let the simulation run for 800 days. By calling the plot() method, we get a concise overview of the process we simulated:

plot(sim, box_text_size=4)

A more useful output of the resulting data can be obtained using the sim2data() function. For example, we could transform the output to the start-stop format:

sim2data(sim, to="start_stop")
#>         .id start  stop vaccination  covid sickness
#>       <int> <int> <num>      <lgcl> <lgcl>   <lgcl>
#>    1:     1     1   178       FALSE  FALSE    FALSE
#>    2:     1   179   199        TRUE  FALSE    FALSE
#>    3:     1   200   800       FALSE  FALSE    FALSE
#>    4:     2     1   501       FALSE  FALSE    FALSE
#>    5:     2   502   531       FALSE   TRUE    FALSE
#>   ---                                              
#> 3466:  1000     1    47       FALSE  FALSE    FALSE
#> 3467:  1000    48    49       FALSE  FALSE     TRUE
#> 3468:  1000    50   131       FALSE  FALSE    FALSE
#> 3469:  1000   132   152        TRUE  FALSE    FALSE
#> 3470:  1000   153   800       FALSE  FALSE    FALSE
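Data in this format is easy to summarise. The sketch below re-types the first three rows of the output above by hand, reduced to the vaccination column, and computes the number of days spent in each vaccination state. It treats both start and stop as inclusive, which is an assumption, although it agrees with the vaccinated interval from day 179 to day 199 covering 21 days.

```r
# First three rows of the output above, re-typed by hand for illustration
d <- data.frame(.id=c(1, 1, 1),
                start=c(1, 179, 200),
                stop=c(178, 199, 800),
                vaccination=c(FALSE, TRUE, FALSE))

# interval length, treating both endpoints as inclusive (an assumption)
d$days <- d$stop - d$start + 1

# total number of days spent in each vaccination state
aggregate(days ~ vaccination, data=d, FUN=sum)
```

For this individual the interval lengths add up to the 800 simulated days, 21 of which are spent in the vaccinated state.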

As can be seen, we managed to implement a fairly complex data-generation mechanism using only a few small function definitions and a few lines of code. This allowed us to generate a dataset with three interdependent time-varying variables with minimal effort.

Going even further

There is no need to stop here. We could make this simulation model even more complex by implementing any of the following things:

Adding time-dependent base-probabilities for vaccination, covid and sickness
Adding different kinds of vaccinations, perhaps with different effects on covid and/or sickness
Adding time-fixed variables such as sex which have an effect on any of the other variables
Allowing multiple vaccinations
Changing the constant raising of the probabilities in the form of a relative risk to a more realistic non-linear time-dependent relative risk

There are of course many more possible extensions, all of which can be implemented by augmenting the respective prob_fun arguments and updating the DAG accordingly. In fact, that is exactly what we did in the real Monte Carlo simulation we conducted, where we used empirical data to model time-dependent base probabilities and more. How much complexity you really need is completely up to you. We hope that the simDAG package can help you with whatever you need.
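As a rough sketch of the first extension, a prob_fun could let the base probability vary over time. The function below is hypothetical, including its name and its amplitude and period arguments, and it is not the function used in our actual simulation. It also assumes that the current simulation time is made available to the function through an argument named sim_time, which should be checked against the simDAG documentation before use.

```r
# Hypothetical prob_fun with a seasonally varying base probability.
# Assumption: the current simulation time is available as `sim_time`.
prob_covid_seasonal <- function(data, sim_time, base_p, amplitude, period) {
  # scale base_p up and down over one period;
  # an amplitude between 0 and 1 keeps the result non-negative
  p_t <- base_p * (1 + amplitude * sin(2 * pi * sim_time / period))
  # same probability for every individual at this point in time
  rep(p_t, nrow(data))
}

# called directly on toy data with three individuals
toy <- data.frame(.id=1:3)
p_start <- prob_covid_seasonal(toy, sim_time=0, base_p=0.001,
                               amplitude=0.5, period=365)
p_peak <- prob_covid_seasonal(toy, sim_time=365/4, base_p=0.001,
                              amplitude=0.5, period=365)
```

With these made-up values the probability equals base_p at the start and is 50% higher a quarter of a period later. Such a function could then be combined with the vaccination window from prob_covid.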

Simulating Covid-19 Vaccine Data using a Discrete-Time Simulation

Robin Denz

Introduction

How to get started

1.) Formulate the goal of your research project in a detailed fashion.

2.) Build a theoretical model of the system you want to simulate.

3.) Identify the parts of the system that you are most interested in.

4.) Obtain and analyze real data.

5.) Simulate data for \(t = 0\) (if needed).

6.) Write functions for each time-varying node, one at a time.

7.) Inspect the resulting data for inconsistencies.

Our research goal and the theoretical model

Research goal

Theoretical model