Before getting started, make sure that you have installed the package.
To install the package from CRAN:
To install the package from Github:
If the package is installed, make sure that you load it:
control_group argumentDepending on empirical context, the parallel-trends assumption may only hold for one of these possible control groups:
"never-treated": These are the control units that are
never observed receiving the treatment within the sampling time
frame."future-treated": These are the control units that are
observed receiving the treatment within the sampling time frame."all": These include both the “never-treated” and
“future-treated” groups.DiDforBigData makes it easy to choose among these
possibilities by setting the control_group argument of the
DiD function.
We start by simulating some data to work with:
sim = SimDiD(seed=123, sample_size = 1000)
simdata = sim$simdata
print(simdata)
#>          id year cohort         Y
#>     1:    1 2003   2012  9.556685
#>     2:    1 2004   2012 10.838554
#>     3:    1 2005   2012 10.639781
#>     4:    1 2006   2012 10.768935
#>     5:    1 2007   2012 10.613825
#>    ---                           
#> 10996: 1000 2009   2010  6.640274
#> 10997: 1000 2010   2010 10.571372
#> 10998: 1000 2011   2010 11.282597
#> 10999: 1000 2012   2010 13.403816
#> 11000: 1000 2013   2010 12.542082We can check what the true ATT is in the simulated data. In particular, let’s focus on the ATT for Cohort 2007:
print(sim$true_ATT[cohort==2007])
#>    cohort event ATTge
#> 1:   2007     0     1
#> 2:   2007     1     2
#> 3:   2007     2     3
#> 4:   2007     3     4
#> 5:   2007     4     5
#> 6:   2007     5     6
#> 7:   2007     6     7We also set up the variable names as a list so that
DiDforBigData knows which variable is the outcome, which
variable is the treatment cohort, etc.:
varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"Now, we are ready to estimate DiD. We focus on the ATT for the 2007
cohort at event times -3,…,5. Initially, the control group is not set,
so the function will use the default option
control_group = "all":
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime     ATTge   ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 0.2212255 0.10491102      751
#> 2:   2007        -2        -1         2005 0.1274586 0.10260112      751
#> 3:   2007        -1        -1         2006 0.0000000 0.00000000      751
#> 4:   2007         0        -1         2007 1.3464834 0.11215961      751
#> 5:   2007         1        -1         2008 2.2323825 0.09443787      751
#> 6:   2007         2        -1         2009 3.3705987 0.10350708      751
#> 7:   2007         3        -1         2010 4.2668717 0.10723955      501
#> 8:   2007         4        -1         2011 5.3241860 0.10738958      501
#> 9:   2007         5        -1         2012 6.1964127 0.13611288      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249We now consider changing the estimation to only use the “never-treated” control group:
did = DiD(inputdata = simdata, varnames = varnames, 
          control_group = "never-treated", min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime       ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004  0.10444344 0.1269592      251
#> 2:   2007        -2        -1         2005 -0.05942059 0.1298412      251
#> 3:   2007        -1        -1         2006  0.00000000 0.0000000      251
#> 4:   2007         0        -1         2007  1.27025637 0.1322487      251
#> 5:   2007         1        -1         2008  2.15386918 0.1171451      251
#> 6:   2007         2        -1         2009  3.35116597 0.1288344      251
#> 7:   2007         3        -1         2010  4.11246994 0.1195819      251
#> 8:   2007         4        -1         2011  5.34013939 0.1225179      251
#> 9:   2007         5        -1         2012  6.19641269 0.1361129      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249Note that the sample size for the control group,
Ncontrol, is now substantially lower. Finally, we consider
changing this to only use the “future-treated” group:
did = DiD(inputdata = simdata, varnames = varnames, 
          control_group = "future-treated", min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime     ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 0.2798501 0.1117079      500
#> 2:   2007        -2        -1         2005 0.2212720 0.1081106      500
#> 3:   2007        -1        -1         2006 0.0000000 0.0000000      500
#> 4:   2007         0        -1         2007 1.3847494 0.1183855      500
#> 5:   2007         1        -1         2008 2.2717963 0.1022833      500
#> 6:   2007         2        -1         2009 3.3803539 0.1094793      500
#> 7:   2007         3        -1         2010 4.4218911 0.1294090      250
#> 8:   2007         4        -1         2011 5.3081689 0.1292680      250
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249We see that, because there are no future-treated cohorts left at event time +5, it is no longer possible to provide an ATT estimate at event time +5 if using the future-treated control group.
base_event argumentIn DiD designs, the researcher must choose a base pre-period against
which differences are measured. A natural choice is
base_event = -1, which means to use the time period just
before treatment onset at event time 0.
As discussed by Callaway & Sant’Anna (2021), one possibility is
that the treatment group begins responding to treatment prior to
treatment onset, which is called “anticipation.” If anticipation begins
at period -1, then differences measured relative to
base_event = -1 yield inconsistent DiD estimates.
Fortunately, there is an easy solution: choose a base pre-period from
before anticipation began. If it seems reasonable to assume that the
treatment group could not have anticipated treatment 3 years before
treatment onset, then base_event = -3 should be free of
anticipation.
We can simulate data that is subject to anticipation using the
anticipation argument in the SimDiD
function:
sim = SimDiD(seed=123, sample_size = 200, anticipation = 2)
simdata = sim$simdata
print(simdata)
#>        id year cohort         Y
#>    1:   1 2003    Inf 11.238356
#>    2:   1 2004    Inf 11.520443
#>    3:   1 2005    Inf 13.252991
#>    4:   1 2006    Inf 11.940777
#>    5:   1 2007    Inf 13.120440
#>   ---                          
#> 2196: 200 2009    Inf 10.250944
#> 2197: 200 2010    Inf  8.812227
#> 2198: 200 2011    Inf  9.926592
#> 2199: 200 2012    Inf 11.188476
#> 2200: 200 2013    Inf  9.270495Let’s focus on the average ATT across all cohorts. There are two periods of treatment effects prior to treatment, which we can verify by checking the true ATT from the simulation:
print(sim$true_ATT[cohort=="Average"])
#>     cohort event    ATTge
#> 1: Average    -2 0.500000
#> 2: Average    -1 0.500000
#> 3: Average     0 1.503356
#> 4: Average     1 2.503356
#> 5: Average     2 3.252525
#> 6: Average     3 4.252525
#> 7: Average     4 5.000000
#> 8: Average     5 6.000000
#> 9: Average     6 7.000000We set up the varnames to prepare for estimation:
varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"If we estimate DiD using the default argument that
base_event = -1, the estimate will be biased and
inconsistent:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=3)
print(did$results_average)
#>    EventTime BaseEvent        ATTe   ATTe_SE Ncontrol Ntreated
#> 1:        -3        -1 -0.52346600 0.1534229      303      149
#> 2:        -2        -1 -0.04556457 0.1380282      303      149
#> 3:        -1        -1  0.00000000 0.0000000      303      149
#> 4:         0        -1  0.75946293 0.1495045      303      149
#> 5:         1        -1  1.80044696 0.1392958      303      149
#> 6:         2        -1  2.50749407 0.1718781      202       99
#> 7:         3        -1  3.33212214 0.1917002      152       99We now set base_event = -3 to avoid the anticipation at
-1 and -2:
did = DiD(inputdata = simdata, varnames = varnames, 
          base_event = -3, min_event = -3, max_event=3)
print(did$results_average) 
#>    EventTime BaseEvent      ATTe   ATTe_SE Ncontrol Ntreated
#> 1:        -3        -3 0.0000000 0.0000000      303      149
#> 2:        -2        -3 0.4779014 0.1521152      303      149
#> 3:        -1        -3 0.5234660 0.1534229      303      149
#> 4:         0        -3 1.2829289 0.1441747      303      149
#> 5:         1        -3 2.3239130 0.1485972      303      149
#> 6:         2        -3 2.9653357 0.1643872      202       99
#> 7:         3        -3 3.8282455 0.1773802      152       99We see that the estimate is now close to the true ATT.
covariate_names entryWe sometimes worry that treatment cohorts are selected on time-varying observables, and those time-varying observables also directly affect the outcome. If the growth rate in the observables differs between the treatment and control groups, it creates a violation of the parallel-trends assumption: the treatment and control groups would have experienced different growth profiles in the absence of treatment due to their different observables.
Fortunately, this is easy to fix: Since the confounding variables are
observable, we just need to control for those observables.
DiDforBigData requires only that the list of variable
names, varnames, is modified to include a
covariate_names entry. For example,
varnames$covariate_names = c("X1","X2") tells it to control
linearly for time-variation in X1 and X2.
There is an option in SimDiD() to add a couple of
covariates.
sim = SimDiD(sample_size=1000, time_covars=TRUE)
simdata = sim$simdata
print(simdata)
#>          id year cohort         Y          X1           X2
#>     1:    1 2003   2007 11.763806 -5.28713552 -0.546247081
#>     2:    1 2004   2007 14.216536 -3.16836592  0.882383754
#>     3:    1 2005   2007 12.898204 -3.52698674 -0.007446627
#>     4:    1 2006   2007 14.117949 -3.83264390 -0.307454720
#>     5:    1 2007   2007  9.932859  0.04643156  0.153863150
#>    ---                                                    
#> 10996: 1000 2009   2010 10.096541 -0.96662668 -0.375274801
#> 10997: 1000 2010   2010  6.679059  4.38379275  0.229111354
#> 10998: 1000 2011   2010  8.328832  4.11630763  0.242690466
#> 10999: 1000 2012   2010  9.171007  3.05442888  2.269272560
#> 11000: 1000 2013   2010  9.044345  4.46350435  0.510888022
print(sim$true_ATT[cohort==2007])
#>    cohort event ATTge
#> 1:   2007     0     1
#> 2:   2007     1     2
#> 3:   2007     2     3
#> 4:   2007     3     4
#> 5:   2007     4     5
#> 6:   2007     5     6
#> 7:   2007     6     7There are two covariates, X1 and X2. In particular, X2 differs in growth rates across treatment cohorts, which means that it causes violations of parallel trends if not controlled.
We set up the varnames to prepare for estimation,
ignoring the covariates for now:
varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"We verify that DiD gives the wrong answer for cohort 2007:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime     ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 0.1677000 0.1877317      751
#> 2:   2007        -2        -1         2005 0.2901911 0.1778623      751
#> 3:   2007        -1        -1         2006 0.0000000 0.0000000      751
#> 4:   2007         0        -1         2007 1.1509604 0.1816170      751
#> 5:   2007         1        -1         2008 2.8765749 0.1862649      751
#> 6:   2007         2        -1         2009 4.3001933 0.1904476      751
#> 7:   2007         3        -1         2010 5.4489463 0.2012098      501
#> 8:   2007         4        -1         2011 6.8211665 0.1907504      501
#> 9:   2007         5        -1         2012 7.9496252 0.2256687      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249To control linearly for X1 and X2, we just need to add them to the
covariate_names argument of varnames:
Now we check DiD with controls for time-variation in X1 and X2:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime      ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 -0.1225027 0.1074379      751
#> 2:   2007        -2        -1         2005  0.1021934 0.1042450      751
#> 3:   2007        -1        -1         2006  0.0000000 0.0000000      751
#> 4:   2007         0        -1         2007  0.9427569 0.1066192      751
#> 5:   2007         1        -1         2008  2.0692206 0.1117700      751
#> 6:   2007         2        -1         2009  2.8921399 0.1144014      751
#> 7:   2007         3        -1         2010  3.9086404 0.1233646      501
#> 8:   2007         4        -1         2011  5.1380269 0.1275067      501
#> 9:   2007         5        -1         2012  5.9657885 0.1503580      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249We see that the bias has been removed thanks to the control variables.
cluster_names entryBy default, this package always provides heteroskedasticity-robust standard errors. However, in difference-in-differences applications, it is often the case that treatment is assigned to groups of individuals (e.g., a change in state-wide policy treats all individuals in a state simultaneously). If those groups are also subject to common shocks, this induces correlation in the estimation errors within cluster, and standard errors will tend to be too small.
Fortunately, this is easy to fix: If the groups within which the
estimation errors are correlated are known to the researcher, we just
need to cluster standard errors by group. DiDforBigData
requires only that the list of variable names, varnames, is
modified to include a cluster_names entry. For example,
varnames$cluster_names = c("group1","group2") tells it to
use multi-way clustering in a way that accounts for common shocks to
each of observable groups, “group1” and “group2”.
Note: When estimating a regression that combines multiple treatment
cohorts and/or multiple event times, it is necessary to always cluster
on unit (individual). DiDforBigData adds this clustering by
default.
There is an option in SimDiD() using
clusters=TRUE to group individuals into bins that are
differentially selected for treatment and that also face common shocks
within each bin:
sim = SimDiD(sample_size=1000, clusters = TRUE)
simdata = sim$simdata
print(simdata)
#>          id year cohort         Y cluster
#>     1:    1 2003   2007  8.266417       9
#>     2:    1 2004   2007 12.657646       9
#>     3:    1 2005   2007  9.373441       9
#>     4:    1 2006   2007 10.851528       9
#>     5:    1 2007   2007  9.306055       9
#>    ---                                   
#> 10996: 1000 2009   2010 10.317600       6
#> 10997: 1000 2010   2010  9.623423       6
#> 10998: 1000 2011   2010 11.509660       6
#> 10999: 1000 2012   2010  9.718423       6
#> 11000: 1000 2013   2010 11.042858       6
print(sim$true_ATT[cohort=="Average"])
#>     cohort event    ATTge
#> 1: Average     0 1.500668
#> 2: Average     1 2.500668
#> 3: Average     2 3.250501
#> 4: Average     3 4.250501
#> 5: Average     4 5.000000
#> 6: Average     5 6.000000
#> 7: Average     6 7.000000We set up the varnames to prepare for estimation:
varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"We check the usual standard errors, which are clustered on unit based
on varnames$id_name by default:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -1, max_event=3)
print(did$results_average)
#>    EventTime BaseEvent     ATTe    ATTe_SE Ncontrol Ntreated
#> 1:        -1        -1 0.000000 0.00000000     1503      749
#> 2:         0        -1 1.654672 0.07813674     1503      749
#> 3:         1        -1 2.710981 0.09333436     1503      749
#> 4:         2        -1 3.319803 0.11998209     1002      499
#> 5:         3        -1 4.597428 0.12376782      752      499Next, we cluster on the “cluster” variable by adding it to the
varnames and re-estimating:
varnames$cluster_names = "cluster" 
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3)
print(did$results_average)
#>    EventTime BaseEvent     ATTe    ATTe_SE Ncontrol Ntreated
#> 1:        -1        -1 0.000000 0.00000000     1503      749
#> 2:         0        -1 1.654672 0.08714482     1503      749
#> 3:         1        -1 2.710981 0.11885828     1503      749
#> 4:         2        -1 3.319803 0.13355642     1002      499
#> 5:         3        -1 4.597428 0.25745379      752      499parallel_cores argumentIf you have the parallel package installed, it is
trivial to execute your DiD estimation in parallel by setting the
parallel_cores argument in the DiD() command.
For example, DiD(..., parallel_cores = 4) will utilize 4
cores in parallel. To determine how many cores are available on your
system, just run the command parallel::detectCores() in R.
(I suggest leaving at least 1 core free to keep your system from
freezing.)
While parallel processing usually does not interact well with the
progress bar, I have written a modified version of R’s parallelization
protocol that correctly updates the progress bar as it completes its
tasks. Thus, if you have the progress package installed,
you will correctly see a progress bar and predicted completion time
while your DiD estimation is executing in parallel.
We simulate some data:
We set up the varnames to prepare for estimation:
varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"We run the estimation not in parallel:
We run the estimation in parallel with 2 processes:
Esets argumentSuppose you wish to average the DiD estimate across a few event
times, with a corresponding standard error for the average across event
times. This is done by providing a list to the Esets
argument in the DiD() command.
For example, if Esets = list(c(1,2,3)), then the output
will include the average DiD for event times e=1,2,3.
Multiple sets of event times can be provided, e.g.,
Esets = list(c(1,2,3), c(1,3)) will provide the average
across e=1,2,3 as well as the average for e=1
and e=3.
Event set averages are returned in the $results_Esets
argument of the output list. Sample size is not provided, as it is
unclear how to define the sample size when averaging multiple
statistics, each of which has a different sample size.
We simulate some data:
We set up the varnames to prepare for estimation:
varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"We run the estimation with two event set averages:
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3, Esets = list(c(1,2,3), c(1,3)))
print(did)
#> $results_cohort
#>     Cohort EventTime BaseEvent CalendarTime    ATTge   ATTge_SE Ncontrol
#>  1:   2007        -1        -1         2006 0.000000 0.00000000      751
#>  2:   2007         0        -1         2007 1.346483 0.11215961      751
#>  3:   2007         1        -1         2008 2.232383 0.09443787      751
#>  4:   2007         2        -1         2009 3.370599 0.10350708      751
#>  5:   2007         3        -1         2010 4.266872 0.10723955      501
#>  6:   2010        -1        -1         2009 0.000000 0.00000000      501
#>  7:   2010         0        -1         2010 1.506723 0.10838000      501
#>  8:   2010         1        -1         2011 2.660030 0.10325447      501
#>  9:   2010         2        -1         2012 3.637140 0.12739621      251
#> 10:   2010         3        -1         2013 4.398071 0.12737026      251
#> 11:   2012        -1        -1         2011 0.000000 0.00000000      251
#> 12:   2012         0        -1         2012 1.920898 0.13066712      251
#> 13:   2012         1        -1         2013 2.786013 0.12683651      251
#>     Ntreated
#>  1:      249
#>  2:      249
#>  3:      249
#>  4:      249
#>  5:      249
#>  6:      250
#>  7:      250
#>  8:      250
#>  9:      250
#> 10:      250
#> 11:      250
#> 12:      250
#> 13:      250
#> 
#> $results_average
#>    EventTime BaseEvent     ATTe    ATTe_SE Ncontrol Ntreated
#> 1:        -1        -1 0.000000 0.00000000     1503      749
#> 2:         0        -1 1.591695 0.06629046     1503      749
#> 3:         1        -1 2.559912 0.06200891     1503      749
#> 4:         2        -1 3.504136 0.08179983     1002      499
#> 5:         3        -1 4.332603 0.08219426      752      499
#> 
#> $results_Esets
#>     Eset ATT_Eset ATT_Eset_SE
#> 1: 1,2,3 3.340838  0.05596817
#> 2:   1,3 3.251005  0.05801106