collapse and plm

Fast Transformation and Exploration of Panel Data

Sebastian Krantz

2020-03-12

collapse is a C/C++ based package for data manipulation in R. Its aims are

  1. to facilitate complex data transformation and exploration tasks and

  2. to help make R code fast, flexible, parsimonious and programmer friendly.

This vignette focuses on the integration of collapse and the popular plm (‘Linear Models for Panel Data’) package by Yves Croissant and Giovanni Millo. It demonstrates the utility of the pseries and pdata.frame classes introduced in plm, together with the corresponding methods for fast collapse functions (implemented in C or C++), to extend and facilitate extremely fast computations on panel-vectors and panel-data.frames (20-100 times faster than native plm). The collapse package should enable R programmers to write, with very little effort, high-performance code in the domain of panel-data exploration and panel-data econometrics.

The computations considered are: between and within transformations (grouped averaging and centering); higher-dimensional between and within transformations (i.e. averaging and centering over multiple groups); standardizing (i.e. scaling and centering); weighted versions of all of the above; sequences of panel-lags / leads and of lagged / leaded and iterated differences and growth rates / log-differences; panel auto-, partial-auto- and cross-correlation functions; panel-data to (ts-)matrix / array conversions; and summary statistics for panel-data. Not really covered in this vignette is the whole suite of Fast Statistical Functions in the collapse package, which may of course also be used for grouped and weighted operations on panel-data, but which currently do not have methods for plm classes.


Note: To learn more about collapse, see the ‘Introduction to collapse’ vignette or the built-in structured documentation available under help("collapse-documentation") after installing the package. In addition help("collapse-package") provides a compact set of examples for quick-start.


The vignette is structured as follows:

For this vignette we will use a dataset (wlddev) supplied with collapse containing a panel of 4 key development indicators taken from the World Bank Development Indicators Database:

library(collapse)

head(wlddev)
#       country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA
# 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.292   NA 114440000
# 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.742   NA 233350000
# 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.185   NA 114880000
# 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA 33.624   NA 236450000
# 5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA 34.060   NA 302480000
# 6 Afghanistan   AFG 1966-01-01 1965   1960 South Asia Low income FALSE    NA 34.495   NA 370250000

fNobs(wlddev)      # This counts the number of non-missing observations in each column
# country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX    GINI     ODA 
#   12744   12744   12744   12744   12744   12744   12744   12744    8995   11068    1356    8336

fNdistinct(wlddev) # This counts the number of distinct values in each column
# country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX    GINI     ODA 
#     216     216      59      59       7       7       4       2    8995   10048     363    7564

Part 1: Fast Transformation of Panel Data

First let us convert this data to a plm panel-data.frame (class pdata.frame):

library(plm)

# This creates a panel-data frame
pwlddev <- pdata.frame(wlddev, index = c("iso3c", "year"))

str(pwlddev, give.attr = FALSE)
# Classes 'pdata.frame' and 'data.frame':   12744 obs. of  12 variables:
#  $ country: Factor w/ 216 levels "Afghanistan",..: 10 10 10 10 10 10 10 10 10 10 ...
#  $ iso3c  : Factor w/ 216 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ date   : pseries, format: "1961-01-01" "1962-01-01" "1963-01-01" ...
#  $ year   : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ decade : 'pseries' Named num  1960 1960 1960 1960 1960 1960 1970 1970 1970 1970 ...
#  $ region : Factor w/ 7 levels "East Asia & Pacific",..: 3 3 3 3 3 3 3 3 3 3 ...
#  $ income : Factor w/ 4 levels "High income",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ OECD   : 'pseries' Named logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#  $ PCGDP  : 'pseries' Named num  NA NA NA NA NA NA NA NA NA NA ...
#  $ LIFEEX : 'pseries' Named num  65.7 66.1 66.4 66.8 67.1 ...
#  $ GINI   : 'pseries' Named num  NA NA NA NA NA NA NA NA NA NA ...
#  $ ODA    : 'pseries' Named num  NA NA NA NA NA NA NA NA NA NA ...

# A pdata.frame has an index attribute attached [retrieved using index(pwlddev) or attr(pwlddev, "index")]
str(index(pwlddev))
# Classes 'pindex' and 'data.frame':    12744 obs. of  2 variables:
#  $ iso3c: Factor w/ 216 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ year : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...

# This shows the individual and time dimensions
pdim(pwlddev)
# Balanced Panel: n = 216, T = 59, N = 12744

# This shows which variables vary across which dimensions
pvar(pwlddev)
# no time variation:       country iso3c region income OECD PCGDP LIFEEX GINI ODA 
# no individual variation: date year decade PCGDP LIFEEX GINI ODA 
# all NA in time dimension for at least one individual:  PCGDP LIFEEX GINI ODA 
# all NA in ind. dimension for at least one time period: PCGDP LIFEEX GINI ODA

A plm::pdata.frame is a data.frame with panel identifiers attached as a list of factors in an index attribute (non-factor index variables are converted to factor). Each column in that data.frame is a Panel-Series (plm::pseries), which also has the panel identifiers attached:

# Panel-Series of GDP per Capita and Life-Expectancy at Birth
PCGDP <- pwlddev$PCGDP
LIFEEX <- pwlddev$LIFEEX
str(LIFEEX)
#  'pseries' Named num [1:12744] 65.7 66.1 66.4 66.8 67.1 ...
#  - attr(*, "names")= chr [1:12744] "ABW-1960" "ABW-1961" "ABW-1962" "ABW-1963" ...
#  - attr(*, "index")=Classes 'pindex' and 'data.frame':    12744 obs. of  2 variables:
#   ..$ iso3c: Factor w/ 216 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ year : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...

Now that we have explored the basic data structures provided in the plm package, let’s compute some transformations on them:

1.1 Between and Within Transformations

The functions fbetween and fwithin can be used to compute efficient between and within transformations on panel vectors and panel data.frames:

By default na.rm = TRUE, so both functions skip (preserve) missing values in the data (which, by the way, is the default for all collapse functions). For fbetween the output behavior can be altered with the fill argument: setting fill = TRUE will compute the group means on the complete cases in each group (as long as na.rm = TRUE), but replace all values in each group with the group mean (hence overwriting, or ‘filling up’, missing values):
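To illustrate, a minimal sketch using the LIFEEX panel-series created above (the printed values depend on the data and are omitted here):

```r
# Default (fill = FALSE): group means are computed and assigned on the
# complete cases only; missing values stay missing
head(fbetween(LIFEEX), 10)

# fill = TRUE: every value in a group, including missing ones, is replaced
# by the group mean computed on the complete cases
head(fbetween(LIFEEX, fill = TRUE), 10)
```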

For fwithin there is also a second mode of computation, enabled by the argument add.global.mean = TRUE, which adds the overall mean of the series back to the data after subtracting out the group means. This preserves the level of the data (and will only change the intercept when employed in a regression):
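A sketch of both modes, again on the LIFEEX panel-series:

```r
# Simple within-transformation: subtracts out the country means
head(fwithin(LIFEEX), 10)

# Adding back the overall mean preserves the level of the series
head(fwithin(LIFEEX, add.global.mean = TRUE), 10)
```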

fbetween and fwithin can also be applied to panel-data.frames where they will perform these computations variable by variable:

Next to fbetween and fwithin there also exist short versions B and W, which I have termed transformation operators. These are essentially wrappers around fbetween and fwithin and provide the same functionality, but are more parsimonious to employ in regression formulas and also offer additional features when applied to panel-data.frames. For panel-series, B and W are exact analogues of fbetween and fwithin, just under a shorter name:

When applied to panel-data.frames, B and W offer some additional utility by (a) allowing you to select columns to transform using the cols argument (the default is cols = is.numeric, so all numeric columns are selected for transformation), (b) allowing you to add a prefix to the transformed columns with the stub argument (default stub = "B." for B and stub = "W." for W) and (c) preserving the panel-id’s with the keep.ids argument (default keep.ids = TRUE):
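For example (a sketch; the stub value "demeaned." is just an arbitrary illustration):

```r
# Between-transform all numeric columns, keeping the panel-id's (default behavior)
head(B(pwlddev))

# Within-transform only columns 9 through 12, with a custom prefix
head(W(pwlddev, cols = 9:12, stub = "demeaned."))
```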

fbetween / B and fwithin / W also support weighted computations. This of course applies more to panel-survey settings, but for the sake of illustration suppose we wanted to weight our between and within transformations by the amount of ODA these countries received:
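A sketch of such weighted transformations, passing the weight column as a one-sided formula:

```r
# ODA-weighted between- and within-transformations of columns 9 through 11
head(B(pwlddev, w = ~ ODA, cols = 9:11))
head(W(pwlddev, w = ~ ODA, cols = 9:11))
```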

As shown above, with B and W the weight column can also be passed as a formula or character string, whereas fbetween and fwithin require all inputs to be passed directly in terms of data (i.e. fbetween(get_vars(pwlddev, 9:11), w = pwlddev$ODA)), and the weight vector or id columns are never preserved in the output. Therefore in most applications B and W are probably more convenient for quick use, whereas fbetween and fwithin are the preferred programmer’s choice, also because they have a little less R-overhead which makes them a tiny bit faster.

1.2 Higher-Dimensional Between and Within Transformations

Analogous to fbetween / B and fwithin / W, collapse provides a duo of functions and operators fHDbetween / HDB and fHDwithin / HDW to efficiently average and center data on multiple groups. The credit for this goes to Simen Gaure, the author of the lfe package, who wrote an efficient C implementation of the alternating-projections algorithm to perform this task. fHDbetween / HDB and fHDwithin / HDW enrich this implementation (available in the function lfe::demeanlist) by providing more options regarding missing values, and by also allowing continuous covariates and (full) interactions to be projected out alongside factors. The methods for pseries and pdata.frame’s are however rather simple: they simultaneously center panel-vectors on all panel-identifiers in the index (which can be more than 2):

The architecture of fHDbetween / HDB and fHDwithin / HDW differs a bit from fbetween / B and fwithin / W. This is essentially a consequence of the underlying C-implementation (accessed through lfe::demeanlist), which was not built to accommodate missing values. fHDbetween / HDB and fHDwithin / HDW therefore both have an argument fill = TRUE (the default), which stipulates that missing values in the data are preserved in the output. The collapse default na.rm = TRUE again ensures that only complete cases are used for the computation:

# Missing values are preserved in the output when fill = TRUE (the default)
head(HDB(PCGDP), 30)  
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965 ABW-1966 ABW-1967 ABW-1968 ABW-1969 ABW-1970 
#       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA 
# ABW-1971 ABW-1972 ABW-1973 ABW-1974 ABW-1975 ABW-1976 ABW-1977 ABW-1978 ABW-1979 ABW-1980 ABW-1981 
#       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA 
# ABW-1982 ABW-1983 ABW-1984 ABW-1985 ABW-1986 ABW-1987 ABW-1988 ABW-1989 
#       NA       NA       NA       NA 21750.50 22024.44 22371.47 22670.55

# When fill = FALSE, only the complete cases are returned
nofill <- HDB(PCGDP, fill = FALSE)
head(nofill, 30)
# ABW-1986 ABW-1987 ABW-1988 ABW-1989 ABW-1990 ABW-1991 ABW-1992 ABW-1993 ABW-1994 ABW-1995 ABW-1996 
# 21750.50 22024.44 22371.47 22670.55 22990.95 23001.82 23042.98 23085.61 23307.28 23506.84 23690.18 
# ABW-1997 ABW-1998 ABW-1999 ABW-2000 ABW-2001 ABW-2002 ABW-2003 ABW-2004 ABW-2005 ABW-2006 ABW-2007 
# 24025.68 24305.15 24611.12 25073.75 25255.17 25445.18 25693.93 26195.16 26517.71 27017.07 27535.56 
# ABW-2008 ABW-2009 ABW-2010 ABW-2011 ABW-2012 ABW-2013 ABW-2014 ABW-2015 
# 27560.67 26822.40 27049.76 27246.63 27290.13 27465.78 27646.39 27839.22

# This results in a shorter panel-vector 
length(nofill)   
# [1] 8995
length(PCGDP)
# [1] 12744

# The cases that were missing and removed from the output are available as an attribute
head(attr(nofill, "na.rm"), 30)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965 ABW-1966 ABW-1967 ABW-1968 ABW-1969 ABW-1970 
#        1        2        3        4        5        6        7        8        9       10       11 
# ABW-1971 ABW-1972 ABW-1973 ABW-1974 ABW-1975 ABW-1976 ABW-1977 ABW-1978 ABW-1979 ABW-1980 ABW-1981 
#       12       13       14       15       16       17       18       19       20       21       22 
# ABW-1982 ABW-1983 ABW-1984 ABW-1985 ABW-2018 AFG-1960 AFG-1961 AFG-1962 
#       23       24       25       26       59       60       61       62

In the pdata.frame methods there are 3 different choices of how to deal with missing values. The default for the plm classes is variable.wise = TRUE, which essentially applies fHDbetween.pseries and fHDwithin.pseries (with the default fill = TRUE) sequentially to all columns. This is the same behavior as in fbetween / B and fwithin / W, which also consider the column-wise complete observations:

If variable.wise = FALSE, fHDbetween / HDB and fHDwithin / HDW will only consider the complete cases in the dataset, but still return a dataset of the same dimensions (as long as fill = TRUE), resulting in some rows all-missing:

Finally, if also fill = FALSE, the behavior is the same as in the pseries method: Missing cases are removed from the data:
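The three choices can be sketched as follows (assuming the cols argument of the HDW operator, and restricting attention to the numeric columns 9 through 12):

```r
# (1) Default: variable.wise = TRUE, column-wise complete observations are used
head(HDW(pwlddev, cols = 9:12))

# (2) variable.wise = FALSE: only casewise-complete observations are used,
#     but the output retains the same dimensions (fill = TRUE)
head(HDW(pwlddev, variable.wise = FALSE, cols = 9:12))

# (3) With fill = FALSE in addition, missing cases are removed from the output
head(HDW(pwlddev, variable.wise = FALSE, fill = FALSE, cols = 9:12))
```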

Notes: (1) Because of the different missing-case options and associated challenges, panel-identifiers are not preserved in HDB and HDW. (2) The defaults variable.wise = TRUE and fill = TRUE were only set for the pseries and pdata.frame methods, to harmonize them with fbetween / B and fwithin / W for these classes. In the default, matrix and data.frame methods, the defaults are variable.wise = FALSE and fill = FALSE (i.e. missing cases are removed beforehand), which is generally more efficient.

1.3 Scaling and Centering

Next to the above functions for grouped centering and averaging, the function / operator pair fscale / STD can be used to efficiently standardize (i.e. scale and center) panel data along an arbitrary dimension. The architecture is identical to that of fwithin / W or fbetween / B.
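For example, on a panel-series:

```r
# Standardizing Life-Expectancy within each country (mean 0, sd 1 per country)
head(fscale(LIFEEX))
head(STD(LIFEEX))  # same computation via the operator
```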

And similarly for pdata.frame’s:
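A sketch:

```r
# Standardizing the numeric columns of the panel-data.frame,
# adding the prefix "STD." and keeping the panel-id's
head(STD(pwlddev))
```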

Scaling without centering can be done with the fsd function:
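A possible sketch, assuming the TRA (‘transform’) argument of the Fast Statistical Functions; since these functions have no pseries methods, the individual identifier is passed explicitly:

```r
# Dividing Life-Expectancy by its within-country standard deviation
# (TRA = "/" sweeps the computed statistic out of the data by division)
head(fsd(LIFEEX, g = index(LIFEEX)$iso3c, TRA = "/"))
```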

Again, the Fast Statistical Functions in collapse do not have methods for pseries or pdata.frame’s (yet).

1.4 Panel Lags / Leads, Differences and Growth Rates

A proper and fast implementation of panel-lags, differences and growth rates has been missing in R so far. By ‘proper’ I mean an implementation that does not require panel-vectors to be sorted (amounting i.e. to a grouped lag on sorted data), but that takes into account both individual and time-identifiers in the computation. plm::lag and dplyr::lag (with the order_by argument) provide proper implementations but rely on base R (split-apply-combine logic), which makes them slow. data.table::shift allows for pretty fast grouped lags on sorted data, but without taking into account the time-identifiers (i.e. ‘improper’; you can do something like DT[order(time), shift(col1), by = pid] in data.table, but that sorts the data and is definitely more computationally expensive than the implementation introduced here).

With flag / L / F, fdiff / D and fgrowth / G, collapse provides a fast and comprehensive C++ based solution to the computation of (sequences of) lags / leads and (sequences of) lagged / leaded and suitably iterated differences and growth rates / log-differences on panel-data. The pseries and pdata.frame methods to these functions and associated transformation operators automatically use the panel-identifiers in the ‘index’ attached to these objects (where the last variable in the ‘index’ is taken as the time-variable and the variables before that are taken as individual identifiers) to perform fast fully-identified time-dependent operations on panel-data, without the need of sorting the data.

With flag / L / F, it is easy to lag or lead pseries:

It is also possible to compute a sequence of lags / leads using flag or one of the operators:
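For example:

```r
# One panel-lag and one panel-lead of Life-Expectancy
head(flag(LIFEEX))       # lag,  same as L(LIFEEX)
head(flag(LIFEEX, -1))   # lead, same as F(LIFEEX)

# A matrix containing one lead and lags 1 through 3
head(L(LIFEEX, -1:3))
```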

Of course the lag orders may be unevenly spaced, i.e. L(x, -1:3*12) would compute seasonal lags on monthly data. On pdata.frame’s, the effects of flag and L / F differ insofar as flag will just lag the entire dataset without preserving identifiers (although the index attribute is always preserved), whereas L / F by default (cols = is.numeric) select the numeric variables and add the panel-id’s on the left (default keep.ids = TRUE):

# This lags the entire data
head(flag(pwlddev))
#          L1.country L1.iso3c    L1.date L1.year L1.decade                  L1.region   L1.income
# ABW-1960       <NA>     <NA>       <NA>    <NA>        NA                       <NA>        <NA>
# ABW-1961      Aruba      ABW 1961-01-01    1960      1960 Latin America & Caribbean  High income
# ABW-1962      Aruba      ABW 1962-01-01    1961      1960 Latin America & Caribbean  High income
# ABW-1963      Aruba      ABW 1963-01-01    1962      1960 Latin America & Caribbean  High income
# ABW-1964      Aruba      ABW 1964-01-01    1963      1960 Latin America & Caribbean  High income
# ABW-1965      Aruba      ABW 1965-01-01    1964      1960 Latin America & Caribbean  High income
#          L1.OECD L1.PCGDP L1.LIFEEX L1.GINI L1.ODA
# ABW-1960      NA       NA        NA      NA     NA
# ABW-1961   FALSE       NA    65.662      NA     NA
# ABW-1962   FALSE       NA    66.074      NA     NA
# ABW-1963   FALSE       NA    66.444      NA     NA
# ABW-1964   FALSE       NA    66.787      NA     NA
# ABW-1965   FALSE       NA    67.113      NA     NA

# This lags only numeric columns and preserves panel-id's
head(L(pwlddev))
#          iso3c year L1.decade L1.PCGDP L1.LIFEEX L1.GINI L1.ODA
# ABW-1960   ABW 1960        NA       NA        NA      NA     NA
# ABW-1961   ABW 1961      1960       NA    65.662      NA     NA
# ABW-1962   ABW 1962      1960       NA    66.074      NA     NA
# ABW-1963   ABW 1963      1960       NA    66.444      NA     NA
# ABW-1964   ABW 1964      1960       NA    66.787      NA     NA
# ABW-1965   ABW 1965      1960       NA    67.113      NA     NA

# This lags only columns 9 through 12 and preserves panel-id's
head(L(pwlddev, cols = 9:12))
#          iso3c year L1.PCGDP L1.LIFEEX L1.GINI L1.ODA
# ABW-1960   ABW 1960       NA        NA      NA     NA
# ABW-1961   ABW 1961       NA    65.662      NA     NA
# ABW-1962   ABW 1962       NA    66.074      NA     NA
# ABW-1963   ABW 1963       NA    66.444      NA     NA
# ABW-1964   ABW 1964       NA    66.787      NA     NA
# ABW-1965   ABW 1965       NA    67.113      NA     NA

We can also easily compute a sequence of lags / leads on a panel-data.frame:
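For example:

```r
# One lead and lags 1 through 3 of GDP per Capita and Life-Expectancy,
# with the panel-id's preserved on the left
head(L(pwlddev, -1:3, cols = 9:10))
```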

Essentially the same functionality applies to fdiff / D and fgrowth / G, with the main difference that these functions also have a diff argument to determine the number of iterations:
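A sketch:

```r
# First panel-difference of Life-Expectancy
head(fdiff(LIFEEX))

# Second difference (the first difference iterated twice)
head(fdiff(LIFEEX, diff = 2))
```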

By default, growth rates are calculated as (x - lag(x)) / lag(x) * 100, but we can also compute growth rates based on log-differences, which are often used in economics for various reasons (e.g. symmetry, exponential trends in macro-series, heteroskedasticity, properties of the log, etc.). In that case the formula is (log(x) - lag(log(x))) * 100, or equivalently log(x/lag(x)) * 100:
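A sketch of both variants:

```r
# Exact growth rate in percentage terms: (x - lag(x)) / lag(x) * 100
head(fgrowth(LIFEEX))

# Log-difference growth rate: log(x / lag(x)) * 100
head(fgrowth(LIFEEX, logdiff = TRUE))
```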

It is also possible to compute sequences of lagged / leaded and iterated differences and growth rates:

# first and second forward-difference and first and second difference of lags 1-3 of Life-Expectancy
head(D(LIFEEX, -1:3, 1:2))
#             FD1    FD2     --    D1     D2  L2D1   L2D2  L3D1 L3D2
# ABW-1960 -0.412 -0.042 65.662    NA     NA    NA     NA    NA   NA
# ABW-1961 -0.370 -0.027 66.074 0.412     NA    NA     NA    NA   NA
# ABW-1962 -0.343 -0.017 66.444 0.370 -0.042 0.782     NA    NA   NA
# ABW-1963 -0.326 -0.004 66.787 0.343 -0.027 0.713     NA 1.125   NA
# ABW-1964 -0.322  0.005 67.113 0.326 -0.017 0.669 -0.113 1.039   NA
# ABW-1965 -0.327  0.006 67.435 0.322 -0.004 0.648 -0.065 0.991   NA

# Same with (exact) growth rates
head(G(LIFEEX, -1:3, 1:2))
#                 FG1       FG2     --        G1         G2      L2G1      L2G2     L3G1 L3G2
# ABW-1960 -0.6235433 11.974895 65.662        NA         NA        NA        NA       NA   NA
# ABW-1961 -0.5568599  8.428580 66.074 0.6274558         NA        NA        NA       NA   NA
# ABW-1962 -0.5135730  5.728297 66.444 0.5599782 -10.754153 1.1909476        NA       NA   NA
# ABW-1963 -0.4857479  1.727984 66.787 0.5162242  -7.813521 1.0790931        NA 1.713320   NA
# ABW-1964 -0.4774968 -1.051555 67.113 0.4881189  -5.444387 1.0068629 -15.45699 1.572479   NA
# ABW-1965 -0.4825714 -1.319230 67.435 0.4797878  -1.706782 0.9702487 -10.08666 1.491482   NA

# Same with Log-differences (growth rates)
head(G(LIFEEX, -1:3, 1:2, logdiff = TRUE))
#              FDlog1 FDlog2     --     Dlog1      Dlog2  L2Dlog1   L2Dlog2  L3Dlog1 L3Dlog2
# ABW-1960 -0.6254955    NaN 65.662        NA         NA       NA        NA       NA      NA
# ABW-1961 -0.5584162    NaN 66.074 0.6254955         NA       NA        NA       NA      NA
# ABW-1962 -0.5148963    NaN 66.444 0.5584162 -11.343957 1.183912        NA       NA      NA
# ABW-1963 -0.4869315    NaN 66.787 0.5148963  -8.113893 1.073312        NA 1.698808      NA
# ABW-1964 -0.4786405    NaN 67.113 0.4869315  -5.584209 1.001828 -16.69977 1.560244      NA
# ABW-1965 -0.4837395    NaN 67.435 0.4786405  -1.717366 0.965572 -10.57842 1.480468      NA

Another important advantage of the collapse functions compared to plm::lag or plm::diff is that the panel-identifiers are preserved even if a matrix of lags / leads / differences or growth rates is returned. This allows for nested panel-computations; for example, we can compute shifted sequences of lagged / leaded and iterated panel-differences:
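For example, on the panel-series:

```r
# Lagging a matrix of leaded / lagged and iterated differences: since the
# panel-identifiers are preserved, the outer lag is still fully identified
head(L(D(LIFEEX, -1:3, 1:2), 0:1))
```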

All of this naturally generalizes to computations on pdata.frames:

head(D(pwlddev, -1:3, 1:2, cols = 9:10), 3)
#          iso3c year FD1.PCGDP FD2.PCGDP PCGDP D1.PCGDP D2.PCGDP L2D1.PCGDP L2D2.PCGDP L3D1.PCGDP
# ABW-1960   ABW 1960        NA        NA    NA       NA       NA         NA         NA         NA
# ABW-1961   ABW 1961        NA        NA    NA       NA       NA         NA         NA         NA
# ABW-1962   ABW 1962        NA        NA    NA       NA       NA         NA         NA         NA
#          L3D2.PCGDP FD1.LIFEEX FD2.LIFEEX LIFEEX D1.LIFEEX D2.LIFEEX L2D1.LIFEEX L2D2.LIFEEX
# ABW-1960         NA     -0.412     -0.042 65.662        NA        NA          NA          NA
# ABW-1961         NA     -0.370     -0.027 66.074     0.412        NA          NA          NA
# ABW-1962         NA     -0.343     -0.017 66.444     0.370    -0.042       0.782          NA
#          L3D1.LIFEEX L3D2.LIFEEX
# ABW-1960          NA          NA
# ABW-1961          NA          NA
# ABW-1962          NA          NA

head(L(D(pwlddev, -1:3, 1:2, cols = 9:10), 0:1), 3)
#          iso3c year FD1.PCGDP L1.FD1.PCGDP FD2.PCGDP L1.FD2.PCGDP PCGDP L1.PCGDP D1.PCGDP
# ABW-1960   ABW 1960        NA           NA        NA           NA    NA       NA       NA
# ABW-1961   ABW 1961        NA           NA        NA           NA    NA       NA       NA
# ABW-1962   ABW 1962        NA           NA        NA           NA    NA       NA       NA
#          L1.D1.PCGDP D2.PCGDP L1.D2.PCGDP L2D1.PCGDP L1.L2D1.PCGDP L2D2.PCGDP L1.L2D2.PCGDP
# ABW-1960          NA       NA          NA         NA            NA         NA            NA
# ABW-1961          NA       NA          NA         NA            NA         NA            NA
# ABW-1962          NA       NA          NA         NA            NA         NA            NA
#          L3D1.PCGDP L1.L3D1.PCGDP L3D2.PCGDP L1.L3D2.PCGDP FD1.LIFEEX L1.FD1.LIFEEX FD2.LIFEEX
# ABW-1960         NA            NA         NA            NA     -0.412            NA     -0.042
# ABW-1961         NA            NA         NA            NA     -0.370        -0.412     -0.027
# ABW-1962         NA            NA         NA            NA     -0.343        -0.370     -0.017
#          L1.FD2.LIFEEX LIFEEX L1.LIFEEX D1.LIFEEX L1.D1.LIFEEX D2.LIFEEX L1.D2.LIFEEX L2D1.LIFEEX
# ABW-1960            NA 65.662        NA        NA           NA        NA           NA          NA
# ABW-1961        -0.042 66.074    65.662     0.412           NA        NA           NA          NA
# ABW-1962        -0.027 66.444    66.074     0.370        0.412    -0.042           NA       0.782
#          L1.L2D1.LIFEEX L2D2.LIFEEX L1.L2D2.LIFEEX L3D1.LIFEEX L1.L3D1.LIFEEX L3D2.LIFEEX
# ABW-1960             NA          NA             NA          NA             NA          NA
# ABW-1961             NA          NA             NA          NA             NA          NA
# ABW-1962             NA          NA             NA          NA             NA          NA
#          L1.L3D2.LIFEEX
# ABW-1960             NA
# ABW-1961             NA
# ABW-1962             NA

1.5 Panel-Data to Array Conversions

Viewing and transforming panel-data stored in an array can be a powerful strategy, especially as it provides much more direct access to the different dimensions of the data. The function psmat can be used to efficiently transform pseries to a 2D matrix, and pdata.frame’s to a 3D array:
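For example:

```r
# Transforming the Life-Expectancy panel-series to a country x year matrix
tsm <- psmat(LIFEEX)
str(tsm)       # a 216 x 59 matrix: countries in rows, years in columns
tsm[1:3, 1:5]  # first 3 countries, first 5 years
```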

Applying psmat to a pdata.frame yields a 3D array:

This format can be very convenient to quickly and freely access data for different countries, variables and time-periods:
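A sketch (the object name psar is arbitrary, and the dimension names are assumed to be taken from the panel-id’s and variable names):

```r
# 3D array: countries x years x variables
psar <- psmat(pwlddev, cols = 9:12)
dim(psar)                  # 216 x 59 x 4

# Accessing data freely across the three dimensions:
psar["AFG", , "LIFEEX"]    # Life-Expectancy series for Afghanistan
psar[, "2000", ]           # all 4 variables for all countries in the year 2000
```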

psmat can also return the output as a list of panel-series matrices:
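A sketch, assuming the array argument of psmat (with array = FALSE returning a list of matrices):

```r
# A list of 4 panel-series matrices, one for each selected variable
pslist <- psmat(pwlddev, cols = 9:12, array = FALSE)
str(pslist, give.attr = FALSE)
```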

This list can then be unlisted using the function unlist2d (for unlisting in 2-dimensions), to yield a reshaped data.frame:

head(unlist2d(pslist, idcols = "Variable", row.names = "Country Code"), 3)
#   Variable Country Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
# 1    PCGDP          ABW   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
# 2    PCGDP          AFG   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
# 3    PCGDP          AGO   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#   1975 1976 1977 1978 1979    1980     1981     1982     1983     1984     1985      1986     1987
# 1   NA   NA   NA   NA   NA      NA       NA       NA       NA       NA       NA 15669.616 18427.61
# 2   NA   NA   NA   NA   NA      NA       NA       NA       NA       NA       NA        NA       NA
# 3   NA   NA   NA   NA   NA 2969.96 2742.656 2646.013 2660.145 2724.889 2732.077  2730.993  2767.18
#        1988      1989      1990      1991      1992     1993      1994      1995      1996
# 1 22134.017 24837.951 25357.787 26329.313 26401.969 26663.21 27272.310 26705.181 26087.776
# 2        NA        NA        NA        NA        NA       NA        NA        NA        NA
# 3  2861.356  2786.726  2614.493  2560.063  2333.477  1716.21  1684.215  1878.793  2073.215
#        1997     1998      1999      2000      2001       2002      2003       2004       2005
# 1 27190.501 27151.92 26954.405 28417.384 26966.055 25508.3025 25469.287 27005.5295 26979.8854
# 2        NA       NA        NA        NA        NA   339.6333   352.244   341.6125   365.5487
# 3  2164.082  2204.91  2190.087  2189.561  2208.792  2426.4318  2412.393  2582.6465  2866.4347
#         2006       2007       2008       2009      2010       2011       2012       2013       2014
# 1 27046.7604 27428.1202 27367.2810 24464.1745 23512.603 24231.3389 23777.3161 24629.0800 24692.4972
# 2   372.8967   412.9196   418.4788   495.1089   550.515   536.0125   584.9074   597.5252   594.5741
# 3  3085.4248  3394.5123  3641.4475  3544.0266  3585.906  3580.2699  3750.2091  3799.4296  3846.2409
#         2015       2016       2017 2018
# 1 24452.6066 24288.9871 24508.8091   NA
# 2   585.7083   583.0551   583.8696   NA
# 3  3751.6945  3533.8652  3413.6564   NA

Of course we could also have applied some transformation (like computing pairwise correlations) to each matrix before unlisting. In any case this kind of programming provides lots of possibilities to explore and manipulate panel data (as we will see in Part 2).

Benchmarks

Below I benchmark the collapse implementation against native plm. To do that I extend the dataset used so far to have approx 1 million observations:

The data has 21600 individuals (countries), each observed for 59 years; the total number of rows is 1274400. We can pull out a series of life expectancy and run some benchmarks. My Windows laptop on which these benchmarks were run has a 2x 2.2 GHz Intel i5 processor, 8 GB of DDR3 RAM and a Samsung SSD hard drive (so a decent laptop, but nothing fancy).
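The extended dataset (here called data) might be constructed along these lines (a hypothetical sketch; the exact construction is not essential, only the resulting dimensions):

```r
# Replicate wlddev 100 times, making the country codes unique in each copy
data <- do.call(rbind, lapply(1:100, function(i)
          transform(wlddev, iso3c = paste0(iso3c, i))))
data <- pdata.frame(data, index = c("iso3c", "year"))
pdim(data)  # Balanced Panel: n = 21600, T = 59, N = 1274400
```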

# Creating the extended panel-series for Life Expectancy (l for large)
LIFEEX_l <- data$LIFEEX
str(LIFEEX_l)
#  'pseries' Named num [1:1274400] 65.7 66.1 66.4 66.8 67.1 ...
#  - attr(*, "names")= chr [1:1274400] "ABW1-1960" "ABW1-1961" "ABW1-1962" "ABW1-1963" ...
#  - attr(*, "index")=Classes 'pindex' and 'data.frame':    1274400 obs. of  2 variables:
#   ..$ iso3c: Factor w/ 21600 levels "ABW1","ABW10",..: 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ year : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...

# Between Transformations
system.time(Between(LIFEEX_l, na.rm = TRUE))
#    user  system elapsed 
#    0.29    0.00    0.29
system.time(fbetween(LIFEEX_l))
#    user  system elapsed 
#    0.00    0.03    0.03

# Within Transformations
system.time(Within(LIFEEX_l, na.rm = TRUE))
#    user  system elapsed 
#    0.44    0.04    0.49
system.time(fwithin(LIFEEX_l))
#    user  system elapsed 
#    0.01    0.02    0.03

# Higher-Dimenional Between and Within Transformations
system.time(fHDbetween(LIFEEX_l))
#    user  system elapsed 
#    0.08    0.04    0.13
system.time(fHDwithin(LIFEEX_l))
#    user  system elapsed 
#    0.11    0.03    0.14

# Single Lag
system.time(plm::lag(LIFEEX_l))
#    user  system elapsed 
#    0.60    0.00    0.59
system.time(flag(LIFEEX_l))
#    user  system elapsed 
#    0.02    0.00    0.02

# Sequence of Lags / Leads
system.time(plm::lag(LIFEEX_l, -1:3))
#    user  system elapsed 
#    2.54    0.16    2.70
system.time(flag(LIFEEX_l, -1:3))
#    user  system elapsed 
#    0.04    0.00    0.03

# Single difference
system.time(diff(LIFEEX_l))
#    user  system elapsed 
#    0.65    0.03    0.69
system.time(fdiff(LIFEEX_l))
#    user  system elapsed 
#    0.02    0.00    0.01

# Iterated Difference
system.time(fdiff(LIFEEX_l, diff = 2))
#    user  system elapsed 
#    0.03    0.00    0.03

# Sequence of Lagged / Leaded and iterated differences
system.time(fdiff(LIFEEX_l, -1:3, 1:2))
#    user  system elapsed 
#    0.02    0.06    0.08

# Single Growth Rate
system.time(fgrowth(LIFEEX_l))
#    user  system elapsed 
#    0.03    0.00    0.03

# Single Log-Difference
system.time(fgrowth(LIFEEX_l, logdiff = TRUE))
#    user  system elapsed 
#    0.11    0.00    0.11

# Panel-Series to Matrix Conversion
# system.time(as.matrix(LIFEEX_l))  This takes about 3 minutes to compute
system.time(psmat(LIFEEX_l))
#    user  system elapsed 
#       0       0       0

The results show that I did not promise too much in the introduction. A speed gain of 20-40x is the norm; for certain operations, such as the sequence of lags and leads, the speed gain is about 100x, and the panel-series to matrix conversion is about 300x faster using collapse than native plm. I am sure some will want to see a comparison with data.table:

The above dataset has 1 million obs in 20 thousand groups, but what about 10 million obs and 1 million groups? Do collapse functions scale efficiently as the data and the number of groups grow large? Here is a simple benchmark:
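Such a benchmark could be sketched as follows (a hypothetical setup with simulated data; timings will of course vary by machine):

```r
set.seed(101)
g <- sample.int(1e6, 1e7, replace = TRUE)  # 10 million obs, 1 million groups
x <- rnorm(1e7)

system.time(fbetween(x, g))  # grouped averaging
system.time(fwithin(x, g))   # grouped centering
system.time(flag(x, 1, g))   # grouped lag (no time variable supplied here,
                             # so the data are treated as sorted within groups)
```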

The message is clear: collapse functions perform very well even as the number of groups grows large. In fact, tests show that the large-sample performance of collapse aggregations is similar to data.table, and collapse grouped transformations like the ones shown here are generally faster than what can be done with data.table.

The conclusion of this benchmark analysis is that collapse’s fast functions, with or without the help of plm classes, allow for very fast transformations of panel-data, and should enable R programmers and econometricians to implement high-performance panel-data estimators without having to dive into C/C++ themselves or resort to data.table metaprogramming.

Part 2: Fast Exploration of Panel-Data

collapse also provides some essential functions to summarize and explore panel data, such as fast summary-statistics for panel-data, panel-auto, partial-auto and cross-correlation functions, and a fast F-test to test fixed effects and other exclusion restrictions on (large) panel-data models. I also offer some suggestions on applying simple correlational and unsupervised learning tools to panel-series matrices to learn more about the data.

2.1 Summary Statistics for Panel-Data

Efficient summary statistics for panel data have long been implemented in other statistical software. The command qsu, shorthand for ‘quick-summary’, is a very efficient summary statistics command inspired by the xtsummarize command in the Stata statistical software. It computes a default set of 5 statistics (N, mean, sd, min and max) and can also compute higher moments (skewness and kurtosis) in a single pass through the data (using a numerically stable online algorithm generalized from Welford’s Algorithm for variance computations). With panel-data, qsu computes these statistics not just on the raw data, but also on the between-transformed and within-transformed data:
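For instance, assuming pwlddev is the pdata.frame version of the wlddev dataset created with plm (a sketch; output omitted):

```r
library(collapse)  # provides qsu and the wlddev dataset
library(plm)
pwlddev <- pdata.frame(wlddev, index = c("iso3c", "year"))

# Default statistics plus skewness and kurtosis, decomposed into
# overall, between-country and within-country components
qsu(pwlddev$PCGDP, higher = TRUE)
```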

Key statistics to look at in this summary are the sample size and the standard-deviation decomposed into the between-individuals and the within-individuals standard-deviation: For GDP per Capita we have 8995 observations in the panel series for 203 countries, with on average 44.31 observations (time-periods T) per country. The between-country standard deviation is 19600 USD, around 3-times larger than the within-country (over-time) standard deviation of 6300 USD. Regarding the mean, the Between-Mean computed as a cross-sectional average of country averages usually differs slightly from the overall average taken across all data points. The within-transformed data is computed and summarized with the overall mean added back (i.e. as in fwithin(PCGDP, add.global.mean = TRUE)).

We can also do groupwise panel-statistics and qsu also supports weights. For the sake of illustration, below I summarize the data by income group with unit weights1:

qsu(pwlddev, ~ income, w = rep(1, nrow(pwlddev)), cols = 9:12, higher = TRUE)
# , , Overall, PCGDP
# 
#                       N/T      Mean        SD     Min        Max  Skew   Kurt
# High income          3038  28974.73  22910.72  944.29  191586.64  2.15  10.25
# Low income           1405     596.8    308.21  131.65     1506.3  1.15   3.59
# Lower middle income  2120   1583.37    890.74  150.22    4662.88  0.83   3.28
# Upper middle income  2432   4849.75   2959.23  131.96   20333.94  1.32   5.21
# 
# , , Between, PCGDP
# 
#                      N/T      Mean        SD      Min        Max  Skew   Kurt
# High income           70  28974.73  20222.54  5191.59  141165.08  2.14  10.28
# Low income            30     596.8       276    255.4    1340.72  1.28    3.8
# Lower middle income   47   1583.37    702.74    410.2    3120.44   0.3   2.13
# Upper middle income   56   4849.75   2325.34  1662.03   13171.53  1.35    5.1
# 
# , , Within, PCGDP
# 
#                        N/T      Mean        SD        Min       Max  Skew  Kurt
# High income           43.4  11563.65  10767.99  -30529.09  75348.07  0.42  6.05
# Low income           46.83  11563.65    137.18    11020.6  12234.64  0.39  4.91
# Lower middle income  45.11  11563.65    547.34     9717.2   14037.9  0.65  4.98
# Upper middle income  43.43  11563.65   1830.25    4528.64  24375.59  0.72  8.47
# 
# , , Overall, LIFEEX
# 
#                       N/T   Mean    SD    Min    Max   Skew  Kurt
# High income          3682  73.22  5.51  42.67  85.42  -1.04  5.81
# Low income           1881  49.62  8.89  27.61  74.43   0.24  2.64
# Lower middle income  2628  58.56  9.39  18.91  76.25  -0.43  2.77
# Upper middle income  2877  65.97  7.65  36.74  79.83  -1.03  3.98
# 
# , , Between, LIFEEX
# 
#                      N/T   Mean    SD    Min    Max   Skew  Kurt
# High income           74  73.22  3.34  63.31  85.42  -0.65  3.17
# Low income            33  49.62  5.25  39.35  66.69   1.27  5.67
# Lower middle income   47  58.56  6.63  44.29  71.12  -0.17  2.27
# Upper middle income   53  65.97  5.13  47.29  73.99  -1.19  4.95
# 
# , , Within, LIFEEX
# 
#                        N/T   Mean    SD    Min    Max   Skew  Kurt
# High income          49.76  63.84  4.38   43.2  77.56  -0.47  4.07
# Low income              57  63.84  7.18  43.74  83.26      0  2.55
# Lower middle income  55.91  63.84  6.64  33.47  83.86   -0.2  3.55
# Upper middle income  54.28  63.84  5.68  41.29  81.95  -0.48  3.86
# 
# , , Overall, GINI
# 
#                      N/T   Mean    SD   Min   Max   Skew  Kurt
# High income          478  34.32  7.86    21  58.9    1.3  4.15
# Low income           109  41.47  6.79  28.9  65.8   0.65  3.93
# Lower middle income  330  40.07  9.36    24  63.2   0.48  2.27
# Upper middle income  439  43.91  9.75  16.2  64.8  -0.17  2.41
# 
# , , Between, GINI
# 
#                      N/T   Mean    SD    Min    Max   Skew  Kurt
# High income           40  34.32  7.62  25.28  54.22   1.28  3.86
# Low income            30  41.47  4.91  32.13   53.7   0.26  3.06
# Lower middle income   45  40.07  8.67  27.93  56.25   0.42  1.88
# Upper middle income   46  43.91  9.24  23.37  61.71  -0.16  2.12
# 
# , , Within, GINI
# 
#                        N/T  Mean    SD    Min    Max   Skew  Kurt
# High income          11.95  39.4  1.94  31.22  46.86  -0.19   5.5
# Low income            3.63  39.4  4.69  23.96   54.8   0.03  4.17
# Lower middle income   7.33  39.4  3.53  28.81   54.5   0.44  4.35
# Upper middle income   9.54  39.4  3.12  26.31  52.53  -0.05  4.71
# 
# , , Overall, ODA
# 
#                       N/T        Mean              SD              Min             Max   Skew
# High income          1627  151,154554      415,406000      -512,730000  4.64666000e+09   5.29
# Low income           1798  544,223382      792,312970          -450000  1.11545600e+10    4.8
# Lower middle income  2378  680,100029  1.00278593e+09      -486,220000  1.12780600e+10   3.76
# Upper middle income  2533  289,108010      757,988522  -1.08038000e+09  2.45521800e+10  16.12
#                        Kurt
# High income           37.44
# Low income            40.14
# Lower middle income   24.57
# Upper middle income  445.01
# 
# , , Between, ODA
# 
#                      N/T        Mean          SD          Min             Max  Skew   Kurt
# High income           43  151,154554  335,970871    423846.15  2.16970133e+09  4.16  21.18
# Low income            33  544,223382  399,556253  59,763076.9  1.41753857e+09  1.02   2.84
# Lower middle income   47  680,100029  753,840926  26,981379.3  3.53258914e+09  2.04    7.1
# Upper middle income   55  289,108010  377,699701    10,907561  1.96011067e+09  2.17   7.37
# 
# , , Within, ODA
# 
#                        N/T        Mean          SD              Min             Max   Skew    Kurt
# High income          37.84  428,746468  244,306608      -923,883087  2.90570514e+09    2.3   30.24
# Low income           54.48  428,746468  684,189040      -944,301290  1.01926687e+10   4.31   44.85
# Lower middle income   50.6  428,746468  661,289258  -2.47969577e+09  1.07855444e+10   3.91   48.01
# Upper middle income  46.05  428,746468  657,183031  -2.18778866e+09  2.35093916e+10  19.46  630.58

Here it should be noted that any grouping is applied independently from the data-transformation: the data is first transformed, and grouped statistics are then calculated on the transformed data. The computation of statistics is very efficient. Here I summarize the extended life-expectancy series used in the benchmarks in Part 1:

Using the transformation functions and the functions pwcor and pwcov, we can also easily explore the aggregate correlation structure of the data:
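A sketch of what such an exploration could look like, reusing the pdata.frame pwlddev and the column positions 9:12 from the qsu call above (output omitted):

```r
# Pairwise correlations of the raw, between- and within-transformed series
pwcor(get_vars(pwlddev, 9:12))             # overall correlations
pwcor(fbetween(get_vars(pwlddev, 9:12)))   # between-country correlations
pwcor(fwithin(get_vars(pwlddev, 9:12)))    # within-country correlations
```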

The correlations show that the between (cross-country) relationships between these macro-variables are quite strong, but within countries the relationships are much weaker, for example there seems to be no significant relationship between GDP per Capita and either inequality or ODA received within countries over time.

2.2 Exploring Panel-Data in Matrix / Array Form

We can take a single panel-series such as GDP per Capita and explore it further:
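For example, psmat can reshape the series into a country x time matrix (a sketch, assuming pwlddev as defined above; output omitted):

```r
# Convert the panel-series to a country x year matrix
tsmat <- psmat(pwlddev$PCGDP)
str(tsmat)   # an object of class 'psmat': rows are countries, columns are years
```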

There is also a nice plot method for the panel-series arrays returned when psmat is applied to a panel-data.frame:

Above we have explored the cross-sectional relationship between the different national GDP series. Now we explore the time-dependence of the panel-vectors as a whole:

2.3 Panel- Auto-, Partial-Auto and Cross-Correlation Functions

The functions psacf, pspacf and psccf mimic stats::acf, stats::pacf and stats::ccf for panel-vectors and panel data.frames. Below I show the panel-series autocorrelation function of the data:
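For instance (a sketch, assuming the pdata.frame pwlddev from above; the plot and output are omitted):

```r
# Panel-ACF of life expectancy: with the default gscale = TRUE, each
# country's series is standardized before the covariances are computed
psacf(pwlddev$LIFEEX)
```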

The computation is conducted by first scaling and centering (i.e. standardizing) the panel-vectors by groups (using fscale, default argument gscale = TRUE), and then taking the covariance of each series with a matrix of properly computed panel-lags of itself (using flag), and dividing that by the variance of the overall series (using fvar).
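As a manual sketch of this computation for a single lag (simplified, ignoring small-sample corrections):

```r
x <- fscale(pwlddev$LIFEEX)   # standardize the series within each country
# autocorrelation at lag 1: covariance with the panel-lag over the overall variance
cov(x, flag(x, 1), use = "pairwise.complete.obs") / fvar(x)
```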

In a similar way we can compute the Partial-ACF (using a multivariate Yule-Walker decomposition on the ACF, as in stats::pacf):

and the panel-cross-correlation function between GDP per capita and life expectancy (which is already contained in the ACF plot above):
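A sketch of such a call, assuming the pseries method takes the two series as its first arguments (output omitted):

```r
# Panel cross-correlation between GDP per capita and life expectancy
psccf(pwlddev$PCGDP, pwlddev$LIFEEX)
```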

2.4 Testing for Individual Specific and Time-Effects

As a final step of exploration, we could analyze our series and simple models for the significance and explanatory power of individual or time-fixed effects, without going all the way to running a Hausman Test of fixed vs. random effects on a fully specified model. The main function here is fFtest, which computes a fast R-squared based F-test of exclusion restrictions on models potentially involving many factors. By default (argument full.df = TRUE) the degrees of freedom of the test are adjusted to make it identical to the F-statistic from regressing the series on a set of country and time dummies2.

Below I test the correlation between the country and time-means of GDP and Life-Expectancy:

We can also test for the significance of individual and time-fixed effects (or both) in the regression of GDP on life expectancy and ODA received:
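A sketch of what such tests could look like, assuming fFtest’s (y, exc, X) argument order, where exc holds the factor(s) whose exclusion is tested and X the covariates that are retained (output omitted):

```r
# Joint significance of country dummies in a regression of PCGDP on LIFEEX and ODA
fFtest(pwlddev$PCGDP, pwlddev$iso3c, get_vars(pwlddev, c("LIFEEX", "ODA")))

# Country and time dummies jointly (the index of a pdata.frame holds both factors)
fFtest(pwlddev$PCGDP, index(pwlddev), get_vars(pwlddev, c("LIFEEX", "ODA")))
```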

As can be expected in this cross-country data, individual and time-fixed effects play a large role in explaining the data, and these effects are correlated across series, suggesting that a fixed-effects model with both types of fixed-effects would be appropriate. To round things off, below I compute the Hausman test of Fixed vs. Random effects, which confirms these conclusions:

Part 3: Programming Panel-Data Estimators

A central goal of the collapse package is to facilitate advanced and fast programming with data. A prime area of application for the functions introduced above is to program efficient panel-data estimators. In this section I provide a short example of how this can be done. The application will be an implementation of the Hausman and Taylor (1981) estimator, considering a more general case than currently implemented in the plm package:

Following Hausman and Taylor (1981), but in a more general scenario, we have a linear panel-model of the form \[y_{it} = \beta_1X_{1it} + \beta_2X_{2it} + \beta_3Z_{1i} + \beta_4Z_{2i} + \alpha_i + \gamma_t + \epsilon_{it}\] where \(\alpha_i\) denotes unobserved individual specific effects and \(\gamma_t\) denotes unobserved global events. This model has up to 4 kinds of covariates:

Now the main problem arises from \(E[Z_{2i}\alpha_i] \neq 0\), which would usually prevent us from estimating \(\beta_4\), since taking a within-transformation (fixed effects) would remove \(Z_{2i}\) from the equation. Hausman and Taylor (1981) stipulated that since \(E[X_{1it}\alpha_i] = 0\), one could use \(X_{1i.}\), i.e. the between-transformed \(X_{1it}\), to instrument for \(Z_{2i}\). They propose an IV/2SLS estimation of the whole equation where the within-transformed covariates \(\tilde{X}_{1it}\) and \(\tilde{X}_{2it}\) are used to instrument \(X_{1it}\) and \(X_{2it}\), and \(X_{1i.}\) instruments \(Z_{2i}\). Assuming that missing values have been removed beforehand, and also taking into account the possibility that \(E[X_{1it}\gamma_t] \neq 0\) and \(E[X_{2it}\gamma_t] \neq 0\) (i.e. accounting for time fixed-effects), this estimator can be coded as follows:

HT_est <- function(y, X1, Z2, X2 = NULL, Z1 = NULL, time.FE = FALSE) {
  
  # Create matrix of independent variables
  X <- do.call(cbind, c(X1, X2, Z1, Z2)) 
  
  # Create instrument matrix: if time.FE, higher-order demean X1 and X2, else normal demeaning
  IVS <- do.call(cbind, c(if(time.FE) fHDwithin(X1, na.rm = FALSE) else fwithin(X1, na.rm = FALSE), 
                 if(is.null(X2)) X2 else if(time.FE) fHDwithin(X2, na.rm = FALSE) else fwithin(X2, na.rm = FALSE),
                 Z1, fbetween(X1, na.rm = FALSE)))
  
  if(length(IVS) == length(X)) { # The IV estimator case
    return(drop(solve(crossprod(IVS, X), crossprod(IVS, y))))
  } else { # The 2SLS case
    Xhat <- qr.fitted(qr(IVS), X)  # First stage
    return(drop(qr.coef(qr(Xhat), y)))   # Second stage
  }
}

The estimator is written in such a way that variables of the type \(X_{2it}\) and \(Z_{1i}\) are optional, and it also includes an option as to whether time fixed effects are also projected out or not. The expected inputs for \(X_{1it}\) (X1) and \(X_{2it}\) (X2) are column-subsets of a pdata.frame.

Having coded the estimator, it would be good to have an example to run it on. I have tried to squeeze an example out of the wlddev data used so far in this vignette. It is quite crappy and suffers from a weak-IV problem, but for the sake of illustration let’s do it: We want to estimate the panel-regression of life-expectancy on GDP per Capita, ODA received, the GINI index and a time-invariant dummy indicating whether the country is an OECD member. All variables except the dummy enter in logs, so this is an elasticity regression.

dat <- get_vars(wlddev, c("iso3c","year","OECD","PCGDP","LIFEEX","GINI","ODA"))
get_vars(dat, 4:7) <- log(get_vars(dat, 4:7))       # Taking logs of the data
dat$OECD <- as.numeric(dat$OECD)                    # Creating OECD dummy
dat <- pdata.frame(droplevels(na.omit(dat)),        # Creating Panel-data.frame, after removing missing values
                   index = c("iso3c", "year"))      # and dropping unused factor levels
pdim(dat)
# Unbalanced Panel: n = 132, T = 1-30, N = 918
pvar(dat)
# no time variation:       iso3c OECD 
# no individual variation: year

Using the GINI index cost a lot of observations and brought the sample size down from 12000 to under 1000, but the GINI index will be a key variable in what follows. Clearly the OECD dummy is time-invariant. Below I run Hausman-tests of fixed vs. random effects to determine which covariates might be correlated with the unobserved individual effects, and which model would be most appropriate.

# This tests whether each of the covariates is correlated with alpha_i
phtest(LIFEEX ~ PCGDP, dat)  # Likely correlated !
# 
#   Hausman Test
# 
# data:  LIFEEX ~ PCGDP
# chisq = 13.085, df = 1, p-value = 0.0002977
# alternative hypothesis: one model is inconsistent
phtest(LIFEEX ~ ODA, dat)    # Likely correlated !
# 
#   Hausman Test
# 
# data:  LIFEEX ~ ODA
# chisq = 41.803, df = 1, p-value = 1.009e-10
# alternative hypothesis: one model is inconsistent
phtest(LIFEEX ~ GINI, dat)   # Likely not correlated !!
# 
#   Hausman Test
# 
# data:  LIFEEX ~ GINI
# chisq = 1.3343, df = 1, p-value = 0.248
# alternative hypothesis: one model is inconsistent
phtest(LIFEEX ~ PCGDP + ODA + GINI, dat)  # Fixed Effects is the appropriate model for this regression
# 
#   Hausman Test
# 
# data:  LIFEEX ~ PCGDP + ODA + GINI
# chisq = 20.652, df = 3, p-value = 0.0001244
# alternative hypothesis: one model is inconsistent

The tests suggest that both GDP per Capita and ODA are correlated with country-specific unobservables affecting life-expectancy, and overall a fixed-effects model would be appropriate. However, the Hausman test on the GINI index fails to reject: Country specific unobservables affecting life-expectancy are not necessarily correlated with the level of inequality across countries.

Now if we want to include the OECD dummy in the regression, we cannot use fixed-effects as this would wipe out the dummy as well. If the dummy is uncorrelated with the country-specific unobservables affecting life-expectancy (the \(\alpha_i\)), then we could use a solution suggested by Mundlak (1978) and simply add between-transformed versions of PCGDP and ODA to the regression (in addition to PCGDP and ODA in levels), and so ‘control’ for the part of PCGDP and ODA correlated with the \(\alpha_i\) (in the IV literature this is known as the control-function approach to IV estimation). If however the OECD dummy is correlated with the \(\alpha_i\), then we need to use the Hausman and Taylor (1981) estimator. Below I suggest two methods of testing this correlation:

# Testing the correlation between OECD dummy and the Between-transformed Life-Expectancy (i.e. not accounting for other covariates)
cor.test(dat$OECD, B(dat$LIFEEX)) # -> Significant correlation of 0.21
# 
#   Pearson's product-moment correlation
# 
# data:  dat$OECD and B(dat$LIFEEX)
# t = 6.4945, df = 916, p-value = 1.364e-10
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.1471020 0.2708361
# sample estimates:
#       cor 
# 0.2098089
 
# Getting the fixed-effects (estimates of alpha_i) from the model (i.e. accounting for the other covariates)
fe <- fixef(plm(LIFEEX ~ PCGDP + ODA + GINI, dat, model = "within"))
mODA <- fmean(dat$ODA, dat$iso3c)
# Again testing the correlation
cor.test(fe, mODA[match(names(fe), names(mODA))]) # -> Not Significant.. but probably due to small sample size, the correlation is still 0.13
# 
#   Pearson's product-moment correlation
# 
# data:  fe and mODA[match(names(fe), names(mODA))]
# t = 1.4906, df = 130, p-value = 0.1385
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  -0.04217488  0.29399213
# sample estimates:
#       cor 
# 0.1296318

I interpret the test results as rejecting the hypothesis that the dummy is uncorrelated with \(\alpha_i\), thus we do have a case for Hausman and Taylor (1981) here: the OECD dummy is a \(Z_{2i}\) with \(E[Z_{2i}\alpha_i]\neq 0\). The Hausman tests above suggested that the GINI index is the only variable uncorrelated with \(\alpha_i\), thus GINI is \(X_{1it}\) with \(E[X_{1it}\alpha_i] = 0\). Finally PCGDP and ODA jointly constitute \(X_{2it}\), where the Hausman tests strongly suggested that \(E[X_{2it}\alpha_i] \neq 0\). We do not have a \(Z_{1i}\) in this setup, i.e. a time-invariant variable uncorrelated with the \(\alpha_i\).

The Hausman and Taylor (1981) estimator suggests that we should instrument the OECD dummy with \(X_{1i.}\), the between-transformed GINI index. Let us therefore run the regression of the dummy on this instrument to see if it would be a good (i.e. relevant) instrument:

# This computes the regression of OECD on the GINI instrument: Weak IV problem !!
fFtest(dat$OECD, B(dat$GINI))
#   R-Sq.     DF1     DF2 F-Stat. P-value 
#   0.000       1     916   0.212   0.645

The near-zero R-squared and the F-statistic of 0.21 suggest that the instrument is very weak indeed, rubbish to be precise. Thus the implementation of the HT estimator below is also a rubbish example, but it is still good for illustration purposes:

HT_est(y = dat$LIFEEX, 
       X1 = get_vars(dat, "GINI"), 
       Z2 = get_vars(dat, "OECD"),
       X2 = get_vars(dat, c("PCGDP","ODA"))) 
#         GINI        PCGDP          ODA         OECD 
# -0.021283719  0.119913000  0.004333494 47.531609898

Now a central question is of course: how computationally efficient is this estimator? Let us try to re-run it on the data generated for the benchmark in Part 1:

dat <- get_vars(data, c("iso3c","year","OECD","PCGDP","LIFEEX","GINI","ODA"))
get_vars(dat, 4:7) <- log(get_vars(dat, 4:7))       # Taking logs of the data
dat$OECD <- as.numeric(dat$OECD)                    # Creating OECD dummy
dat <- pdata.frame(droplevels(na.omit(dat)),        # Creating Panel-data.frame, after removing missing values
                   index = c("iso3c", "year"))      # and dropping unused factor levels
pdim(dat)
# Unbalanced Panel: n = 13200, T = 1-30, N = 91800
pvar(dat)
# no time variation:       iso3c OECD 
# no individual variation: year

library(microbenchmark)
microbenchmark(HT_est = HT_est(y = dat$LIFEEX,     # The estimator as before
                      X1 = get_vars(dat, "GINI"),
                      Z2 = get_vars(dat, "OECD"),
                      X2 = get_vars(dat, c("PCGDP","ODA"))),
              HT_est_TFE =  HT_est(y = dat$LIFEEX, # Also Projecting out Time-FE
                      X1 = get_vars(dat, "GINI"),
                      Z2 = get_vars(dat, "OECD"),
                      X2 = get_vars(dat, c("PCGDP","ODA")),
                      time.FE = TRUE))
# Unit: milliseconds
#        expr      min        lq     mean    median        uq      max neval cld
#      HT_est  7.53311  7.716073  8.95827  8.047635  8.494105 43.99112   100  a 
#  HT_est_TFE 33.88583 35.082670 37.23130 35.963117 37.448900 75.23338   100   b

At around 100,000 obs and 13,000 groups in an unbalanced panel, the computation, involving 3 grouped centering and 1 grouped averaging task as well as 2 list-to-matrix conversions and an IV procedure, took about 10 milliseconds with only individual effects, and about 40-45 milliseconds with individual and time-fixed effects (projected out iteratively). This should leave some room for running this on much larger data, and even for implementing bootstrap standard errors at this sample size.

References

Hausman J, Taylor W (1981). “Panel Data and Unobservable Individual Effects.” Econometrica, 49, 1377–1398.

Mundlak, Yair. 1978. “On the Pooling of Time Series and Cross Section Data.” Econometrica 46 (1): 69–85.


  1. Which of course amounts to the same as omitting the weights

  2. In fact factors are projected out using lfe::demeanlist and no regression is run at all