Introduction to collapse

Advanced and Fast Data Transformation in R

Sebastian Krantz

2020-03-12

collapse is a C/C++ based package for data manipulation in R. Its aims are

  1. to facilitate complex data transformation and exploration tasks and

  2. to help make R code fast, flexible, parsimonious and programmer friendly.

This vignette demonstrates these two points and introduces all of the main features of the package. Apart from this vignette, collapse comes with built-in structured documentation available under help("collapse-documentation") after installing the package, and help("collapse-package") provides a compact set of examples for a quick start. The two other vignettes focus on the integration of collapse with dplyr workflows (highly recommended for dplyr / tidyverse users), and on the integration of collapse with the plm package (+ some advanced programming with panel data).


1. Data and Summary Statistics

This vignette utilizes the 2 datasets that come with collapse: wlddev and GGDC10S, as well as a few datasets from base R: mtcars, iris, airquality, and the time series AirPassengers and EuStockMarkets. Below I introduce wlddev and GGDC10S and summarize them using qsu (quick-summary), as I will not spend much time explaining these datasets in the remainder of the vignette. You may choose to skip this section and start with Section 2.

1.1. World Bank Development Data

This dataset contains 4 key World Bank Development Indicators covering 216 countries over 59 years. It is a balanced panel with \(216 \times 59 = 12744\) observations.

library(collapse)

head(wlddev)
#       country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA
# 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.292   NA 114440000
# 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.742   NA 233350000
# 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.185   NA 114880000
# 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA 33.624   NA 236450000
# 5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA 34.060   NA 302480000
# 6 Afghanistan   AFG 1966-01-01 1965   1960 South Asia Low income FALSE    NA 34.495   NA 370250000

# The variables have "label" attributes. Use vlabels() to get and set labels
namlab(wlddev, class = TRUE)
#    Variable     Class                                   Label
# 1   country character                            Country Name
# 2     iso3c    factor                            Country Code
# 3      date      Date              Date Recorded (Fictitious)
# 4      year   integer                                    Year
# 5    decade   numeric                                  Decade
# 6    region    factor                                  Region
# 7    income    factor                            Income Level
# 8      OECD   logical                 Is OECD Member Country?
# 9     PCGDP   numeric      GDP per capita (constant 2010 US$)
# 10   LIFEEX   numeric Life expectancy at birth, total (years)
# 11     GINI   numeric        GINI index (World Bank estimate)
# 12      ODA   numeric    Net ODA received (constant 2015 US$)

# This counts the number of non-missing values, more in section 2
fNobs(wlddev)
# country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX    GINI     ODA 
#   12744   12744   12744   12744   12744   12744   12744   12744    8995   11068    1356    8336

# This counts the number of distinct values, more in section 2
fNdistinct(wlddev)
# country   iso3c    date    year  decade  region  income    OECD   PCGDP  LIFEEX    GINI     ODA 
#     216     216      59      59       7       7       4       2    8995   10048     363    7564

# The countries included:
cat(levels(wlddev$iso3c))
# ABW AFG AGO ALB AND ARE ARG ARM ASM ATG AUS AUT AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BLZ BMU BOL BRA BRB BRN BTN BWA CAF CAN CHE CHI CHL CHN CIV CMR COD COG COL COM CPV CRI CUB CUW CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ERI ESP EST ETH FIN FJI FRA FRO FSM GAB GBR GEO GHA GIB GIN GMB GNB GNQ GRC GRD GRL GTM GUM GUY HKG HND HRV HTI HUN IDN IMN IND IRL IRN IRQ ISL ISR ITA JAM JOR JPN KAZ KEN KGZ KHM KIR KNA KOR KWT LAO LBN LBR LBY LCA LIE LKA LSO LTU LUX LVA MAC MAF MAR MCO MDA MDG MDV MEX MHL MKD MLI MLT MMR MNE MNG MNP MOZ MRT MUS MWI MYS NAM NCL NER NGA NIC NLD NOR NPL NRU NZL OMN PAK PAN PER PHL PLW PNG POL PRI PRT PRY PSE PYF QAT ROU RUS RWA SAU SDN SEN SGP SLB SLE SLV SMR SOM SRB SSD STP SUR SVK SVN SWE SWZ SXM SYC SYR TCA TCD TGO THA TJK TKM TLS TON TTO TUN TUR TUV TZA UGA UKR URY USA UZB VCT VEN VGB VIR VNM VUT WSM XKX YEM ZAF ZMB ZWE

# use descr(wlddev) for a more detailed description of each variable

Among the categorical identifiers, the date variable was artificially generated so that this example dataset contains all common data types frequently encountered in R.

Below I show how this data can be properly summarized using the function qsu. qsu is shorthand for quick-summary and was inspired by the summarize and xtsummarize commands in Stata. Since wlddev is a panel dataset, we would normally like to obtain statistics not just on the overall variation in the data, but also on the variation between country averages vs. the variation within countries over time. We might also be interested in higher moments such as the skewness and the kurtosis. Such a summary is easily implemented using qsu:
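
A sketch of such a call (pid supplies the panel identifier, and higher = TRUE adds the skewness and kurtosis, mirroring the qsu call on GGDC10S further below):

qsu(wlddev, pid = wlddev$iso3c, higher = TRUE)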

The result is a 3D array of statistics, which can also be subsetted ([) or permuted using aperm(). For each variable, statistics are computed on the Overall (raw) data and on the Between-country and Within-country transformed data.

The statistics show that year is individual-invariant (evident from the 0 Between-country standard deviation), that we have GINI data on only 161 countries, with on average only 8.42 observations per country, and that PCGDP, LIFEEX and GINI vary more between countries, while ODA received varies more within countries over time. It is a common pattern that the kurtosis increases in within-transformed data, while the skewness decreases in most cases.

Note: Other distributional statistics like the median and quantiles are currently not implemented, for reasons having to do with computation speed (qsu is more than 10x faster than base::summary and suitable for really large panels) and with the algorithm behind qsu, but they might come in a future update of qsu.

1.2. GGDC 10-Sector Database

The Groningen Growth and Development Centre 10-Sector Database provides long-run data on sectoral productivity performance in Africa, Asia, and Latin America. Variables covered in the data set are annual series of value added (VA, in local currency), and persons employed (EMP) for 10 broad sectors.

head(GGDC10S)
# # A tibble: 6 x 16
#   Country Regioncode Region Variable  Year   AGR   MIN    MAN     PU    CON   WRT   TRA  FIRE   GOV
#   <chr>   <chr>      <chr>  <chr>    <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA     SSA        Sub-s~ VA        1960  NA   NA    NA     NA     NA     NA    NA    NA    NA   
# 2 BWA     SSA        Sub-s~ VA        1961  NA   NA    NA     NA     NA     NA    NA    NA    NA   
# 3 BWA     SSA        Sub-s~ VA        1962  NA   NA    NA     NA     NA     NA    NA    NA    NA   
# 4 BWA     SSA        Sub-s~ VA        1963  NA   NA    NA     NA     NA     NA    NA    NA    NA   
# 5 BWA     SSA        Sub-s~ VA        1964  16.3  3.49  0.737  0.104  0.660  6.24  1.66  1.12  4.82
# 6 BWA     SSA        Sub-s~ VA        1965  15.7  2.50  1.02   0.135  1.35   7.06  1.94  1.25  5.70
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>

namlab(GGDC10S, class = TRUE)
#      Variable     Class                                                 Label
# 1     Country character                                               Country
# 2  Regioncode character                                           Region code
# 3      Region character                                                Region
# 4    Variable character                                              Variable
# 5        Year   numeric                                                  Year
# 6         AGR   numeric                                          Agriculture 
# 7         MIN   numeric                                                Mining
# 8         MAN   numeric                                         Manufacturing
# 9          PU   numeric                                             Utilities
# 10        CON   numeric                                          Construction
# 11        WRT   numeric                         Trade, restaurants and hotels
# 12        TRA   numeric                  Transport, storage and communication
# 13       FIRE   numeric Finance, insurance, real estate and business services
# 14        GOV   numeric                                   Government services
# 15        OTH   numeric               Community, social and personal services
# 16        SUM   numeric                               Summation of sector GDP

fNobs(GGDC10S)
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#       5027       5027       5027       5027       5027       4364       4355       4355       4354 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#       4355       4355       4355       4355       3482       4248       4364

fNdistinct(GGDC10S)
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#         43          6          6          2         67       4353       4224       4353       4237 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#       4339       4344       4334       4349       3470       4238       4364

# The countries included:
cat(funique(GGDC10S$Country, ordered = TRUE))
# ARG BOL BRA BWA CHL CHN COL CRI DEW DNK EGY ESP ETH FRA GBR GHA HKG IDN IND ITA JPN KEN KOR MEX MOR MUS MWI MYS NGA NGA(alt) NLD PER PHL SEN SGP SWE THA TWN TZA USA VEN ZAF ZMB

# use descr(GGDC10S) for a more detailed description of each variable

The first problem in summarizing this data is that value added (VA) is in local currency; the second is that it contains 2 different variables (VA and EMP) stacked in the same column. One way of solving the first problem is converting the data to percentages by dividing by the overall VA and EMP contained in the last column. A different solution involving grouped scaling is introduced in section 4.4. The second problem is nicely handled by qsu, which can also compute panel statistics by groups.

# Converting data to percentages of overall VA / EMP
pGGDC10S <- sweep(GGDC10S[6:15], 1, GGDC10S$SUM, "/") * 100
# Summarizing the sectoral data by variable, overall, between and within countries
su <- qsu(pGGDC10S, by = GGDC10S$Variable, pid = GGDC10S[c("Variable","Country")], higher = TRUE) 

# This gives a 4D array of summary statistics
str(su)
#  'qsu' num [1:2, 1:7, 1:3, 1:10] 2225 2139 35.1 17.3 26.7 ...
#  - attr(*, "dimnames")=List of 4
#   ..$ : chr [1:2] "EMP" "VA"
#   ..$ : chr [1:7] "N/T" "Mean" "SD" "Min" ...
#   ..$ : chr [1:3] "Overall" "Between" "Within"
#   ..$ : chr [1:10] "AGR" "MIN" "MAN" "PU" ...

# Permuting this array to a more readable format
aperm(su, c(4,2,3,1))
# , , Overall, EMP
# 
#        N/T   Mean     SD   Min    Max   Skew   Kurt
# AGR   2225  35.09  26.72  0.16    100   0.49    2.1
# MIN   2216   1.03   1.42     0   9.41   3.13  15.04
# MAN   2216  14.98   8.04  0.58   45.3   0.43   2.85
# PU    2215   0.58   0.36  0.02   2.48   1.26   5.58
# CON   2216   5.66   2.93  0.14  15.99  -0.06   2.27
# WRT   2216  14.92   6.56  0.81   32.8  -0.18   2.32
# TRA   2216   4.82   2.65  0.15  15.05   0.95   4.47
# FIRE  2216   4.65   4.35  0.08  21.77   1.23   4.08
# GOV   1780  13.13   8.08     0  34.89   0.63   2.53
# OTH   2109    8.4   6.64  0.42  34.89    1.4   4.32
# 
# , , Between, EMP
# 
#       N/T   Mean     SD   Min    Max   Skew   Kurt
# AGR    42  35.09  24.12     1  88.33   0.52   2.24
# MIN    42   1.03   1.23  0.03   6.85   2.73  12.33
# MAN    42  14.98   7.04  1.72  32.34  -0.02   2.43
# PU     42   0.58    0.3  0.07   1.32   0.55   2.69
# CON    42   5.66   2.47   0.5  10.37  -0.44   2.33
# WRT    42  14.92   5.26     4  26.77  -0.55   2.73
# TRA    42   4.82   2.47  0.37  12.39   0.98   4.79
# FIRE   42   4.65   3.45  0.15  12.44   0.61   2.59
# GOV    34  13.13   7.28  2.01  29.16   0.39   2.11
# OTH    40    8.4   6.27  1.35   26.4   1.43   4.32
# 
# , , Within, EMP
# 
#         N/T   Mean    SD    Min     Max   Skew   Kurt
# AGR   52.98  26.38  11.5  -5.32  107.49    1.6  11.97
# MIN   52.76    3.4  0.72  -1.41    7.51   -0.2  15.03
# MAN   52.76  17.48  3.89  -1.11    40.4  -0.08    7.4
# PU    52.74   1.39  0.19   0.63    2.55   0.57   7.85
# CON   52.76   5.76  1.56    0.9   12.97   0.31   4.12
# WRT   52.76  15.76  3.91   3.74   29.76   0.33   3.34
# TRA   52.76   6.35  0.96   2.35   11.11   0.27   5.72
# FIRE  52.76   5.82  2.66  -2.98      16   0.55   4.03
# GOV   52.35  13.26  3.51   -2.2   23.61  -0.56   4.73
# OTH   52.73   7.39   2.2  -2.33   17.44   0.29   6.46
# 
# , , Overall, VA
# 
#        N/T   Mean     SD      Min    Max   Skew   Kurt
# AGR   2139  17.31  15.51     0.03  95.22   1.33   4.88
# MIN   2139   5.85    9.1        0  59.06   2.72  10.92
# MAN   2139  20.07      8     0.98  41.63  -0.03   2.68
# PU    2139   2.23   1.11        0   9.19   0.89   6.24
# CON   2139   5.87   2.51      0.6  25.86    1.5   8.96
# WRT   2139  16.63   5.14     4.52  39.76   0.35   3.27
# TRA   2139   7.93   3.11      0.8  25.96   1.01   5.71
# FIRE  2139   7.04  12.71  -151.07  39.17  -6.23  59.87
# GOV   1702  13.41   6.35     0.76  32.51   0.49    2.9
# OTH   2139    6.4   5.84     0.23  31.45    1.5   4.21
# 
# , , Between, VA
# 
#       N/T   Mean     SD     Min    Max   Skew  Kurt
# AGR    43  17.31  13.19    0.61  63.84   1.13  4.71
# MIN    43   5.85   7.57    0.05  27.92   1.71  4.81
# MAN    43  20.07   6.64    4.19  32.11  -0.36  2.62
# PU     43   2.23   0.75    0.45   4.31   0.62  3.87
# CON    43   5.87   1.85    2.94  12.93   1.33   6.5
# WRT    43  16.63   4.38    8.42  26.39   0.29  2.46
# TRA    43   7.93   2.72    2.04  14.89   0.64  3.67
# FIRE   43   7.04   9.03  -35.61  23.87  -2.67  15.1
# GOV    35  13.41   5.87    1.98  27.77   0.52  3.04
# OTH    43    6.4   5.61    1.12  19.53   1.33   3.2
# 
# , , Within, VA
# 
#         N/T   Mean    SD      Min    Max   Skew   Kurt
# AGR   49.74  26.38  8.15     5.24  94.35   1.23   9.53
# MIN   49.74    3.4  5.05   -20.05  35.71   0.34   13.1
# MAN   49.74  17.48  4.46     1.12  36.35  -0.19   3.93
# PU    49.74   1.39  0.82    -1.09   6.27   0.53   5.35
# CON   49.74   5.76   1.7    -0.35  18.69   0.75   6.38
# WRT   49.74  15.76  2.69     4.65  32.67   0.23    4.5
# TRA   49.74   6.35   1.5     0.92   18.6    0.7  10.11
# FIRE  49.74   5.82  8.94  -109.63  54.12  -2.77   54.6
# GOV   48.63  13.26  2.42     5.12  22.85   0.17   3.31
# OTH   49.74   7.39  1.62    -0.92  19.31   0.73   9.66

The statistics show that the dataset is very consistent: Employment data cover 42 countries and 53 time-periods in almost all sectors. Agriculture is the largest sector in terms of employment, amounting to a 35% share of employment across countries and time, with a standard deviation (SD) of around 27%. The between-country SD in agricultural employment share is 24% and the within SD is 12%, indicating that processes of structural change are very gradual and most of the variation in structure is between countries. The next largest sectors after agriculture are manufacturing, wholesale and retail trade and government, each claiming an approx. 15% share of the economy. In these sectors the between-country SD is also about twice as large as the within-country SD.

In terms of value added, the data covers 43 countries in 50 time-periods. Agriculture, manufacturing, wholesale and retail trade and government are also the largest sectors in terms of VA, but with a diminished agricultural share (around 17%) and a greater share for manufacturing (around 20%). The variation between countries is again greater than the variation within countries, but it seems that at least in terms of agricultural VA share there is also a considerable within-country SD of 8%. This is also true for the finance and real estate sector with a within SD of 9%, suggesting (using a bit of common sense) that a diminishing VA share in agriculture and increased VA share in finance and real estate was a pattern characterizing most of the countries in this sample.

I note that these two examples have not yet exhausted the capabilities of qsu, which can also compute weighted versions of all the above statistics and output a list of matrices instead of a higher-dimensional array. It is of course also possible to compute conventional and weighted statistics on cross-sectional data using qsu.

As a final step I introduce a plot function which can be used to plot the structural transformation of any supported country. Below I do so for Tanzania.
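
A minimal sketch of such a function, using the sector shares pGGDC10S computed above and base graphics (this is not the original vignette code, just an illustration; the helper name plotGGDC_sketch is hypothetical):

# Plot sectoral shares over time for one country, one panel each for VA and EMP
plotGGDC_sketch <- function(ctry) {
  sel <- GGDC10S$Country == ctry
  d <- cbind(GGDC10S[sel, c("Variable", "Year")], pGGDC10S[sel, ])
  oldpar <- par(mfrow = c(1, 2)); on.exit(par(oldpar))
  for (v in c("VA", "EMP")) {
    dv <- d[d$Variable == v, ]
    matplot(dv$Year, dv[, -(1:2)], type = "l", lty = 1, col = 1:10,
            xlab = "Year", ylab = "Share (%)", main = paste(ctry, v))
  }
}
plotGGDC_sketch("TZA")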

2. Advanced Data Programming

A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct), which are able to dramatically speed up column-wise, grouped and weighted computations on vectors, matrices or data.frames. The basic syntax common to all of these functions is:

FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)

where x is a vector, matrix or data.frame, g takes groups supplied as a vector, factor, list of vectors or GRP object, and w takes a weight vector (available only to fmean, fmode, fvar and fsd). TRA can be used to transform x using the computed statistics and one of 8 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*", discussed in section 4.3). na.rm efficiently removes missing values and is TRUE by default. use.g.names = TRUE generates new row-names from the unique groups supplied to g, and drop = TRUE returns a vector when performing simple (non-grouped) computations on matrix or data.frame columns.
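
To make the roles of the g, w and TRA arguments concrete, here is a small sketch (using car weight as a purely illustrative weight; the examples below develop all of this step by step):

fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt)                  # one weighted mean per group
head(fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt, TRA = "-")) # subtract weighted group means (demeaning)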

With that in mind, let’s start with some simple examples. To calculate the mean of each column in a data.frame or matrix, it is sufficient to type:

fmean(mtcars)
#        mpg        cyl       disp         hp       drat         wt       qsec         vs         am 
#  20.090625   6.187500 230.721875 146.687500   3.596562   3.217250  17.848750   0.437500   0.406250 
#       gear       carb 
#   3.687500   2.812500

fmean(mtcars, drop = FALSE)  # This returns a 1-row data-frame
#        mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
# 1 20.09062 6.1875 230.7219 146.6875 3.596562 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125

m <- qM(mtcars) # This quickly converts objects to matrices
fmean(m)
#        mpg        cyl       disp         hp       drat         wt       qsec         vs         am 
#  20.090625   6.187500 230.721875 146.687500   3.596562   3.217250  17.848750   0.437500   0.406250 
#       gear       carb 
#   3.687500   2.812500

fmean(m, drop = FALSE)  # This returns a 1-row matrix
#        mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
# 1 20.09062 6.1875 230.7219 146.6875 3.596562 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125

It is also possible to calculate fast groupwise statistics, by simply passing grouping vectors or lists of grouping vectors to the fast functions:

fmean(mtcars, mtcars$cyl)
#        mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear     carb
# 4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
# 6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
# 8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000

fmean(mtcars, mtcars[c("cyl","vs","am")])
#            mpg cyl     disp        hp     drat       wt     qsec vs am     gear     carb
# 4.0.1 26.00000   4 120.3000  91.00000 4.430000 2.140000 16.70000  0  1 5.000000 2.000000
# 4.1.0 22.90000   4 135.8667  84.66667 3.770000 2.935000 20.97000  1  0 3.666667 1.666667
# 4.1.1 28.37143   4  89.8000  80.57143 4.148571 2.028286 18.70000  1  1 4.142857 1.428571
# 6.0.1 20.56667   6 155.0000 131.66667 3.806667 2.755000 16.32667  0  1 4.333333 4.666667
# 6.1.0 19.12500   6 204.5500 115.25000 3.420000 3.388750 19.21500  1  0 3.500000 2.500000
# 8.0.0 15.05000   8 357.6167 194.16667 3.120833 4.104083 17.14250  0  0 3.000000 3.083333
# 8.0.1 15.40000   8 326.0000 299.50000 3.880000 3.370000 14.55000  0  1 5.000000 6.000000

In the example above we might be inclined to remove the grouping columns from the output, as the unique row-names already indicate the combination of grouping variables. This can be done in a safe and more efficient way using get_vars:

# Getting column indices [same as match(c("cyl","vs","am"), names(mtcars)) but gives error if non-matched]
ind <- get_vars(mtcars, c("cyl","vs","am"), return = "indices")

# Subsetting columns with get_vars is 2x faster than [.data.frame
fmean(get_vars(mtcars, -ind), get_vars(mtcars, ind))
#            mpg     disp        hp     drat       wt     qsec     gear     carb
# 4.0.1 26.00000 120.3000  91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
# 4.1.0 22.90000 135.8667  84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
# 4.1.1 28.37143  89.8000  80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
# 6.0.1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
# 6.1.0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
# 8.0.0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
# 8.0.1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000

get_vars also subsets data.table columns and other data.frame-like classes, and is about 2x the speed of [.data.frame. Replacements of the form get_vars(data, ind) <- newcols are about 4x as fast as data[ind] <- newcols. It is also possible to subset with functions, e.g. get_vars(mtcars, is.ordered), and with regular expressions, e.g. get_vars(mtcars, c("c","v","a"), regex = TRUE) or get_vars(mtcars, "c|v|a", regex = TRUE). Next to get_vars there are also the functions num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to subset and replace data by type, as sketched below.
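
A few illustrative sketches of these selection helpers (the replacement at the end is a hypothetical rescaling, shown only to demonstrate the mechanism):

head(num_vars(wlddev))                               # numeric columns only
head(get_vars(wlddev, is.factor))                    # select columns with a function
head(get_vars(wlddev, "year|date", regex = TRUE))    # select columns via a regular expression

tmp <- wlddev
get_vars(tmp, c("PCGDP", "ODA")) <- get_vars(tmp, c("PCGDP", "ODA")) / 1000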

This programming can become even more efficient when passing factors or grouping objects to the g argument. qF efficiently turns atomic vectors into factors, and the GRP function creates grouping objects (of class GRP) from vectors or lists of columns. By default, both are ordered, but they need not be. For multiple variables, GRP is always superior to creating multiple factors and interacting them, and it is also faster than base::interaction for lists of factors.

# This creates an (ordered) factor, about 10x faster than as.factor(mtcars$cyl)
f <- qF(mtcars$cyl, na.exclude = FALSE)
str(f)
#  Ord.factor w/ 3 levels "4"<"6"<"8": 2 2 1 2 3 2 3 1 1 2 ...

# This creates a 'GRP' object. Grouping is done via radix ordering in C (using data.table's forder function)
g <- GRP(mtcars, ~ cyl + vs + am) # Using the formula interface, could also use c("cyl","vs","am") or c(2,8:9)
g
# collapse grouping object of length 32 with 7 ordered groups
# 
# Call: GRP.default(X = mtcars, by = ~cyl + vs + am), unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   1.000   2.500   3.000   4.571   5.500  12.000 
# 
# Groups with sizes: 
# 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1 
#     1     3     7     3     4    12     2
plot(g)

With factors or GRP objects, computations are faster since the fast functions would otherwise internally group the vectors every time they are executed. Compared to factors, grouped computations using GRP objects are a bit more efficient, primarily because they require no further checks, while factors are checked for missing values unless a class ‘na.included’ is attached. By default qF acts just like as.factor and preserves missing values when generating factors. Therefore the most effective way of programming with factors is to use qF(x, na.exclude = FALSE) to create the factor. This will create an underlying integer code for NA’s and attach the class ‘na.included’, so that no further checks are run on that factor in the collapse ecosystem.
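
A quick sketch of this behaviour (the exact level labels may differ; the point is that NA receives its own integer code and the ‘na.included’ class is attached, as described above):

f2 <- qF(c(1, 2, NA, 2), na.exclude = FALSE)
class(f2)    # should contain 'na.included'
levels(f2)   # NA appears as an own level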

Using the factor f and the GRP object g created above, it is easy to compute over the same groups with multiple functions:

dat <- get_vars(mtcars, -ind)

# Grouped mean
fmean(dat, f)
#        mpg     disp        hp     drat       wt     qsec     gear     carb
# 4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 4.090909 1.545455
# 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 3.857143 3.428571
# 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 3.285714 3.500000

# Grouped standard-deviation
fsd(dat, f)
#        mpg     disp       hp      drat        wt     qsec      gear     carb
# 4 4.509828 26.87159 20.93453 0.3654711 0.5695637 1.682445 0.5393599 0.522233
# 6 1.453567 41.56246 24.26049 0.4760552 0.3563455 1.706866 0.6900656 1.812654
# 8 2.560048 67.77132 50.97689 0.3723618 0.7594047 1.196014 0.7262730 1.556624

fsd(dat, g)
#             mpg      disp       hp      drat        wt       qsec      gear      carb
# 4.0.1        NA        NA       NA        NA        NA         NA        NA        NA
# 4.1.0 1.4525839 13.969371 19.65536 0.1300000 0.4075230 1.67143651 0.5773503 0.5773503
# 4.1.1 4.7577005 18.802128 24.14441 0.3783926 0.4400840 0.94546285 0.3779645 0.5345225
# 6.0.1 0.7505553  8.660254 37.52777 0.1616581 0.1281601 0.76872188 0.5773503 1.1547005
# 6.1.0 1.6317169 44.742634  9.17878 0.5919459 0.1162164 0.81590441 0.5773503 1.7320508
# 8.0.0 2.7743959 71.823494 33.35984 0.2302749 0.7683069 0.80164745 0.0000000 0.9003366
# 8.0.1 0.5656854 35.355339 50.20458 0.4808326 0.2828427 0.07071068 0.0000000 2.8284271

Now suppose we wanted to create a new dataset which contains the mean, sd, min and max of the variables mpg and disp grouped by cyl, vs and am:

dat <- get_vars(mtcars, c("mpg", "disp"))

# add_stub is a collapse function to add a prefix (default) or suffix to column names
cbind(add_stub(fmean(dat, g), "mean_"),
      add_stub(fsd(dat, g), "sd_"), 
      add_stub(fmin(dat, g), "min_"),
      add_stub(fmax(dat, g), "max_"))
#       mean_mpg mean_disp    sd_mpg   sd_disp min_mpg min_disp max_mpg max_disp
# 4.0.1 26.00000  120.3000        NA        NA    26.0    120.3    26.0    120.3
# 4.1.0 22.90000  135.8667 1.4525839 13.969371    21.5    120.1    24.4    146.7
# 4.1.1 28.37143   89.8000 4.7577005 18.802128    21.4     71.1    33.9    121.0
# 6.0.1 20.56667  155.0000 0.7505553  8.660254    19.7    145.0    21.0    160.0
# 6.1.0 19.12500  204.5500 1.6317169 44.742634    17.8    167.6    21.4    258.0
# 8.0.0 15.05000  357.6167 2.7743959 71.823494    10.4    275.8    19.2    472.0
# 8.0.1 15.40000  326.0000 0.5656854 35.355339    15.0    301.0    15.8    351.0

We could also calculate groupwise frequency-weighted means and standard deviations using a weight vector, and we could decide to include the original grouping columns and omit the generated row-names, as shown below.

There is also a collapse function add_vars which serves as a much faster and more versatile alternative to cbind.data.frame. The intention behind add_vars is to be able to efficiently add multiple columns to an existing data.frame. Thus in a call add_vars(data, newcols1, newcols2), newcols1 and newcols2 are added (by default) at the end of data, while preserving all attributes of data.

# This generates a random vector of weights
weights <- abs(rnorm(nrow(mtcars)))

# Grouped and weighted mean and sd and grouped min and max, combined using add_vars
add_vars(g[["groups"]],
         add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
         add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"), 
         add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
         add_stub(fmax(dat, g, use.g.names = FALSE), "max_"))
#   cyl vs am w_mean_mpg w_mean_disp w_sd_mpg w_sd_disp min_mpg min_disp max_mpg max_disp
# 1   4  0  1   26.00000   120.30000 0.000000   0.00000    26.0    120.3    26.0    120.3
# 2   4  1  0   22.77276   138.51716 1.707875  18.72771    21.5    120.1    24.4    146.7
# 3   4  1  1   29.52737    81.64415 4.674793  16.42655    21.4     71.1    33.9    121.0
# 4   6  0  1   20.52959   154.57224 1.194314  13.78055    19.7    145.0    21.0    160.0
# 5   6  1  0   18.47185   208.18111 1.438912  42.94401    17.8    167.6    21.4    258.0
# 6   8  0  0   15.46451   335.07016 2.182173  65.12019    10.4    275.8    19.2    472.0
# 7   8  0  1   15.27441   318.15046 0.736511  46.03194    15.0    301.0    15.8    351.0

We can also use add_vars to bind columns in a different order than they are passed in. Specifying add_vars(data, newcols1, newcols2, pos = "front") would be equivalent to add_vars(newcols1, newcols2, data) while keeping the attributes of data. Moreover, it is also possible to pass a vector of positions that the new columns should have in the combined data:

# Binding and reordering columns in a single step: Add columns in specific positions 
add_vars(g[["groups"]],
         add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
         add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"), 
         add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
         add_stub(fmax(dat, g, use.g.names = FALSE), "max_"), 
         pos = c(4,8,5,9,6,10,7,11))
#   cyl vs am w_mean_mpg w_sd_mpg min_mpg max_mpg w_mean_disp w_sd_disp min_disp max_disp
# 1   4  0  1   26.00000 0.000000    26.0    26.0   120.30000   0.00000    120.3    120.3
# 2   4  1  0   22.77276 1.707875    21.5    24.4   138.51716  18.72771    120.1    146.7
# 3   4  1  1   29.52737 4.674793    21.4    33.9    81.64415  16.42655     71.1    121.0
# 4   6  0  1   20.52959 1.194314    19.7    21.0   154.57224  13.78055    145.0    160.0
# 5   6  1  0   18.47185 1.438912    17.8    21.4   208.18111  42.94401    167.6    258.0
# 6   8  0  0   15.46451 2.182173    10.4    19.2   335.07016  65.12019    275.8    472.0
# 7   8  0  1   15.27441 0.736511    15.0    15.8   318.15046  46.03194    301.0    351.0

As a final layer of added complexity, we could utilize the TRA argument to generate weighted group-demeaned and group-scaled data, with additional columns giving the group minimum and maximum values:

head(add_vars(get_vars(mtcars, ind),
              add_stub(fmean(dat, g, weights, "-"), "w_demean_"), # This calculates weighted group means and uses them to demean the data
              add_stub(fsd(dat, g, weights, "/"), "w_scale_"),    # This calculates weighted group sd's and uses them to scale the data
              add_stub(fmin(dat, g, "replace"), "min_"),          # This replaces all observations by their group-minimum
              add_stub(fmax(dat, g, "replace"), "max_")))         # This replaces all observations by their group-maximum
#                   cyl vs am w_demean_mpg w_demean_disp w_scale_mpg w_scale_disp min_mpg min_disp
# Mazda RX4           6  0  1    0.4704056      5.427756   17.583310    11.610567    19.7    145.0
# Mazda RX4 Wag       6  0  1    0.4704056      5.427756   17.583310    11.610567    19.7    145.0
# Datsun 710          4  1  1   -6.7273707     26.355848    4.877221     6.574723    21.4     71.1
# Hornet 4 Drive      6  1  0    2.9281456     49.818890   14.872349     6.007823    17.8    167.6
# Hornet Sportabout   8  0  0    3.2354853     24.929837    8.569441     5.528239    10.4    275.8
# Valiant             6  1  0   -0.3718544     16.818890   12.578950     5.239380    17.8    167.6
#                   max_mpg max_disp
# Mazda RX4            21.0      160
# Mazda RX4 Wag        21.0      160
# Datsun 710           33.9      121
# Hornet 4 Drive       21.4      258
# Hornet Sportabout    19.2      472
# Valiant              21.4      258

It is also possible to use add_vars<- to add these columns to mtcars itself. The default option would add them at the end, but we could also specify positions:

# This defines the positions where we want to add these columns
pos <- c(2,8,3,9,4,10,5,11)

add_vars(mtcars, pos) <- c(add_stub(fmean(dat, g, weights, "-"), "w_demean_"),
                           add_stub(fsd(dat, g, weights, "/"), "w_scale_"), 
                           add_stub(fmin(dat, g, "replace"), "min_"),
                           add_stub(fmax(dat, g, "replace"), "max_"))
head(mtcars)
#                    mpg w_demean_mpg w_scale_mpg min_mpg max_mpg cyl disp w_demean_disp w_scale_disp
# Mazda RX4         21.0    0.4704056   17.583310    19.7    21.0   6  160      5.427756    11.610567
# Mazda RX4 Wag     21.0    0.4704056   17.583310    19.7    21.0   6  160      5.427756    11.610567
# Datsun 710        22.8   -6.7273707    4.877221    21.4    33.9   4  108     26.355848     6.574723
# Hornet 4 Drive    21.4    2.9281456   14.872349    17.8    21.4   6  258     49.818890     6.007823
# Hornet Sportabout 18.7    3.2354853    8.569441    10.4    19.2   8  360     24.929837     5.528239
# Valiant           18.1   -0.3718544   12.578950    17.8    21.4   6  225     16.818890     5.239380
#                   min_disp max_disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4            145.0      160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag        145.0      160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710            71.1      121  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive       167.6      258 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout    275.8      472 175 3.15 3.440 17.02  0  0    3    2
# Valiant              167.6      258 105 2.76 3.460 20.22  1  0    3    1
rm(mtcars)

The examples above could be made more involved using the full set of Fast Statistical Functions, and by also employing the vector-valued functions and operators (fscale/STD, fbetween/B, fwithin/W, fHDbetween/HDB, fHDwithin/HDW, flag/L/F, fdiff/D, fgrowth/G) discussed later. They merely provide suggestions for the use of these features and are focused on programming with data.frames (as the functions get_vars, add_vars etc. are made for data.frames). The Fast Statistical Functions however work equally well on vectors and matrices. Not really discussed so far was the set of functions qDF, qDT and qM, which deliver very fast conversions between matrices, data.frames and data.tables.
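
To give a flavour of the vector-valued functions named above, here are a few sketches on single wlddev columns (the operator aliases in parentheses refer to the list above; grouped transformations are treated in detail in section 4):

head(fwithin(wlddev$PCGDP, wlddev$iso3c))                             # centered within countries (W)
head(fscale(wlddev$PCGDP, wlddev$iso3c))                              # scaled and centered within countries (STD)
head(flag(wlddev$LIFEEX, n = 1, g = wlddev$iso3c, t = wlddev$year))   # lagged one year within countries (L)
head(fgrowth(wlddev$PCGDP, n = 1, g = wlddev$iso3c, t = wlddev$year)) # year-on-year growth rates within countries (G)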

Using collapse’s fast functions and the programming principles laid out here can speed up grouped computations by orders of magnitude, even compared to packages like dplyr or data.table (see e.g. the benchmarks provided further down). Simple column-wise computations on matrices are also slightly faster than with base functions like colMeans or colSums, and of course a lot faster than applying these base functions to data.frames (which involves a conversion to a matrix). Fast row-wise operations are not really the focus of collapse for the moment, not least because they are less common. Using conversions with qM together with base functions like rowSums however does a very decent job of speeding them up (e.g. compare the speed of rowSums(qM(mtcars)) against rowSums(mtcars)).

3. Advanced Data Aggregation

The kind of advanced groupwise programming introduced in the previous section is the fastest and most customizable way of dealing with many data transformation problems, and it is also made highly compatible with workflows in packages like dplyr and plm (see the two vignettes on these subjects). Some tasks, such as multivariate aggregations on a single data.frame, are however so common that they call for a more compact solution which efficiently integrates multiple computational steps:

collap is a fast multi-purpose aggregation command designed to solve complex aggregation problems efficiently and with a minimum of coding. collap performs optimally together with the Fast Statistical Functions, but will also work with other functions.

To perform the above aggregation with collap, one would simply need to type:

collap(mtcars, mpg + disp ~ cyl + vs + am, list(fmean, fsd, fmin, fmax), keep.col.order = FALSE)
#   cyl vs am fmean.mpg fmean.disp   fsd.mpg  fsd.disp fmin.mpg fmin.disp fmax.mpg fmax.disp
# 1   4  0  1  26.00000   120.3000        NA        NA     26.0     120.3     26.0     120.3
# 2   4  1  0  22.90000   135.8667 1.4525839 13.969371     21.5     120.1     24.4     146.7
# 3   4  1  1  28.37143    89.8000 4.7577005 18.802128     21.4      71.1     33.9     121.0
# 4   6  0  1  20.56667   155.0000 0.7505553  8.660254     19.7     145.0     21.0     160.0
# 5   6  1  0  19.12500   204.5500 1.6317169 44.742634     17.8     167.6     21.4     258.0
# 6   8  0  0  15.05000   357.6167 2.7743959 71.823494     10.4     275.8     19.2     472.0
# 7   8  0  1  15.40000   326.0000 0.5656854 35.355339     15.0     301.0     15.8     351.0

The original idea behind collap is however better demonstrated with a different dataset. Consider the World Development Dataset wlddev included in the package and introduced in section 1:

head(wlddev)
#       country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA
# 1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.292   NA 114440000
# 2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.742   NA 233350000
# 3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.185   NA 114880000
# 4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA 33.624   NA 236450000
# 5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA 34.060   NA 302480000
# 6 Afghanistan   AFG 1966-01-01 1965   1960 South Asia Low income FALSE    NA 34.495   NA 370250000

Suppose we would like to aggregate this data by country and decade, but keep all that categorical information. With collap this is extremely simple:

head(collap(wlddev, ~ iso3c + decade))
#   country iso3c       date   year decade                     region      income  OECD    PCGDP
# 1   Aruba   ABW 1961-01-01 1962.5   1960 Latin America & Caribbean  High income FALSE       NA
# 2   Aruba   ABW 1967-01-01 1970.0   1970 Latin America & Caribbean  High income FALSE       NA
# 3   Aruba   ABW 1976-01-01 1980.0   1980 Latin America & Caribbean  High income FALSE       NA
# 4   Aruba   ABW 1987-01-01 1990.0   1990 Latin America & Caribbean  High income FALSE 23677.09
# 5   Aruba   ABW 1996-01-01 2000.0   2000 Latin America & Caribbean  High income FALSE 26766.93
# 6   Aruba   ABW 2007-01-01 2010.0   2010 Latin America & Caribbean  High income FALSE 25238.80
#     LIFEEX GINI      ODA
# 1 66.58583   NA       NA
# 2 69.14178   NA       NA
# 3 72.17600   NA 33630000
# 4 73.45356   NA 41563333
# 5 73.85773   NA 19857000
# 6 75.01078   NA       NA

Note that the columns of the data are in the original order and also retain all their attributes. To understand this result let us briefly examine the syntax of collap:

collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, custom = NULL,
       keep.by = TRUE, keep.col.order = TRUE, sort.row = TRUE,
       parallel = FALSE, mc.cores = 1L,
       return = c("wide","list","long","long_dupl"), give.names = "auto") # , ...

It is clear that X is the data and by supplies the grouping information, which can be a one- or two-sided formula or alternatively grouping vectors, factors, lists and GRP objects (like the Fast Statistical Functions). Then FUN provides the function(s) applied only to numeric variables in X and defaults to the mean, while catFUN provides the function(s) applied only to categorical variables in X and defaults to a fast implementation of the statistical mode. keep.col.order = TRUE specifies that the data is to be returned with the original column order. Thus in the above example it was sufficient to supply X and by, and collap did the rest for us.

Suppose we only want to aggregate the 4 series in this dataset. This can be done utilizing the cols argument:

head(collap(wlddev, ~ iso3c + decade, cols = 9:12))
#   iso3c decade    PCGDP   LIFEEX GINI      ODA
# 1   ABW   1960       NA 66.58583   NA       NA
# 2   ABW   1970       NA 69.14178   NA       NA
# 3   ABW   1980       NA 72.17600   NA 33630000
# 4   ABW   1990 23677.09 73.45356   NA 41563333
# 5   ABW   2000 26766.93 73.85773   NA 19857000
# 6   ABW   2010 25238.80 75.01078   NA       NA

As before we could use multiple functions by putting them in a named or unnamed list:

head(collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12))
#   iso3c decade fmean.PCGDP fmedian.PCGDP fsd.PCGDP fmean.LIFEEX fmedian.LIFEEX fsd.LIFEEX fmean.GINI
# 1   ABW   1960          NA            NA        NA     66.58583        66.6155  0.6595475         NA
# 2   ABW   1970          NA            NA        NA     69.14178        69.1400  0.9521791         NA
# 3   ABW   1980          NA            NA        NA     72.17600        72.2930  0.8054561         NA
# 4   ABW   1990    23677.09      25357.79 4100.7901     73.45356        73.4680  0.1152921         NA
# 5   ABW   2000    26766.93      26966.05  834.3735     73.85773        73.7870  0.2217034         NA
# 6   ABW   2010    25238.80      24629.08 1580.8698     75.01078        75.0160  0.3942914         NA
#   fmedian.GINI fsd.GINI fmean.ODA fmedian.ODA  fsd.ODA
# 1           NA       NA        NA          NA       NA
# 2           NA       NA        NA          NA       NA
# 3           NA       NA  33630000    33630000       NA
# 4           NA       NA  41563333    36710000 16691094
# 5           NA       NA  19857000    16530000 28602034
# 6           NA       NA        NA          NA       NA

With multiple functions, we could also request collap to return a long-format of the data:

head(collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12, return = "long"))
#   Function iso3c decade    PCGDP   LIFEEX GINI      ODA
# 1    fmean   ABW   1960       NA 66.58583   NA       NA
# 2    fmean   ABW   1970       NA 69.14178   NA       NA
# 3    fmean   ABW   1980       NA 72.17600   NA 33630000
# 4    fmean   ABW   1990 23677.09 73.45356   NA 41563333
# 5    fmean   ABW   2000 26766.93 73.85773   NA 19857000
# 6    fmean   ABW   2010 25238.80 75.01078   NA       NA

The final feature of collap I want to highlight at this point is the custom argument, which allows the user to circumvent the broad distinction between numeric and categorical data (and the associated FUN and catFUN arguments) and specify exactly which columns to aggregate using which functions:

head(collap(wlddev, ~ iso3c + decade, 
            custom = list(fmean = 9:12, fsd = 9:12, 
                          ffirst = c("country","region","income"), 
                          flast = c("year","date"),
                          fmode = "OECD")))
#   ffirst.country iso3c flast.date flast.year decade              ffirst.region ffirst.income
# 1          Aruba   ABW 1966-01-01       1965   1960 Latin America & Caribbean    High income
# 2          Aruba   ABW 1975-01-01       1974   1970 Latin America & Caribbean    High income
# 3          Aruba   ABW 1986-01-01       1985   1980 Latin America & Caribbean    High income
# 4          Aruba   ABW 1995-01-01       1994   1990 Latin America & Caribbean    High income
# 5          Aruba   ABW 2006-01-01       2005   2000 Latin America & Caribbean    High income
# 6          Aruba   ABW 2015-01-01       2014   2010 Latin America & Caribbean    High income
#   fmode.OECD fmean.PCGDP fsd.PCGDP fmean.LIFEEX fsd.LIFEEX fmean.GINI fsd.GINI fmean.ODA  fsd.ODA
# 1      FALSE          NA        NA     66.58583  0.6595475         NA       NA        NA       NA
# 2      FALSE          NA        NA     69.14178  0.9521791         NA       NA        NA       NA
# 3      FALSE          NA        NA     72.17600  0.8054561         NA       NA  33630000       NA
# 4      FALSE    23677.09 4100.7901     73.45356  0.1152921         NA       NA  41563333 16691094
# 5      FALSE    26766.93  834.3735     73.85773  0.2217034         NA       NA  19857000 28602034
# 6      FALSE    25238.80 1580.8698     75.01078  0.3942914         NA       NA        NA       NA

By setting the argument give.names = FALSE, the output can also be generated without changing the column names.
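
For instance, a sketch repeating the custom aggregation from above (without the fsd part, so that the column names stay unique) while retaining the original names:

head(collap(wlddev, ~ iso3c + decade,
            custom = list(fmean = 9:12,
                          ffirst = c("country","region","income"),
                          flast = c("year","date"),
                          fmode = "OECD"),
            give.names = FALSE))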

Aggregation Benchmarks

When it comes to larger aggregation problems, the performance of collapse is in line with data.table, and collapse offers the additional advantage of high-performance weighted and categorical aggregations:

# Creating a data.table with 10 columns and 1 mio. obs, including missing values
testdat <- na_insert(qDT(replicate(10, rnorm(1e6), simplify = FALSE)), prop = 0.1) # 10% missing
testdat[["g1"]] <- sample.int(1000, 1e6, replace = TRUE) # 1000 groups
testdat[["g2"]] <- sample.int(100, 1e6, replace = TRUE) # 100 groups

# The average group size is 10, there are about 100000 groups
GRP(testdat, ~ g1 + g2) 
# collapse grouping object of length 1000000 with 99998 ordered groups
# 
# Call: GRP.default(X = testdat, by = ~g1 + g2), unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#       1       8      10      10      12      26 
# 
# Groups with sizes: 
# 1.1 1.2 1.3 1.4 1.5 1.6 
#   7  13  10   5  16  18 
#   ---
#  1000.95  1000.96  1000.97  1000.98  1000.99 1000.100 
#       10        8       11       14       18        7

# dplyr vs. data.table vs. collap (calling Fast Functions):
library(dplyr)

# Sum
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(sum, na.rm = TRUE))
#    user  system elapsed 
#    0.52    0.01    0.53
system.time(testdat[, lapply(.SD, sum, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#    0.17    0.00    0.09
system.time(collap(testdat, ~ g1 + g2, fsum))
#    user  system elapsed 
#     0.1     0.0     0.1

# Product
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(prod, na.rm = TRUE))
#    user  system elapsed 
#    2.67    0.02    2.69
system.time(testdat[, lapply(.SD, prod, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#    0.24    0.01    0.21
system.time(collap(testdat, ~ g1 + g2, fprod))
#    user  system elapsed 
#    0.13    0.00    0.13

# Mean
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(mean.default, na.rm = TRUE)) 
#    user  system elapsed 
#    5.29    0.00    5.29
system.time(testdat[, lapply(.SD, mean, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#    0.16    0.05    0.18
system.time(collap(testdat, ~ g1 + g2))
#    user  system elapsed 
#    0.16    0.00    0.16

# Weighted Mean
w <- abs(100*rnorm(1e6)) + 1 
testdat[["w"]] <- w
# Seems not possible with dplyr ...
system.time(testdat[, lapply(.SD, weighted.mean, w = w, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#   10.92    0.00   10.92
system.time(collap(testdat, ~ g1 + g2, w = w))
#    user  system elapsed 
#    0.16    0.00    0.16

# Maximum
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(max, na.rm = TRUE))
#    user  system elapsed 
#    0.48    0.03    0.52
system.time(testdat[, lapply(.SD, max, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#    0.30    0.00    0.25
system.time(collap(testdat, ~ g1 + g2, fmax))
#    user  system elapsed 
#    0.12    0.00    0.12

# Median
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(median.default, na.rm = TRUE)) 
#    user  system elapsed 
#   46.92    0.00   47.15
system.time(testdat[, lapply(.SD, median, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#    0.50    0.00    0.46
system.time(collap(testdat, ~ g1 + g2, fmedian))
#    user  system elapsed 
#    0.70    0.01    0.72

# Variance
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(var, na.rm = TRUE)) 
#    user  system elapsed 
#   16.31    0.02   16.32
system.time(testdat[, lapply(.SD, var, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#    0.64    0.03    0.63
system.time(collap(testdat, ~ g1 + g2, fvar)) 
#    user  system elapsed 
#    0.21    0.00    0.20
# Note: fvar implements a numerically stable online variance using Welford's algorithm.

# Weighted Variance
# Don't know how to do this fast in dplyr or data.table. 
system.time(collap(testdat, ~ g1 + g2, fvar, w = w))
#    user  system elapsed 
#    0.22    0.00    0.22

# Last value
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(last))
#    user  system elapsed 
#    4.17    0.01    4.19
system.time(testdat[, lapply(.SD, last), keyby = c("g1","g2")])
#    user  system elapsed 
#    0.09    0.02    0.06
system.time(collap(testdat, ~ g1 + g2, flast, na.rm = FALSE)) 
#    user  system elapsed 
#    0.08    0.00    0.08
# Note: collapse functions ffirst and flast by default also remove missing values i.e. take the first and last non-missing data point

# Mode
# Defining a mode function in base R and applying it by groups is very slow, no matter whether you use dplyr or data.table. 
# There are solutions suggested on Stack Overflow using chained operations in data.table to compute the mode, 
# but I find those rather arcane and they are also not very fast. 
system.time(collap(testdat, ~ g1 + g2, fmode)) 
#    user  system elapsed 
#    1.17    0.03    1.21
# Note: This mode function uses index hashing in C++, it's a blast!

# Weighted Mode
system.time(collap(testdat, ~ g1 + g2, fmode, w = w))
#    user  system elapsed 
#    2.37    0.13    2.50

# Number of Distinct Values
# No straightforward data.table solution...
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(n_distinct, na.rm = TRUE))
#    user  system elapsed 
#    8.04    0.00    8.11
system.time(collap(testdat, ~ g1 + g2, fNdistinct)) 
#    user  system elapsed 
#    1.08    0.09    1.20

I believe that on really huge datasets aggregated on a multi-core machine, data.table’s memory efficiency and thread-parallelization will let it run faster, at least for its GForce-optimized functions, but that does not apply to most users (I have tested up to 10 million obs. on my laptop, where collapse is still very much in line). In comparison to collapse and data.table, the performance of dplyr on this data is rather poor, especially for base functions that are not as highly optimized as sum. I do however very much appreciate the tidyverse ecosystem for highly organized data exploration and transformation. Therefore I have created methods for all of the Fast Statistical Functions as well as collap, enabling them to be used effectively in the dplyr ecosystem, where they produce amazing speed gains. This is the subject of the ‘collapse and dplyr’ vignette.
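
A small sketch of what this looks like (the grouped_df methods aggregate the selected columns by the groups; details are in that vignette):

wlddev %>% group_by(iso3c, decade) %>% select_at(9:12) %>% fmean()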

Apart from its non-reliance on non-standard evaluation, a central advantage of collapse for programming is the speed it maintains on smaller problems, where its more efficient R code (compared to dplyr and data.table) really plays out:

# 12000 obs in 1500 groups: A more typical case
GRP(wlddev, ~ iso3c + decade)
# collapse grouping object of length 12744 with 1512 ordered groups
# 
# Call: GRP.default(X = wlddev, by = ~iso3c + decade), unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.000   6.000   9.000   8.429  11.000  11.000 
# 
# Groups with sizes: 
# ABW.1960 ABW.1970 ABW.1980 ABW.1990 ABW.2000 ABW.2010 
#        6        9       11        9       11        9 
#   ---
# ZWE.1970 ZWE.1980 ZWE.1990 ZWE.2000 ZWE.2010 ZWE.2020 
#        9       11        9       11        9        4

library(microbenchmark)
dtwlddev <- qDT(wlddev)
microbenchmark(dplyr = dtwlddev %>% group_by(iso3c,decade) %>% select_at(9:12) %>% summarise_all(sum, na.rm = TRUE),
               data.table = dtwlddev[, lapply(.SD, sum, na.rm = TRUE), by = c("iso3c","decade"), .SDcols = 9:12],
               collap = collap(dtwlddev, ~ iso3c + decade, fsum, cols = 9:12),
               fast_fun = fsum(get_vars(dtwlddev, 9:12), GRP(dtwlddev, ~ iso3c + decade), use.g.names = FALSE)) # We can gain a bit coding it manually
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval cld
#       dplyr 12.560093 13.373157 15.545487 14.007945 14.527154 52.813437   100   c
#  data.table  3.087589  3.268543  3.785641  3.497691  3.942377  5.959194   100  b 
#      collap  1.279393  1.390732  1.502004  1.471502  1.533308  3.221463   100 a  
#    fast_fun  1.162922  1.260651  1.338360  1.332273  1.400103  1.850144   100 a

# Now going really small:
dtmtcars <- qDT(mtcars)
microbenchmark(dplyr = dtmtcars %>% group_by(cyl,vs,am) %>% summarise_all(sum, na.rm = TRUE),      # Large R overhead
               data.table = dtmtcars[, lapply(.SD, sum, na.rm = TRUE), by = c("cyl","vs","am")],   # Large R overhead
               collap = collap(dtmtcars, ~ cyl + vs + am, fsum),                                   # Now this is still quite efficient
               fast_fun = fsum(dtmtcars, GRP(dtmtcars, ~ cyl + vs + am), use.g.names = FALSE))     # And this is nearly the speed of a full C++ implementation
# Unit: microseconds
#        expr      min       lq      mean    median        uq      max neval  cld
#       dplyr 1591.766 1798.602 1931.9456 1925.7830 2024.8495 3150.956   100   c 
#  data.table 2880.976 2998.340 3238.6844 3190.2265 3397.0615 4434.366   100    d
#      collap  166.897  204.158  240.7019  246.3290  268.6415  338.256   100  b  
#    fast_fun   85.680  105.315  129.8584  128.2965  149.7165  263.733   100 a

In general, the smaller the problem, the greater the advantage collapse has over other packages, because its R overhead (i.e. the R code executed before the actual C function doing the hard work is called) is carefully minimized. Most users working on typical datasets (< 1 million obs.) will find that their code runs significantly faster when implemented with collapse compared to other solutions.

4. Data Transformations

collapse also provides an ensemble of functions to perform common data transformations extremely efficiently and in a user-friendly way. I start off this section by briefly introducing two apply functions I thought were missing from the base R ensemble, and then quickly move on to the more involved functions that carry out extremely fast grouped transformations.

4.1. Row and Column Data Apply

dapply is an efficient apply command for matrices and data.frames. It can be used to apply functions to rows or (by default) columns of matrices or data.frames and by default returns objects of the same type and with the same attributes.

dapply(mtcars, median)
#     mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear    carb 
#  19.200   6.000 196.300 123.000   3.695   3.325  17.710   0.000   0.000   4.000   2.000

dapply(mtcars, median, MARGIN = 1) 
#           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive   Hornet Sportabout 
#               4.000               4.000               4.000               3.215               3.440 
#             Valiant          Duster 360           Merc 240D            Merc 230            Merc 280 
#               3.460               4.000               4.000               4.000               4.000 
#           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
#               4.000               4.070               3.730               3.780               5.250 
# Lincoln Continental   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#               5.424               5.345               4.000               4.000               4.000 
#       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28    Pontiac Firebird 
#               3.700               3.520               3.435               4.000               3.845 
#           Fiat X1-9       Porsche 914-2        Lotus Europa      Ford Pantera L        Ferrari Dino 
#               4.000               4.430               4.000               5.000               6.000 
#       Maserati Bora          Volvo 142E 
#               8.000               4.000

dapply(mtcars, quantile)
#         mpg cyl    disp    hp  drat      wt    qsec vs am gear carb
# 0%   10.400   4  71.100  52.0 2.760 1.51300 14.5000  0  0    3    1
# 25%  15.425   4 120.825  96.5 3.080 2.58125 16.8925  0  0    3    2
# 50%  19.200   6 196.300 123.0 3.695 3.32500 17.7100  0  0    4    2
# 75%  22.800   8 326.000 180.0 3.920 3.61000 18.9000  1  1    4    4
# 100% 33.900   8 472.000 335.0 4.930 5.42400 22.9000  1  1    5    8

head(dapply(mtcars, quantile, MARGIN = 1))
#                   0%    25%   50%    75% 100%
# Mazda RX4          0 3.2600 4.000 18.730  160
# Mazda RX4 Wag      0 3.3875 4.000 19.010  160
# Datsun 710         1 1.6600 4.000 20.705  108
# Hornet 4 Drive     0 2.0000 3.215 20.420  258
# Hornet Sportabout  0 2.5000 3.440 17.860  360
# Valiant            0 1.8800 3.460 19.160  225

head(dapply(mtcars, log)) # This is considerably more efficient than log(mtcars)
#                        mpg      cyl     disp       hp     drat        wt     qsec   vs   am     gear
# Mazda RX4         3.044522 1.791759 5.075174 4.700480 1.360977 0.9631743 2.800933 -Inf    0 1.386294
# Mazda RX4 Wag     3.044522 1.791759 5.075174 4.700480 1.360977 1.0560527 2.834389 -Inf    0 1.386294
# Datsun 710        3.126761 1.386294 4.682131 4.532599 1.348073 0.8415672 2.923699    0    0 1.386294
# Hornet 4 Drive    3.063391 1.791759 5.552960 4.700480 1.124930 1.1678274 2.967333    0 -Inf 1.098612
# Hornet Sportabout 2.928524 2.079442 5.886104 5.164786 1.147402 1.2354715 2.834389 -Inf -Inf 1.098612
# Valiant           2.895912 1.791759 5.416100 4.653960 1.015231 1.2412686 3.006672    0 -Inf 1.098612
#                        carb
# Mazda RX4         1.3862944
# Mazda RX4 Wag     1.3862944
# Datsun 710        0.0000000
# Hornet 4 Drive    0.0000000
# Hornet Sportabout 0.6931472
# Valiant           0.0000000

dapply preserves the data structure:
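For example (a minimal sketch): applying a function to a data.frame returns a data.frame, and applying it to a matrix returns a matrix.

str(dapply(mtcars, log), give.attr = FALSE)   # A data.frame in, a data.frame out
str(dapply(qM(mtcars), log))                  # A matrix in (qM converts), a matrix out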

It also delivers seamless conversions, i.e. you can apply functions to data frame rows or columns and return a matrix or vice-versa:
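A rough sketch, assuming dapply's return argument:

head(dapply(mtcars, log, return = "matrix"), 2)          # data.frame input, matrix output
head(dapply(qM(mtcars), log, return = "data.frame"), 2)  # matrix input, data.frame output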

I do not provide benchmarks here, but dapply is also very efficient. On data.frames, the performance is comparable to lapply, and dapply is about 2x faster than apply for row- or column-wise operations on matrices. The most important feature for me however is that it does not change the structure of the data at all: all attributes are preserved, so you can use dapply on a data table, grouped tibble, or on a time-series matrix and get a transformed object of the same class back (unless the result is a scalar in which case dapply by default simplifies and returns a vector).

4.2. Split-Apply-Combine Computing

BY is a generalization of dapply for grouped computations using functions that are not part of the Fast Statistical Functions introduced above. It is fundamentally a reimplementation of the lapply(split(x, g), FUN, ...) computing paradigm in base R, but substantially faster and more versatile than functions like tapply, by or aggregate. It is however not faster than dplyr, which remains the best solution for larger grouped computations on data.frames requiring split-apply-combine computing.

BY is an S3 generic with methods for vector, matrix, data.frame and grouped_df. It also supports the same grouping (g) inputs as the Fast Statistical Functions (grouping vectors, factors, lists or GRP objects). Below I demonstrate the use of BY on vectors, matrices and data.frames.

v <- iris$Sepal.Length   # A numeric vector
f <- iris$Species        # A factor

## default vector method
BY(v, f, sum)                          # Sum by species, about 2x faster than tapply(v, f, sum)
#     setosa versicolor  virginica 
#      250.3      296.8      329.4

BY(v, f, quantile)                     # Species quantiles: by default stacked
#       setosa.0%      setosa.25%      setosa.50%      setosa.75%     setosa.100%   versicolor.0% 
#           4.300           4.800           5.000           5.200           5.800           4.900 
#  versicolor.25%  versicolor.50%  versicolor.75% versicolor.100%    virginica.0%   virginica.25% 
#           5.600           5.900           6.300           7.000           4.900           6.225 
#   virginica.50%   virginica.75%  virginica.100% 
#           6.500           6.900           7.900

BY(v, f, quantile, expand.wide = TRUE) # Wide format
#             0%   25% 50% 75% 100%
# setosa     4.3 4.800 5.0 5.2  5.8
# versicolor 4.9 5.600 5.9 6.3  7.0
# virginica  4.9 6.225 6.5 6.9  7.9

## matrix method
miris <- qM(num_vars(iris))
BY(miris, f, sum)                          # Also returns as matrix
#            Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa            250.3       171.4         73.1        12.3
# versicolor        296.8       138.5        213.0        66.3
# virginica         329.4       148.7        277.6       101.3

head(BY(miris, f, quantile))
#               Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa.0%              4.3       2.300        1.000         0.1
# setosa.25%             4.8       3.200        1.400         0.2
# setosa.50%             5.0       3.400        1.500         0.2
# setosa.75%             5.2       3.675        1.575         0.3
# setosa.100%            5.8       4.400        1.900         0.6
# versicolor.0%          4.9       2.000        3.000         1.0

BY(miris, f, quantile, expand.wide = TRUE)[,1:5]
#            Sepal.Length.0% Sepal.Length.25% Sepal.Length.50% Sepal.Length.75% Sepal.Length.100%
# setosa                 4.3            4.800              5.0              5.2               5.8
# versicolor             4.9            5.600              5.9              6.3               7.0
# virginica              4.9            6.225              6.5              6.9               7.9

BY(miris, f, quantile, expand.wide = TRUE, return = "list")[1:2] # list of matrices
# $Sepal.Length
#             0%   25% 50% 75% 100%
# setosa     4.3 4.800 5.0 5.2  5.8
# versicolor 4.9 5.600 5.9 6.3  7.0
# virginica  4.9 6.225 6.5 6.9  7.9
# 
# $Sepal.Width
#             0%   25% 50%   75% 100%
# setosa     2.3 3.200 3.4 3.675  4.4
# versicolor 2.0 2.525 2.8 3.000  3.4
# virginica  2.2 2.800 3.0 3.175  3.8

## data.frame method
BY(num_vars(iris), f, sum)             # Also returns a data.frame etc...
#            Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa            250.3       171.4         73.1        12.3
# versicolor        296.8       138.5        213.0        66.3
# virginica         329.4       148.7        277.6       101.3

## Conversions
identical(BY(num_vars(iris), f, sum), BY(miris, f, sum, return = "data.frame"))
# [1] TRUE
identical(BY(miris, f, sum), BY(num_vars(iris), f, sum, return = "matrix"))
# [1] TRUE

4.3. Fast Replacing and Sweeping-out Statistics

TRA is an S3 generic that efficiently transforms data by either (column-wise) replacing data values with supplied statistics or sweeping the statistics out of the data. The 8 operations supported by TRA are:

TRA is also incorporated as an argument to all Fast Statistical Functions. Therefore it is only really necessary and advisable to use the TRA() function if both aggregate statistics and transformed data are required, or to sweep out statistics otherwise obtained (e.g. regression or correlation coefficients etc.). Below I compute the column means of the iris-matrix obtained above, and use them to demean that matrix.

The code below shows 3 identical ways to center data in the collapse package. For the very common centering and averaging tasks, collapse supplies 2 special functions fwithin and fbetween (discussed in section 4.5) which are slightly faster and more memory efficient than fmean(..., TRA = "-") and fmean(..., TRA = "replace").
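A minimal sketch using the iris matrix miris from above (not necessarily the original code):

stats <- fmean(miris)                     # Column means of the iris matrix
head(TRA(miris, stats, "-"), 2)           # Sweep the means out with TRA
head(fmean(miris, TRA = "-"), 2)          # Same, using the TRA argument of fmean
head(fwithin(miris), 2)                   # Same, using the dedicated centering function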

All of the above is functionality also offered by base::sweep, although TRA is about 4x faster. The big advantage of TRA is that it also supports grouped operations:
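For example, subtracting species means from the iris matrix (a rough sketch):

head(TRA(miris, fmean(miris, f), "-", f), 2)      # Subtract group means computed beforehand
head(fmean(miris, f, TRA = "-"), 2)               # Same in one step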

A somewhat special operation performed by TRA is the grouped centering on the overall statistic (which for the mean is also performed more efficiently by fwithin):
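A minimal sketch (add.global.mean as also used for W / fwithin further below):

head(fmean(miris, f, TRA = "-+"), 2)                  # Subtract group means, add back the overall mean
head(fwithin(miris, f, add.global.mean = TRUE), 2)    # Same, slightly more efficient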

This is the within transformation also computed by qsu discussed in section 1. Its utility in the case of grouped centering is demonstrated visually in section 4.5.

4.4. Fast Standardizing

The function fscale can be used to efficiently standardize (i.e. scale and center) data using a numerically stable online algorithm. Its structure is the same as the Fast Statistical Functions. The standardization-operator STD also exists as a wrapper around fscale. The difference is that by default STD adds a prefix to standardized variables and also provides an enhanced method for data.frames (more about operators in the next section).

Scaling with fscale / STD can also be done groupwise and / or weighted. For example, the Groningen Growth and Development Center 10-Sector Database provides annual series of value added in local currency and persons employed for 10 broad sectors in several African, Asian, and Latin American countries.

If we wanted to correlate this data across countries and sectors, it needs to be standardized:
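A rough sketch of such a standardization (assuming the sectoral series are stored in columns 6:16 of GGDC10S):

head(STD(GGDC10S, ~ Variable + Country, cols = 6:16), 3)   # Standardize by variable (VA / EMP) and country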

4.5. Fast Centering and Averaging

As a slightly faster alternative to fmean(x, g, w, TRA = "-"/"-+") or fmean(x, g, w, TRA = "replace"/"replace_fill"), fwithin and fbetween can be used to perform common (grouped, weighted) centering and averaging tasks (also known as between- and within- transformations in the language of panel-data econometrics, thus the names). The operators W and B also exist.

To demonstrate more clearly the utility of the operators, which exist for all fast transformation and time-series functions, the code below implements the task of demeaning 4 series by country and saving the country id, using the within-operator W as opposed to fwithin, which requires all input to be passed externally like the Fast Statistical Functions.
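A minimal sketch of the described task:

head(W(wlddev, ~ iso3c, cols = 9:12), 3)                    # Operator: keeps the id, adds a W. prefix
head(fwithin(get_vars(wlddev, 9:12), wlddev$iso3c), 3)      # Function: all inputs passed externally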

It is also possible to drop the ids in W using the argument keep.by = FALSE. fbetween / B and fwithin / W each have one additional computational option:

# This replaces missing values with the group-mean: Same as fmean(x, g, TRA = "replace_fill")
head(B(wlddev, ~ iso3c, cols = 9:12, fill = TRUE))
#   iso3c  B.PCGDP B.LIFEEX B.GINI      B.ODA
# 1   AFG 482.1631 47.88216     NA 1351073448
# 2   AFG 482.1631 47.88216     NA 1351073448
# 3   AFG 482.1631 47.88216     NA 1351073448
# 4   AFG 482.1631 47.88216     NA 1351073448
# 5   AFG 482.1631 47.88216     NA 1351073448
# 6   AFG 482.1631 47.88216     NA 1351073448

# This adds back the global mean after subtracting out group means: Same as fmean(x, g, TRA = "-+")
head(W(wlddev, ~ iso3c, cols = 9:12, add.global.mean = TRUE))
#   iso3c W.PCGDP W.LIFEEX W.GINI      W.ODA
# 1   AFG      NA 48.25093     NA -807886980
# 2   AFG      NA 48.70093     NA -688976980
# 3   AFG      NA 49.14393     NA -807446980
# 4   AFG      NA 49.58293     NA -685876980
# 5   AFG      NA 50.01893     NA -619846980
# 6   AFG      NA 50.45393     NA -552076980
# Note: This is not just slightly faster than fmean(x, g, TRA = "-+"), but if weights are used, fmean(x, g, w, "-+")
# gives a wrong result: It subtracts weighted group means but then centers on the frequency-weighted average of those group means,
# whereas fwithin(x, g, w, add.global.mean = TRUE) will also center on the properly weighted overall mean. 

# Visual demonstration of centering on the global mean vs. simple centering
oldpar <- par(mfrow = c(1,3)) 
plot(iris[1:2], col = iris$Species, main = "Raw Data")                       # Raw data
plot(W(iris, ~ Species)[2:3], col = iris$Species, main = "Simple Centering") # Simple centering
plot(W(iris, ~ Species, add.global.mean = TRUE)[2:3], col = iris$Species,    # Centering on overall mean: Preserves level of data
     main = "add.global.mean") 

Another great utility of operators is that they can be employed in regression formulas in a manner that is both very efficient and pleasing to the eye. Below I demonstrate the use of W and B to efficiently run fixed-effects regressions with lm.
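A rough sketch (simpler than the original examples) of within and between regressions specified directly in the formula:

summary(lm(W(PCGDP, iso3c) ~ W(LIFEEX, iso3c), data = wlddev))   # Within (country fixed effects) regression
summary(lm(B(PCGDP, iso3c) ~ B(LIFEEX, iso3c), data = wlddev))   # Regression on the expanded country means (between variation)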

In general I recommend calling the full functions (i.e. fwithin or fscale etc.) for programming since they are a bit more efficient on the R-side of things and take all inputs directly as data. For all other purposes I find the operators more convenient. It is important to note that the operators can do everything the functions can do (i.e. you can also pass grouping vectors or GRP objects to them). They are just simple wrappers that in the data.frame method add 4 additional features: a formula interface to specify groups (and weights), a cols argument to select the columns to transform, the option to keep or drop the grouping columns (keep.by / keep.ids), and a prefix added to the names of transformed variables.

That’s it about operators! If you like this kind of parsimony use them, otherwise leave it.

4.6. HD Centering and Linear Prediction

Sometimes simple centering is not enough, for example if a linear model with multiple levels of fixed effects needs to be estimated, potentially involving interactions with continuous covariates. For these purposes fHDwithin / HDW and fHDbetween / HDB were created as efficient multi-purpose functions for linear prediction and partialling out. They operate by splitting complex regression problems into 2 parts: factors and factor-interactions are projected out using lfe::demeanlist, an efficient C routine for centering vectors on multiple factors, whereas continuous variables are dealt with using a standard QR decomposition in base R. The examples below show the use of the HDW operator in manually solving a regression problem with country and time fixed effects.
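A rough sketch, assuming HDW's two-sided formula interface:

res <- HDW(wlddev, PCGDP + LIFEEX ~ qF(iso3c) + qF(year))    # Center on country and year dummies
summary(lm(HDW.PCGDP ~ HDW.LIFEEX, data = res))              # Regression on the centered series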

We may wish to test whether including time fixed-effects in the above regression actually impacts the fit. This can be done with the fast F-test:
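A minimal sketch using fFtest (the argument order y, excluded variables, retained variables is assumed):

fFtest(wlddev$PCGDP, qF(wlddev$year), get_vars(wlddev, c("iso3c", "LIFEEX")))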

The test shows that the time fixed effects (accounted for as year dummies) are jointly significant.

One can also use fHDbetween / HDB and fHDwithin / HDW to project out interactions and continuous covariates. The interaction feature of HDW and HDB is still a bit experimental as lfe::demeanlist is not very fast at it.

I am hoping that the lfe package author, Simen Gaure, will at some point improve the part of the algorithm projecting out interactions. Otherwise I will code something myself to improve this feature. There have also been several packages published recently to estimate heterogeneous slopes models. I might take some time to look at those implementations and update HDW and HDB accordingly.

Transformation Benchmarks

Below I provide benchmarks for some very common data transformation tasks, again comparing collapse to dplyr and data.table:

# The average group size is 10, there are about 100000 groups
GRP(testdat, ~ g1 + g2) 
# collapse grouping object of length 1000000 with 99998 ordered groups
# 
# Call: GRP.default(X = testdat, by = ~g1 + g2), unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#       1       8      10      10      12      26 
# 
# Groups with sizes: 
# 1.1 1.2 1.3 1.4 1.5 1.6 
#   7  13  10   5  16  18 
#   ---
#  1000.95  1000.96  1000.97  1000.98  1000.99 1000.100 
#       10        8       11       14       18        7

# get indices of grouping columns 
ind <- get_vars(testdat, c("g1","g2"), "indices")

# Centering
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) x - mean.default(x, na.rm = TRUE)))
#    user  system elapsed 
#    8.31    0.02    8.35
system.time(testdat[, lapply(.SD, function(x) x - mean(x, na.rm = TRUE)), keyby = c("g1","g2")]) 
#    user  system elapsed 
#   10.71    0.01   10.94
system.time(W(testdat, ~ g1 + g2))
#    user  system elapsed 
#    0.21    0.00    0.22

# Weighted Centering
# Can't easily be done in dplyr.. 
system.time(testdat[, lapply(.SD, function(x) x - weighted.mean(x, w, na.rm = TRUE)), keyby = c("g1","g2")])
#    user  system elapsed 
#   13.78    0.00   13.91
system.time(W(testdat, ~ g1 + g2, ~ w))
#    user  system elapsed 
#    0.21    0.00    0.22

# Centering on the overall mean
# Can't easily be done in dplyr or data.table.
system.time(W(testdat, ~ g1 + g2, add.global.mean = TRUE))      # Ordinary 
#    user  system elapsed 
#    0.21    0.00    0.22
system.time(W(testdat, ~ g1 + g2, ~ w, add.global.mean = TRUE)) # Weighted
#    user  system elapsed 
#    0.21    0.00    0.22

# Centering on both grouping variables simultaneously
# Can't be done in dplyr or data.table at all!
system.time(HDW(testdat, ~ qF(g1) + qF(g2), variable.wise = TRUE))        # Ordinary
#    user  system elapsed 
#    0.82    0.02    0.82
system.time(HDW(testdat, ~ qF(g1) + qF(g2), w = w, variable.wise = TRUE)) # Weighted
#    user  system elapsed 
#    0.95    0.03    0.99

# Proportions
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) x/sum(x, na.rm = TRUE)))
#    user  system elapsed 
#    4.70    0.00    4.71
system.time(testdat[, lapply(.SD, function(x) x/sum(x, na.rm = TRUE)), keyby = c("g1","g2")])
#    user  system elapsed 
#    2.10    0.00    2.09
system.time(fsum(get_vars(testdat, -ind), get_vars(testdat, ind), "/"))
#    user  system elapsed 
#    0.17    0.00    0.17

# Scaling
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) x/sd(x, na.rm = TRUE)))
#    user  system elapsed 
#   19.48    0.06   19.58
system.time(testdat[, lapply(.SD, function(x) x/sd(x, na.rm = TRUE)), keyby = c("g1","g2")])
#    user  system elapsed 
#   15.81    0.05   15.83
system.time(fsd(get_vars(testdat, -ind), get_vars(testdat, ind), TRA = "/"))
#    user  system elapsed 
#    0.31    0.00    0.32
system.time(fsd(get_vars(testdat, -ind), get_vars(testdat, ind), w, "/")) # Weighted scaling: would need a weighted sd to do this in dplyr or data.table
#    user  system elapsed 
#    0.29    0.00    0.28

# Scaling and centering (i.e. standardizing)
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) (x - mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)))
#    user  system elapsed 
#   23.89    0.03   24.00
system.time(testdat[, lapply(.SD, function(x) (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)), keyby = c("g1","g2")])
#    user  system elapsed 
#   27.25    0.06   27.57
system.time(STD(testdat, ~ g1 + g2))
#    user  system elapsed 
#    0.33    0.00    0.33
system.time(STD(testdat, ~ g1 + g2, ~ w))  # Weighted standardizing: Also difficult to do in dplyr or data.table
#    user  system elapsed 
#    0.33    0.00    0.33

# Replacing data with any statistic, here the sum:
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(sum, na.rm = TRUE))
#    user  system elapsed 
#    0.85    0.03    0.88
system.time(testdat[, setdiff(names(testdat), c("g1","g2")) := lapply(.SD, sum, na.rm = TRUE), keyby = c("g1","g2")])
#    user  system elapsed 
#    1.36    0.05    1.30
system.time(fsum(get_vars(testdat, -ind), get_vars(testdat, ind), "replace_fill")) # dplyr and data.table also fill missing values. 
#    user  system elapsed 
#    0.07    0.00    0.07
system.time(fsum(get_vars(testdat, -ind), get_vars(testdat, ind), "replace")) # This preserves missing values, and is not easily implemented in dplyr or data.table
#    user  system elapsed 
#    0.07    0.00    0.07

The message is clear: collapse outperforms dplyr and data.table both in scope and speed when it comes to grouped and / or weighted transformations of data. This capacity of collapse should make it attractive to econometricians and people programming with complex panel-data. In the ‘collapse and plm’ vignette I provide a programming example by implementing a more general case of the Hausman and Taylor (1981) estimator with two levels of fixed effects, as well as further benchmarks.

5. Time-Series and Panel-Series

collapse also makes some essential contributions in the time-series domain, particularly in the area of panel data and efficient and secure computations on unordered time-dependent vectors and panel-series.

5.1. Panel-Series to Array Conversions

Starting with data exploration and improved access to panel data, psmat is an S3 generic to efficiently obtain matrices or 3D-arrays from panel data.

Passing a data.frame of panel-series to psmat generates a 3D array:
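A rough sketch (assuming psmat's multi-variable formula interface):

str(psmat(wlddev, PCGDP ~ iso3c, ~ year))                          # One series -> country x year matrix
str(psmat(wlddev, PCGDP + LIFEEX + GINI + ODA ~ iso3c, ~ year))    # Several series -> 3D array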

psmat can also output a list of panel-series matrices, which can be used amongst other things to reshape the data with unlist2d (discussed in more detail in List-Processing section).

5.2. Panel-Series ACF, PACF and CCF

The correlation structure of panel-data can also be explored with psacf, pspacf and psccf. These functions are exact analogues to stats::acf, stats::pacf and stats::ccf. They use fscale to group-scale panel-data by the panel-id provided, and then compute the covariance of a sequence of panel-lags (generated with flag discussed below) with the group-scaled level-series, dividing by the variance of the group-scaled level series. The Partial-ACF is generated from the ACF using a Yule-Walker decomposition (as in stats::pacf).
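For example (a minimal sketch):

psacf(wlddev$PCGDP, wlddev$iso3c, wlddev$year)      # Panel-ACF of GDP per capita, grouped by country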

5.3. Fast Lags and Leads

flag and the corresponding lag- and lead- operators L and F are S3 generics to efficiently compute lags and leads on time-series and panel data. The code below shows how to compute simple lags and leads on the classic Box & Jenkins airline data that comes with R.
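A minimal sketch:

head(L(AirPassengers, -1:3), 10)      # The series together with 1 lead and 3 lags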

flag / L / F also work well on (time-series) matrices. Below I run a regression with daily closing prices of major European stock indices: Germany DAX (Ibis), Switzerland SMI, France CAC, and UK FTSE. The data are sampled in business time, i.e. weekends and holidays are omitted.
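The object freq used below is the frequency of the series; assuming that name, a minimal definition is:

freq <- frequency(EuStockMarkets)     # 260 business days per year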


# 1 annual lead and 1 annual lag
head(L(EuStockMarkets, -1:1*freq))                       
#      F260.DAX     DAX L260.DAX F260.SMI    SMI L260.SMI F260.CAC    CAC L260.CAC F260.FTSE   FTSE
# [1,]  1755.98 1628.75       NA   1846.6 1678.1       NA   1907.3 1772.8       NA    2515.8 2443.6
# [2,]  1754.95 1613.63       NA   1854.8 1688.5       NA   1900.6 1750.5       NA    2521.2 2460.2
# [3,]  1759.90 1606.51       NA   1845.3 1678.6       NA   1880.9 1718.0       NA    2493.9 2448.2
# [4,]  1759.84 1621.04       NA   1854.5 1684.1       NA   1873.5 1708.1       NA    2476.1 2470.4
# [5,]  1776.50 1618.16       NA   1870.5 1686.6       NA   1883.6 1723.1       NA    2497.1 2484.7
# [6,]  1769.98 1610.61       NA   1862.6 1671.6       NA   1868.5 1714.3       NA    2469.0 2466.8
#      L260.FTSE
# [1,]        NA
# [2,]        NA
# [3,]        NA
# [4,]        NA
# [5,]        NA
# [6,]        NA

# DAX regressed on its own 2 annual lags and the lags of the other indicators
summary(lm(DAX ~., data = L(EuStockMarkets, 0:2*freq))) 
# 
# Call:
# lm(formula = DAX ~ ., data = L(EuStockMarkets, 0:2 * freq))
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -240.46  -51.28  -12.01   45.19  358.02 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -564.02041   93.94903  -6.003 2.49e-09 ***
# L260.DAX      -0.12577    0.03002  -4.189 2.99e-05 ***
# L520.DAX      -0.12528    0.04103  -3.053  0.00231 ** 
# SMI            0.32601    0.01726  18.890  < 2e-16 ***
# L260.SMI       0.27499    0.02517  10.926  < 2e-16 ***
# L520.SMI       0.04602    0.02602   1.769  0.07721 .  
# CAC            0.59637    0.02349  25.389  < 2e-16 ***
# L260.CAC      -0.14283    0.02763  -5.169 2.72e-07 ***
# L520.CAC       0.05196    0.03657   1.421  0.15557    
# FTSE           0.01002    0.02403   0.417  0.67675    
# L260.FTSE      0.04509    0.02807   1.606  0.10843    
# L520.FTSE      0.10601    0.02717   3.902  0.00010 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 83.06 on 1328 degrees of freedom
#   (520 observations deleted due to missingness)
# Multiple R-squared:  0.9943,  Adjusted R-squared:  0.9942 
# F-statistic: 2.092e+04 on 11 and 1328 DF,  p-value: < 2.2e-16

The main innovation of flag / L / F is the ability to efficiently compute sequences of lags and leads on panel-data, and that this panel-data need not be ordered:
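For example (a rough sketch):

head(L(wlddev, -1:2, PCGDP + LIFEEX ~ iso3c, ~ year))    # 1 lead and 2 lags of 2 series on the unordered panel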

Behind the scenes this works by coercing the supplied panel-id (iso3c) and time-variable (year) to factor (or to a GRP object if multiple panel-ids or time-variables are supplied) and creating an ordering vector of the data. Panel-lags are then computed through the ordering vector while keeping track of individual groups and inserting NA (or any other value passed to the fill argument) in the right places. Thus the data need not be sorted to compute a fully-identified panel-lag, which is a key advantage over, say, the shift function in data.table. All of this is written very efficiently in C++, and comes with an additional benefit: if anything is wrong with the panel, i.e. there are repeated time-values within a group or jumps in the time-variable within a group, flag / L / F will let you know. To give an example:
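A minimal sketch (using decade as the time variable, which has repeated values within countries):

tryCatch(L(wlddev, 1, ~ iso3c, ~ decade, cols = 9:12),   # decade does not uniquely identify observations
         error = function(e) e$message)                  # within countries, so an error is thrown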

Note that all of this does not require the panel to be balanced. flag / L / F works fine on balanced and unbalanced panel data. One intended area of use, especially for the operators L and F, is to dramatically facilitate the implementation of dynamic models in various contexts. Below I show different ways L can be used to estimate a dynamic panel-model using lm:

# Different ways of regressing GDP on its lags and on life expectancy and its lags

# 1 - Precomputing lags
summary(lm(PCGDP ~ ., L(wlddev, 0:2, PCGDP + LIFEEX ~ iso3c, ~ year, keep.ids = FALSE)))     
# 
# Call:
# lm(formula = PCGDP ~ ., data = L(wlddev, 0:2, PCGDP + LIFEEX ~ 
#     iso3c, ~year, keep.ids = FALSE))
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -16621.0   -100.0    -17.2     86.2  11935.3 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -321.51378   63.37246  -5.073    4e-07 ***
# L1.PCGDP       1.31801    0.01061 124.173   <2e-16 ***
# L2.PCGDP      -0.31550    0.01070 -29.483   <2e-16 ***
# LIFEEX        -1.93638   38.24878  -0.051    0.960    
# L1.LIFEEX     10.01163   71.20359   0.141    0.888    
# L2.LIFEEX     -1.66669   37.70885  -0.044    0.965    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 791.3 on 7988 degrees of freedom
#   (4750 observations deleted due to missingness)
# Multiple R-squared:  0.9974,  Adjusted R-squared:  0.9974 
# F-statistic: 6.166e+05 on 5 and 7988 DF,  p-value: < 2.2e-16

# 2 - Ad-hoc computation in lm formula
summary(lm(PCGDP ~ L(PCGDP,1:2,iso3c,year) + L(LIFEEX,0:2,iso3c,year), wlddev))   
# 
# Call:
# lm(formula = PCGDP ~ L(PCGDP, 1:2, iso3c, year) + L(LIFEEX, 0:2, 
#     iso3c, year), data = wlddev)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -16621.0   -100.0    -17.2     86.2  11935.3 
# 
# Coefficients:
#                                 Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                   -321.51378   63.37246  -5.073    4e-07 ***
# L(PCGDP, 1:2, iso3c, year)L1     1.31801    0.01061 124.173   <2e-16 ***
# L(PCGDP, 1:2, iso3c, year)L2    -0.31550    0.01070 -29.483   <2e-16 ***
# L(LIFEEX, 0:2, iso3c, year)--   -1.93638   38.24878  -0.051    0.960    
# L(LIFEEX, 0:2, iso3c, year)L1   10.01163   71.20359   0.141    0.888    
# L(LIFEEX, 0:2, iso3c, year)L2   -1.66669   37.70885  -0.044    0.965    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 791.3 on 7988 degrees of freedom
#   (4750 observations deleted due to missingness)
# Multiple R-squared:  0.9974,  Adjusted R-squared:  0.9974 
# F-statistic: 6.166e+05 on 5 and 7988 DF,  p-value: < 2.2e-16

# 3 - Precomputing panel-identifiers
g = qF(wlddev$iso3c, na.exclude = FALSE)
t = qF(wlddev$year, na.exclude = FALSE)
summary(lm(PCGDP ~ L(PCGDP,1:2,g,t) + L(LIFEEX,0:2,g,t), wlddev))                 
# 
# Call:
# lm(formula = PCGDP ~ L(PCGDP, 1:2, g, t) + L(LIFEEX, 0:2, g, 
#     t), data = wlddev)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -16621.0   -100.0    -17.2     86.2  11935.3 
# 
# Coefficients:
#                          Estimate Std. Error t value Pr(>|t|)    
# (Intercept)            -321.51378   63.37246  -5.073    4e-07 ***
# L(PCGDP, 1:2, g, t)L1     1.31801    0.01061 124.173   <2e-16 ***
# L(PCGDP, 1:2, g, t)L2    -0.31550    0.01070 -29.483   <2e-16 ***
# L(LIFEEX, 0:2, g, t)--   -1.93638   38.24878  -0.051    0.960    
# L(LIFEEX, 0:2, g, t)L1   10.01163   71.20359   0.141    0.888    
# L(LIFEEX, 0:2, g, t)L2   -1.66669   37.70885  -0.044    0.965    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 791.3 on 7988 degrees of freedom
#   (4750 observations deleted due to missingness)
# Multiple R-squared:  0.9974,  Adjusted R-squared:  0.9974 
# F-statistic: 6.166e+05 on 5 and 7988 DF,  p-value: < 2.2e-16

5.4. Fast Differences and Growth Rates

Similarly to flag / L / F, fdiff / D computes sequences of suitably lagged / leaded and iterated differences on ordered and unordered time-series and panel-data, and fgrowth / G computes growth rates or log-differences. Using again the AirPassengers data, the seasonal decomposition shows significant seasonality:
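For instance, a quick STL decomposition (a minimal sketch):

plot(stl(AirPassengers, s.window = "periodic"))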

We can actually test the statistical significance of this seasonality around a cubic trend using again the fast F-test (same as running a regression with and without seasonal dummies and a cubic polynomial trend, but faster):
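A minimal sketch using fFtest (the argument order is assumed):

fFtest(AirPassengers, qF(cycle(AirPassengers)),          # Joint significance of the month dummies,
       poly(seq_along(AirPassengers), 3))                # controlling for a cubic polynomial trend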

The test shows significant seasonality. We can plot the series and the ordinary and seasonal (12-month) growth rate using:
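A rough sketch:

oldpar <- par(mfrow = c(1, 3))
plot(AirPassengers, main = "Passengers")
plot(G(AirPassengers), main = "Monthly growth rate")
plot(G(AirPassengers, 12), main = "Annual growth rate")
par(oldpar)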

It is evident that taking the annualized growth rate removes most of the periodic behavior. We can also compute second differences or growth rates of growth rates. Below I plot the ordinary and annual first and second differences of the data:
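A rough sketch:

oldpar <- par(mfrow = c(2, 2))
plot(D(AirPassengers), main = "First difference")
plot(D(AirPassengers, 1, 2), main = "Second difference")
plot(D(AirPassengers, 12), main = "Annual difference")
plot(D(AirPassengers, 12, 2), main = "Iterated annual difference")
par(oldpar)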

In general, both fdiff / D and fgrowth / G can compute sequences of lagged / leaded and iterated growth rates, as the code below shows:
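A minimal sketch:

head(G(AirPassengers, c(1, 12), 1:2), 12)    # Growth rates at lags 1 and 12, first and second iteration
head(D(AirPassengers, c(1, 12), 1:2), 12)    # The same for differences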

All of this also works for panel-data. The code below gives an example:
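A minimal sketch using the default method on panel vectors:

head(G(wlddev$PCGDP, c(1, 10), 1, wlddev$iso3c, wlddev$year), 12)   # 1- and 10-year growth rates of GDP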

The attached class-attribute allows calls of flag / L / F, fdiff / D and fgrowth / G to be nested. In the example below, L.matrix is called on the right half of the above sequence:

If n * diff (or n in flag / L / F) exceeds the length of the data or the average group size in panel-computations, all of these functions will throw appropriate errors:
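A minimal sketch:

tryCatch(flag(AirPassengers, 200), error = function(e) e$message)   # Lag exceeds the length of the series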

Of course fdiff / D and fgrowth / G also come with a data.frame method, making the computation of growth-variables on datasets very easy:
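A minimal sketch mirroring the benchmark calls further below:

head(G(wlddev, c(1, 10), 1, ~ iso3c, ~ year, cols = 9:12), 3)    # 1- and 10-year growth rates of 4 series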

One could also add variables by reference using data.table:
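A rough sketch (dtwlddev, a data.table version of wlddev, is also used in the benchmarks below; it is created here for completeness):

library(data.table)
dtwlddev <- qDT(wlddev)                          # data.table version of wlddev
newvars <- paste0("G1.", names(wlddev)[9:12])    # Names for the growth variables
dtwlddev[, (newvars) := G(.SD, 1, 1, iso3c, year), .SDcols = 9:12]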

When working with data.table it is important to realize that while collapse functions will work with data.table grouping using by or keyby, this is very slow because it will run a method-dispatch for every group. It is much better and more secure to utilize the functions' fast internal grouping facilities, as I have done in the above example.

The code below estimates a dynamic panel model regressing the 10-year growth rate of GDP per capita on its 10-year lagged level and the 10-year growth rate of life expectancy:
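A rough sketch of such a model:

summary(lm(G(PCGDP, 10, 1, iso3c, year) ~ L(PCGDP, 10, iso3c, year) +
             G(LIFEEX, 10, 1, iso3c, year), data = wlddev))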

To go even a step further, the code below regresses the 10-year growth rate of GDP on the 10-year lagged levels and 10-year growth rates of GDP and life expectancy, with country and time-fixed effects projected out using HDW. The standard errors are unreliable without bootstrapping, but this example nicely demonstrates the potential for complex estimations brought by collapse.

How long did it take to run this computation? About 4 milliseconds on my laptop (2x 2.2 GHz, 8 GB RAM), so there is plenty of room to do this with much larger data.

One of the inconveniences of the above computations is that they require declaring the panel-identifiers iso3c and year again and again for each function. A great remedy here are the plm classes pseries and pdata.frame, which collapse was built to support. To advocate for the use of these classes for panel data, here I show how one could run the same regression with plm:
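A rough sketch, assuming plm is installed:

pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c", "year"))
summary(lm(G(PCGDP, 10, 1) ~ L(PCGDP, 10) + G(LIFEEX, 10, 1), data = pwlddev))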

To learn more about the integration of collapse and plm, see the corresponding vignette.

Time-Computation Benchmarks

Below I provide some benchmarks for lags, differences and growth rates on panel-data. I will run microbenchmarks on the wlddev dataset. Benchmarks on larger panels are already provided in the other vignettes. Again I compare collapse to dplyr and data.table:

# We have a balanced panel of 216 countries, each observed for 59 years
descr(wlddev, cols = c("iso3c", "year"))
# Dataset: wlddev, 2 Variables, N = 12744
# -----------------------------------------------------------------------------------------------------
# iso3c (factor): Country Code
# Stats: 
#       N  Ndist
#   12744    216
# Table: 
#        ABW   AFG   AGO   ALB   AND   ARE
# Freq    59    59    59    59    59    59
# Perc  0.46  0.46  0.46  0.46  0.46  0.46
#   ---
#        VUT   WSM   XKX   YEM   ZAF   ZMB   ZWE
# Freq    59    59    59    59    59    59    59
# Perc  0.46  0.46  0.46  0.46  0.46  0.46  0.46
# 
# Summary of Table: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      59      59      59      59      59      59 
# -----------------------------------------------------------------------------------------------------
# year (numeric): 
# Stats: 
#       N  Ndist  Mean     SD   Min   Max  Skew  Kurt
#   12744     59  1989  17.03  1960  2018    -0   1.8
# Quant: 
#     1%    5%   25%   50%   75%   95%   99%
#   1960  1962  1974  1989  2004  2016  2018
# -----------------------------------------------------------------------------------------------------

# 1 Panel-Lag
suppressMessages(
microbenchmark(dplyr_not_ordered = wlddev %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(lag),
               dplyr_ordered = wlddev %>% arrange(iso3c,year) %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(lag),
               data.table_not_ordered = dtwlddev[, shift(.SD), keyby = iso3c, .SDcols = 9:12],
               data.table_ordered = dtwlddev[order(year), shift(.SD), keyby = iso3c, .SDcols = 9:12], 
               collapse_not_ordered = L(wlddev, 1, ~iso3c, cols = 9:12),
               collapse_ordered = L(wlddev, 1, ~iso3c, ~year, cols = 9:12),
               subtract_from_CNO = message("Panel-lag computed without timevar: Assuming ordered data")))
# Unit: microseconds
#                    expr       min         lq       mean     median         uq        max neval cld
#       dplyr_not_ordered 23116.533 23584.2005 30255.9278 24521.0985 26106.1705 307177.852   100   c
#           dplyr_ordered 28002.501 29062.3400 31068.8398 29727.4725 31726.6630  73125.748   100   c
#  data.table_not_ordered  4695.420  4914.7515  5993.6894  5087.4495  5271.9725  48162.196   100  b 
#      data.table_ordered  5737.410  6051.5675  6450.9372  6223.8195  6526.1525   8039.156   100  b 
#    collapse_not_ordered   320.852   425.9440   466.7711   475.2540   505.5990    592.617   100 a  
#        collapse_ordered   602.434   676.9585   709.4094   701.0560   752.3745    855.458   100 a  
#       subtract_from_CNO   166.004   230.4875   270.6094   289.6145   304.7875    348.073   100 a

# Sequence of 1 lead and 3 lags: Not possible in dplyr
microbenchmark(data.table_not_ordered = dtwlddev[, shift(.SD, -1:3), keyby = iso3c, .SDcols = 9:12],
               data.table_ordered = dtwlddev[order(year), shift(.SD, -1:3), keyby = iso3c, .SDcols = 9:12], 
               collapse_ordered = L(wlddev, -1:3, ~iso3c, ~year, cols = 9:12))    
# Unit: microseconds
#                    expr      min       lq     mean   median       uq       max neval cld
#  data.table_not_ordered 5970.351 6231.629 7077.991 6483.313 6663.374 64301.642   100   b
#      data.table_ordered 7121.224 7315.788 8157.786 7557.432 7740.840 67464.647   100   b
#        collapse_ordered  888.034  950.508 1006.896 1000.712 1073.673  1284.302   100  a

# 1 Panel-difference
microbenchmark(dplyr_not_ordered = wlddev %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(function(x) x - lag(x)),
               dplyr_ordered = wlddev %>% arrange(iso3c,year) %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(function(x) x - lag(x)), 
               data.table_not_ordered = dtwlddev[, lapply(.SD, function(x) x - shift(x)), keyby = iso3c, .SDcols = 9:12],
               data.table_ordered = dtwlddev[order(year), lapply(.SD, function(x) x - shift(x)), keyby = iso3c, .SDcols = 9:12], 
               collapse_ordered = D(wlddev, 1, 1, ~iso3c, ~year, cols = 9:12))                                                 
# Unit: microseconds
#                    expr       min        lq       mean     median        uq       max neval  cld
#       dplyr_not_ordered 24116.128 24905.317 28815.4084 25588.3000 27158.200 70433.981   100   c 
#           dplyr_ordered 29077.066 30000.129 33739.8743 30787.5340 32690.558 77775.651   100    d
#  data.table_not_ordered 14059.486 14690.257 16571.3306 15549.7310 15839.122 58180.460   100  b  
#      data.table_ordered 15069.791 15742.287 16955.6035 16578.3325 16998.475 55906.827   100  b  
#        collapse_ordered   624.301   712.212   754.2083   733.4085   794.098  1032.618   100 a

# Iterated Panel-Difference: Not straightforward in dplyr or data.table
microbenchmark(collapse_ordered = D(wlddev, 1, 2, ~iso3c, ~year, cols = 9:12))
# Unit: microseconds
#              expr     min      lq     mean   median       uq      max neval
#  collapse_ordered 763.977 780.934 817.3077 812.1715 840.7315 1004.951   100

# Sequence of Lagged/Leaded Differences: Not straightforward in dplyr or data.table
microbenchmark(collapse_ordered = D(wlddev, -1:3, 1, ~iso3c, ~year, cols = 9:12))
# Unit: microseconds
#              expr     min      lq     mean   median       uq     max neval
#  collapse_ordered 983.531 991.786 1070.527 1028.379 1118.744 1296.35   100

# Sequence of Lagged/Leaded and Iterated Differences: Not straightforward in dplyr or data.table
microbenchmark(collapse_ordered = D(wlddev, -1:3, 1:2, ~iso3c, ~year, cols = 9:12))
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
#  collapse_ordered 2.015702 2.094688 2.284892 2.268056 2.360875 3.634243   100

# The same applies to growth rates or log-differences. 
microbenchmark(collapse_ordered_growth = G(wlddev, 1, 1, ~iso3c, ~year, cols = 9:12),
               collapse_ordered_logdiff = G(wlddev, 1, 1, ~iso3c, ~year, cols = 9:12, logdiff = TRUE))
# Unit: microseconds
#                      expr      min        lq      mean   median       uq      max neval cld
#   collapse_ordered_growth  761.299  778.4795  837.4469  828.459  880.001 1173.186   100  a 
#  collapse_ordered_logdiff 2897.041 2917.7920 3103.7793 3009.050 3302.681 3686.454   100   b

The results are similar to the grouped transformations: collapse dramatically facilitates and speeds up these complex operations in R. Again plm classes are very useful to avoid having to specify panel-identifiers all the time. See the ‘collapse and plm’ vignette for more details.

6. List Processing and a Panel-VAR Example

collapse also provides an ensemble of list-processing functions that grew out of a necessity of working with complex nested lists of data objects. The example provided in this section is also somewhat complex, but it demonstrates the utility of these functions while also providing a nice data-transformation task. When summarizing the GGDC10S data in section 1, it became clear that certain sectors have a high share of economic activity in almost all countries in the sample. The application I devised for this section is to see if there are common patterns in the interaction of these important sectors across countries. The approach for this will be an attempt to run a (Structural) Panel-Vector-Autoregression (SVAR) in value added with the 6 most important sectors (excluding government): agriculture, manufacturing, wholesale and retail trade, construction, transport and storage, and finance and real estate.

For this I will use the vars package. Since vars does not natively support panel-VARs, we need to create the central varest object manually and then run the SVAR function to impose identification restrictions. We start off exploring and harmonizing the data:

library(vars)
# The 6 most important non-government sectors (see section 1)
sec <- c("AGR","MAN","WRT","CON","TRA","FIRE")
# This creates a data.table containing the value added of the 6 most important non-government sectors 
data <- qDT(GGDC10S)[Variable == "VA"] %>% get_vars(c("Country","Year", sec)) %>% na.omit
# Let's look at the log VA in agriculture across countries:
AGRmat <- log(psmat(data, AGR ~ Country, ~ Year, transpose = TRUE))   # Converting to panel-series matrix
plot(AGRmat)

The plot shows quite some heterogeneity both in the levels (VA is in local currency) and in trend growth rates. In the panel-VAR estimation we are only really interested in the sectoral relationships within countries. Thus we need to harmonize this sectoral data further. One way would be taking growth rates or log-differences of the data, but VARs are usually estimated in levels unless the data are cointegrated (and value added series do not, in general, exhibit unit-root behavior). Thus, to harmonize the data further, I opt for subtracting a country-sector specific cubic trend from the data in logs:

# Subtracting a country specific cubic growth trend
AGRmat <- dapply(AGRmat, fHDwithin, poly(seq_row(AGRmat), 3), fill = TRUE)

plot(AGRmat)

This seems to have done a decent job in curbing some of that heterogeneity. Some series however have a high variance around that cubic trend. Therefore as a final step I standardize the data to bring the variances in line:

# Standardizing the cubic log-detrended data
AGRmat <- fscale(AGRmat)
plot(AGRmat)

Now this looks pretty good, and is about the most we can do in terms of harmonization without differencing the data. Below I apply these transformations to all sectors:

# Taking logs
get_vars(data, 3:8) <- dapply(get_vars(data, 3:8), log)
# Iteratively projecting out country FE and cubic trends from complete cases (still very slow)
get_vars(data, 3:8) <- HDW(data, ~ qF(Country)*poly(Year, 3), fill = TRUE)
# Scaling 
get_vars(data, 3:8) <- STD(data, ~ Country, cols = 3:8, keep.by = FALSE)

# Check the plot
plot(psmat(data, ~Country, ~Year))

Since the data is annual, let us estimate the Panel-VAR with one lag:

# This adds one lag of all series to the data 
add_vars(data) <- L(data, 1, ~ Country, ~ Year, keep.ids = FALSE) 
# This removes missing values from all but the first row and drops identifier columns (vars is made for time-series without gaps)
data <- rbind(data[1, -(1:2)], na.omit(data[-1, -(1:2)])) 
head(data)
#    STD.HDW.AGR STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE L1.STD.HDW.AGR
# 1:  0.65713943   2.2350583    1.946383 -0.03574399   1.0877811    1.0476507             NA
# 2: -0.14377115   1.8693570    1.905081  1.23225734   1.0542315    0.9105622     0.65713943
# 3: -0.09209879  -0.8212004    1.997253 -0.01783824   0.6718465    0.6134260    -0.14377115
# 4: -0.25213869  -1.7830320   -1.970855 -2.68332505  -1.8475551    0.4382902    -0.09209879
# 5: -0.31623401  -4.2931567   -1.822211 -2.75551916  -0.7066491   -2.1982640    -0.25213869
# 6: -0.72691916  -1.3219387   -2.079333 -0.12148295  -1.1398220   -2.2230474    -0.31623401
#    L1.STD.HDW.MAN L1.STD.HDW.WRT L1.STD.HDW.CON L1.STD.HDW.TRA L1.STD.HDW.FIRE
# 1:             NA             NA             NA             NA              NA
# 2:      2.2350583       1.946383    -0.03574399      1.0877811       1.0476507
# 3:      1.8693570       1.905081     1.23225734      1.0542315       0.9105622
# 4:     -0.8212004       1.997253    -0.01783824      0.6718465       0.6134260
# 5:     -1.7830320      -1.970855    -2.68332505     -1.8475551       0.4382902
# 6:     -4.2931567      -1.822211    -2.75551916     -0.7066491      -2.1982640

Having prepared the data, the code below estimates the panel-VAR using lm and creates the varest object:

# saving the names of the 6 sectors
nam <- names(data)[1:6]

pVAR <- list(varresult = setNames(lapply(seq_len(6), function(i)    # list of 6 lm's each regressing
               lm(as.formula(paste0(nam[i], "~ -1 + . ")),          # the sector on all lags of 
               get_vars(data, c(i, 7:length(data)))[-1])), nam),    # itself and other sectors, removing the missing first row
             datamat = data[-1],                                    # The full data containing levels and lags of the sectors, removing the missing first row
             y = do.call(cbind, get_vars(data, 1:6)),               # Only the levels data as matrix
             type = "none",                                         # No constant or tend term: We harmonized the data already
             p = 1,                                                 # The lag-order
             K = 6,                                                 # The number of variables
             obs = nrow(data)-1,                                    # The number of non-missing obs
             totobs = nrow(data),                                   # The total number of obs
             restrictions = NULL, 
             call = quote(VAR(y = data)))

class(pVAR) <- "varest"

The significant serial-correlation test below suggests that the panel-VAR with one lag is ill-identified, but the sample size is also quite large so the test is prone to reject, and it is likely also still picking up remaining cross-sectional heterogeneity. For the purposes of this vignette this shall not bother us.

serial.test(pVAR)
# 
#   Portmanteau Test (asymptotic)
# 
# data:  Residuals of VAR object pVAR
# Chi-squared = 1678.9, df = 540, p-value < 2.2e-16

By default the VAR is identified using a Choleski ordering of the direct impact matrix, in which the first variable (here Agriculture) is assumed not to be directly impacted by any other sector in the current period, and this descends down to the last variable (Finance and Real Estate), which is assumed to be impacted by all other sectors in the current period. For structural identification it is usually necessary to impose restrictions on the direct impact matrix in line with economic theory. I do not have any theories on the average worldwide interaction of broad economic sectors, but to aid identification I will compute the correlation matrix in growth rates and restrict the lowest coefficients to be 0, which should be better than just imposing a random Choleski ordering. This will also enable me to give a demonstration of the grouped tibble methods for collapse functions, discussed in more detail in the ‘collapse and dplyr’ vignette:

# This computes the pairwise correlations between standardized sectoral growth rates across countries
corr <- filter(GGDC10S, Variable == "VA") %>%   # Subset rows: Only VA
           group_by(Country) %>%                # Group by country
                get_vars(sec) %>%               # Select the 6 sectors
                   fgrowth %>%                  # Compute sectoral growth rates (a time-variable can be passed, but not necessary here as the data is ordered)
                      fscale %>%                # Scale and center (i.e. standardize)
                         pwcor                  # Compute Pairwise correlations

corr
#         G1.AGR G1.MAN G1.WRT G1.CON G1.TRA G1.FIRE
# G1.AGR      1     .55    .59    .39    .52     .41
# G1.MAN     .55     1     .67    .54    .65     .48
# G1.WRT     .59    .67     1     .56    .66     .52
# G1.CON     .39    .54    .56     1     .53     .46
# G1.TRA     .52    .65    .66    .53     1      .51
# G1.FIRE    .41    .48    .52    .46    .51      1

# We need to impose K*(K-1)/2 = 15 (with K = 6 variables) restrictions for identification
corr[corr <= sort(corr)[15]] <- 0
corr
#         G1.AGR G1.MAN G1.WRT G1.CON G1.TRA G1.FIRE
# G1.AGR      1     .55    .59    .00    .00     .00
# G1.MAN     .55     1     .67    .54    .65     .00
# G1.WRT     .59    .67     1     .56    .66     .00
# G1.CON     .00    .54    .56     1     .00     .00
# G1.TRA     .00    .65    .66    .00     1      .00
# G1.FIRE    .00    .00    .00    .00    .00      1

# The rest is unknown (i.e. will be estimated)
corr[corr > 0 & corr < 1] <- NA

# This estimates the Panel-SVAR using Maximum Likelihood:
pSVAR <- SVAR(pVAR, Amat = unclass(corr), estmethod = "direct")
pSVAR
# 
# SVAR Estimation Results:
# ======================== 
# 
# 
# Estimated A matrix:
#              STD.HDW.AGR STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE
# STD.HDW.AGR      1.00000    -0.58705     -0.2490      0.0000     0.00000            0
# STD.HDW.MAN      0.45708     1.00000      0.2374      0.1524    -1.23083            0
# STD.HDW.WRT      0.09161    -1.31439      1.0000      2.2581    -0.08235            0
# STD.HDW.CON      0.00000     0.01723     -1.3247      1.0000     0.00000            0
# STD.HDW.TRA      0.00000     0.90374     -0.3327      0.0000     1.00000            0
# STD.HDW.FIRE     0.00000     0.00000      0.0000      0.0000     0.00000            1

Now this object is quite involved, which brings us to the actual subject of this section:

# pSVAR$var$varresult is a list containing the 6 linear models fitted above; it is not displayed in full here.
str(pSVAR, give.attr = FALSE, max.level = 3)
# List of 13
#  $ A      : num [1:6, 1:6] 1 0.4571 0.0916 0 0 ...
#  $ Ase    : num [1:6, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
#  $ B      : num [1:6, 1:6] 1 0 0 0 0 0 0 1 0 0 ...
#  $ Bse    : num [1:6, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
#  $ LRIM   : NULL
#  $ Sigma.U: num [1:6, 1:6] 97.705 12.717 13.025 0.984 26.992 ...
#  $ LR     :List of 5
#   ..$ statistic: Named num 6218
#   ..$ parameter: Named num 7
#   ..$ p.value  : Named num 0
#   ..$ method   : chr "LR overidentification"
#   ..$ data.name: symbol data
#  $ opt    :List of 5
#   ..$ par        : num [1:14] 0.4571 0.0916 -0.587 -1.3144 0.0172 ...
#   ..$ value      : num 11538
#   ..$ counts     : Named int [1:2] 501 NA
#   ..$ convergence: int 1
#   ..$ message    : NULL
#  $ start  : num [1:14] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
#  $ type   : chr "A-model"
#  $ var    :List of 10
#   ..$ varresult   :List of 6
#   .. ..$ STD.HDW.AGR :List of 12
#   .. ..$ STD.HDW.MAN :List of 12
#   .. ..$ STD.HDW.WRT :List of 12
#   .. ..$ STD.HDW.CON :List of 12
#   .. ..$ STD.HDW.TRA :List of 12
#   .. ..$ STD.HDW.FIRE:List of 12
#   ..$ datamat     :Classes 'data.table' and 'data.frame': 2060 obs. of  12 variables:
#   .. ..$ STD.HDW.AGR    : num [1:2060] -0.1438 -0.0921 -0.2521 -0.3162 -0.7269 ...
#   .. ..$ STD.HDW.MAN    : num [1:2060] 1.869 -0.821 -1.783 -4.293 -1.322 ...
#   .. ..$ STD.HDW.WRT    : num [1:2060] 1.91 2 -1.97 -1.82 -2.08 ...
#   .. ..$ STD.HDW.CON    : num [1:2060] 1.2323 -0.0178 -2.6833 -2.7555 -0.1215 ...
#   .. ..$ STD.HDW.TRA    : num [1:2060] 1.054 0.672 -1.848 -0.707 -1.14 ...
#   .. ..$ STD.HDW.FIRE   : num [1:2060] 0.911 0.613 0.438 -2.198 -2.223 ...
#   .. ..$ L1.STD.HDW.AGR : num [1:2060] 0.6571 -0.1438 -0.0921 -0.2521 -0.3162 ...
#   .. ..$ L1.STD.HDW.MAN : num [1:2060] 2.235 1.869 -0.821 -1.783 -4.293 ...
#   .. ..$ L1.STD.HDW.WRT : num [1:2060] 1.95 1.91 2 -1.97 -1.82 ...
#   .. ..$ L1.STD.HDW.CON : num [1:2060] -0.0357 1.2323 -0.0178 -2.6833 -2.7555 ...
#   .. ..$ L1.STD.HDW.TRA : num [1:2060] 1.088 1.054 0.672 -1.848 -0.707 ...
#   .. ..$ L1.STD.HDW.FIRE: num [1:2060] 1.048 0.911 0.613 0.438 -2.198 ...
#   ..$ y           : num [1:2061, 1:6] 0.6571 -0.1438 -0.0921 -0.2521 -0.3162 ...
#   ..$ type        : chr "none"
#   ..$ p           : num 1
#   ..$ K           : num 6
#   ..$ obs         : num 2060
#   ..$ totobs      : int 2061
#   ..$ restrictions: NULL
#   ..$ call        : language VAR(y = data)
#  $ iter   : Named int 501
#  $ call   : language SVAR(x = pVAR, estmethod = "direct", Amat = unclass(corr))

6.1 List Search and Identification

When dealing with such a list-like object, we might be interested in its complexity by measuring the level of nesting. This can be done with ldepth:
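A minimal sketch:

ldepth(pSVAR)      # The maximum level of nesting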

Further we might be interested in knowing whether this list-object contains non-atomic elements like call, terms or formulas. The function is.regular in the collapse package checks if an object is atomic or list-like, and the recursive version is.unlistable checks whether all objects in a nested structure are atomic or list-like:
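A minimal sketch:

is.regular(pSVAR)        # Is pSVAR atomic or list-like?
is.unlistable(pSVAR)     # Are all elements in the nested structure atomic or list-like?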

Evidently this object is not unlistable; from viewing its structure we know that it contains several call and terms objects. We might also want to know whether this object stores some kind of residuals or fitted values. This can be done using has_elem, which also supports regular expression search of element names:
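A minimal sketch (the regex argument name is assumed):

has_elem(pSVAR, "residuals")              # Is there an element named 'residuals'?
has_elem(pSVAR, "fitted", regex = TRUE)   # Regular expression search of element names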

We might also want to know whether the object contains some kind of data-matrix. This can be checked by calling:
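A minimal sketch (assuming has_elem also accepts a function):

has_elem(pSVAR, is.matrix)     # Is a matrix stored anywhere in the object?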

These functions can sometimes be helpful in exploring objects, although for all practical purposes the viewer in RStudio is very informative. A much greater advantage of having functions to search and check lists is the ability to write more complex programs with them (which I will not demonstrate here).

6.2 List Subsetting

Having gathered some information about the pSVAR object in the previous section, this section introduces several extractor functions to pull out elements from such lists: get_elem can be used to pull out elements from lists in a simplified format.
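For example (a minimal sketch):

str(get_elem(pSVAR, "coefficients"))      # The 6 coefficient vectors, pulled out in simplified form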

Similarly, we could pull out and plot the fitted values:
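A rough sketch (assuming get_elem returns a flat list of the 6 fitted-value vectors):

fit <- get_elem(pSVAR, "fitted.values")
plot(ts(do.call(cbind, fit)), main = "Fitted values")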

Below I compute the main quantities of interest in SVAR analysis: The impulse response functions (IRF’s) and forecast error variance decompositions (FEVD’s):
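A minimal sketch using the standard vars functions (defaults assumed):

pIRF  <- irf(pSVAR)      # Orthogonal impulse response functions (with bootstrap confidence bands)
pFEVD <- fevd(pSVAR)     # Forecast error variance decompositions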

The pIRF object contains the IRF’s with lower and upper confidence bounds and some atomic elements providing information about the object:

# See the structure of a vars IRF object: 
str(pIRF, give.attr = FALSE)
# List of 11
#  $ irf       :List of 6
#   ..$ STD.HDW.AGR : num [1:11, 1:6] 0.87 0.531 0.33 0.21 0.138 ...
#   ..$ STD.HDW.MAN : num [1:11, 1:6] 0.274 0.1892 0.1385 0.1059 0.0833 ...
#   ..$ STD.HDW.WRT : num [1:11, 1:6] 0.0526 0.0514 0.0463 0.0399 0.0335 ...
#   ..$ STD.HDW.CON : num [1:11, 1:6] -0.1605 -0.1051 -0.0688 -0.0451 -0.0297 ...
#   ..$ STD.HDW.TRA : num [1:11, 1:6] 0.342 0.258 0.199 0.155 0.123 ...
#   ..$ STD.HDW.FIRE: num [1:11, 1:6] 0 0.0217 0.0263 0.0239 0.0191 ...
#  $ Lower     :List of 6
#   ..$ STD.HDW.AGR : num [1:11, 1:6] 0.429 0.298 0.208 0.136 0.088 ...
#   ..$ STD.HDW.MAN : num [1:11, 1:6] -0.4794 -0.289 -0.1769 -0.1105 -0.0665 ...
#   ..$ STD.HDW.WRT : num [1:11, 1:6] -0.489 -0.317 -0.23 -0.159 -0.123 ...
#   ..$ STD.HDW.CON : num [1:11, 1:6] -0.417 -0.272 -0.193 -0.141 -0.101 ...
#   ..$ STD.HDW.TRA : num [1:11, 1:6] -0.3445 -0.1904 -0.12 -0.0926 -0.0715 ...
#   ..$ STD.HDW.FIRE: num [1:11, 1:6] 0 -0.015 -0.0245 -0.0294 -0.0304 ...
#  $ Upper     :List of 6
#   ..$ STD.HDW.AGR : num [1:11, 1:6] 1.084 0.69 0.467 0.322 0.234 ...
#   ..$ STD.HDW.MAN : num [1:11, 1:6] 0.568 0.377 0.278 0.206 0.16 ...
#   ..$ STD.HDW.WRT : num [1:11, 1:6] 0.3814 0.2363 0.1618 0.1193 0.0944 ...
#   ..$ STD.HDW.CON : num [1:11, 1:6] 0.273 0.229 0.17 0.153 0.123 ...
#   ..$ STD.HDW.TRA : num [1:11, 1:6] 0.349 0.26 0.203 0.159 0.127 ...
#   ..$ STD.HDW.FIRE: num [1:11, 1:6] 0 0.0564 0.0734 0.0719 0.063 ...
#  $ response  : chr [1:6] "STD.HDW.AGR" "STD.HDW.MAN" "STD.HDW.WRT" "STD.HDW.CON" ...
#  $ impulse   : chr [1:6] "STD.HDW.AGR" "STD.HDW.MAN" "STD.HDW.WRT" "STD.HDW.CON" ...
#  $ ortho     : logi TRUE
#  $ cumulative: logi FALSE
#  $ runs      : num 100
#  $ ci        : num 0.05
#  $ boot      : logi TRUE
#  $ model     : chr "svarest"

We could separately access the top-level atomic or list elements using atomic_elem or list_elem:

There are also recursive versions of atomic_elem and list_elem named reg_elem and irreg_elem which can be used to split nested lists into the atomic and non-atomic parts. These are not covered in this vignette.

6.3 Data Apply and Unlisting in 2D

vars supplies plot methods for IRF and FEVD objects using base graphics, for example:

plot(pIRF) would give us 6 charts of all sectoral responses to each sectoral shock. In this section, however, I want to generate nicer plots using ggplot2 and also compute some statistics on the IRF data. Starting with the latter, the code below sums the 10-period impulse response coefficients of each sector in response to each sectoral impulse and stores them in a data.frame:
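A rough sketch of that computation (cum10 is just an illustrative name):

cum10 <- unlist2d(rapply2d(pIRF$irf, fsum), idcols = "Impulse")   # Column sums of each IRF matrix, bound to a data.frame
cum10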

The function rapply2d used here is very similar to base::rapply, with the difference that the result is not simplified / unlisted by default and that rapply2d will treat data.frames like atomic objects and apply functions to them. unlist2d is an efficient generalization of base::unlist to 2 dimensions, or one could also think of it as a recursive generalization of do.call(rbind, ...). It efficiently unlists nested lists of data objects and creates a data.frame with identifier columns for each level of nesting on the left, and the content of the list in columns on the right.

The above cumulative coefficients suggest that Agriculture responds mostly to its own shock, and a bit to shocks in Transport and Storage, Wholesale and Retail Trade and Manufacturing. The Finance and Real Estate sector seems even more independent and really only responds to its own dynamics. Manufacturing and Transport and Storage seem to be pretty interlinked with the other broad sectors. Wholesale and Retail Trade and Construction exhibit some strange dynamics (i.e. WRT responds more to the CON shock than to its own shock, and CON responds strongly negatively to the WRT shock).

Let us use ggplot2 to create nice compact plots of the IRF’s and FEVD’s. For this task unlist2d will again be extremely helpful in creating the data.frame representation required. Starting with the IRF’s, we will discard the upper and lower bounds and just use the impulses converted to a data.frame:

# This adds integer row-names to the matrices and binds them into a data.table
data <- pIRF$irf %>%                      # Get only the coefficient matrices, discard the confidence bounds
         lapply(setRownames) %>%          # Add integer rownames: setRownames(object, nm = seq_row(object))
           unlist2d(idcols = "Impulse",   # Recursive unlisting to data.table creating a factor id-column
                    row.names = "Time",   # and saving the generated rownames in a variable called 'Time'
                    id.factor = TRUE,     # -> Create Id column ('Impulse') as factor
                    DT = TRUE)            # -> Output as data.table (default is data.frame)

head(data)
#        Impulse Time STD.HDW.AGR  STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE
# 1: STD.HDW.AGR    1  0.86996584 -0.187344923 -0.08054962 -0.10347400  0.14250867  0.000000000
# 2: STD.HDW.AGR    2  0.53115414 -0.004310463  0.02878308 -0.05730488  0.11385039 -0.027348364
# 3: STD.HDW.AGR    3  0.33034056  0.068887682  0.07763893 -0.02047398  0.09576924 -0.018846740
# 4: STD.HDW.AGR    4  0.21048997  0.088854762  0.09274910  0.00486506  0.08235669 -0.002035014
# 5: STD.HDW.AGR    5  0.13808095  0.085218106  0.09045699  0.02015986  0.07106835  0.012522131
# 6: STD.HDW.AGR    6  0.09352334  0.072843227  0.08031275  0.02789669  0.06095866  0.021967877

# Coercing Time to numeric (from character)
data$Time <- as.numeric(data$Time)

# Using data.table's melt
data <- melt(data, 1:2)
head(data)
#        Impulse Time    variable      value
# 1: STD.HDW.AGR    1 STD.HDW.AGR 0.86996584
# 2: STD.HDW.AGR    2 STD.HDW.AGR 0.53115414
# 3: STD.HDW.AGR    3 STD.HDW.AGR 0.33034056
# 4: STD.HDW.AGR    4 STD.HDW.AGR 0.21048997
# 5: STD.HDW.AGR    5 STD.HDW.AGR 0.13808095
# 6: STD.HDW.AGR    6 STD.HDW.AGR 0.09352334

# Here comes the plot:
ggplot(data, aes(x = Time, y = value, color = Impulse)) +
  geom_line(size = 1) + geom_hline(yintercept = 0) +
  labs(y = NULL, title = "Orthogonal Impulse Response Functions") +
  scale_color_manual(values = rainbow(6)) +
  facet_wrap(~ variable) +
  theme_light(base_size = 14) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 7), expand = c(0, 0)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 7), expand = c(0, 0)) +
  theme(axis.text = element_text(colour = "black"),
        plot.title = element_text(hjust = 0.5),
        strip.background = element_rect(fill = "white", colour = NA),
        strip.text = element_text(face = "bold", colour = "grey30"),
        axis.ticks = element_line(colour = "black"),
        panel.border = element_rect(colour = "black"))

To round things off, below I do the same thing for the FEVDs:
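
A sketch of the analogous steps (assuming the decomposition computed earlier is stored in an object such as pFEVD, a list of variance-share matrices with one matrix per response series; output and chart omitted):

data <- pFEVD %>%                         # assumed: list of FEVD matrices, one per response series
         lapply(setRownames) %>%          # add integer rownames as before
           unlist2d(idcols = "variable",  # id column: the responding series
                    row.names = "Time",
                    id.factor = TRUE,
                    DT = TRUE)

data$Time <- as.numeric(data$Time)
data <- melt(data, 1:2, variable.name = "Shock", value.name = "Share")

ggplot(data, aes(x = Time, y = Share, fill = Shock)) +
  geom_area() +                           # shares sum to 1 at each horizon
  labs(y = NULL, title = "Forecast Error Variance Decompositions") +
  scale_fill_manual(values = rainbow(6)) +
  facet_wrap(~ variable) +
  theme_light(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5),
        strip.background = element_rect(fill = "white", colour = NA),
        strip.text = element_text(face = "bold", colour = "grey30"))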

Both the IRFs and the FEVDs show some strange behavior for Manufacturing, Wholesale and Retail Trade, and Construction. The FEVDs also display little variation over the forecast horizon, suggesting that longer lag-lengths might be appropriate. The most important point of critique for this analysis is the structural identification strategy, which is highly dubious (correlation does not imply causation, and I am also restricting sectoral relationships with a lower correlation to be 0 in the current period). A better approach could be to aggregate the World Input-Output Database and use those shares for identification (which would be another very nice collapse exercise, but not for this vignette).

Going Further

To learn more about collapse, I recommend simply exploring the documentation (help("collapse-documentation")), which is hierarchically organized, extensive, and contains lots of examples.



  1. In the Within data, the overall mean was added back after subtracting out country means to preserve the level of the data; see also section 4.5.

  2. qsu uses a numerically stable online algorithm generalized from Welford’s Algorithm to compute variances.
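
     For illustration, a minimal R sketch of such a one-pass update (just the principle, not the package’s C++ implementation):

     welford_var <- function(x) {
       n <- 0; m <- 0; M2 <- 0
       for (xi in x) {                  # single pass over the data
         n  <- n + 1
         d  <- xi - m
         m  <- m + d / n                # update the running mean
         M2 <- M2 + d * (xi - m)        # update the running sum of squared deviations
       }
       M2 / (n - 1)                     # Bessel-corrected variance
     }
     all.equal(welford_var(mtcars$mpg), var(mtcars$mpg))   # should be TRUE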

  3. Because missing values are stored as the smallest integer in C++ and the values of the factor are used directly to index result vectors in grouped computations, subsetting a vector with the smallest integer would break the C++ code of the Fast Statistical Functions and terminate the R session, which must be avoided.

  4. You may wonder why with weights the standard-deviations in the group ‘4.0.1’ are 0 while they were NA without weights. This stems from the fact that group ‘4.0.1’ only has one observation: in the Bessel-corrected estimate of the variance there is an n - 1 in the denominator, which becomes 0 if n = 1, and division by 0 yields NA in this case (fvar was designed that way to match the behavior of stats::var). In the weighted version the denominator is sum(w) - 1, and if sum(w) is not 1, the denominator is not 0. The standard-deviation, however, is still 0 because the sum of squares in the numerator is 0. In other words, in a weighted aggregation singleton groups are not treated like singleton groups unless the corresponding weight is 1.
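
     A quick illustration of this logic with a hypothetical single observation carrying a weight of 2:

     fsd(10)          # denominator n - 1 = 0 -> NA (matching stats::sd)
     fsd(10, w = 2)   # denominator sum(w) - 1 = 1, numerator 0 -> 0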

  5. One can also add a weight argument w = weights here, but fmin and fmax don’t support weights, and all S3 methods in this package give errors when encountering unknown arguments. To do a weighted aggregation one would have to either use only fmean and fsd, or employ a named list of functions wrapping fmin and fmax in a way that additional arguments are silently swallowed.
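
     For instance, a wrapper list along these lines could be supplied (a sketch; the list name is illustrative):

     # fmin and fmax wrappers whose '...' silently swallows the weight argument
     wFUNs <- list(fmean = fmean, fsd = fsd,
                   fmin = function(x, ...) fmin(x),
                   fmax = function(x, ...) fmax(x))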

  6. I.e. the most frequent value. If all values inside a group are either all equal or all distinct, fmode returns the first value instead.
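
     For example:

     fmode(c(2, 2, 5, 3))   # 2: the most frequent value
     fmode(c(4, 5, 6))      # 4: all values distinct, so the first value is returned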

  7. If the list is unnamed, collap uses all.vars(substitute(list(FUN1, FUN2, ...))) to get the function names. Alternatively, it is also possible to pass a character vector of function names.

  8. BY.grouped_df is probably only useful together with the expand.wide = TRUE argument, which dplyr does not have, because otherwise dplyr’s summarize and mutate are substantially faster on larger data.

  9. Included as example data in collapse and summarized in section 1

  10. I noticed there is a panelvar package, but I am more familiar with vars and panelvar can be pretty slow in my experience. We also have about 50 years of data here, so dynamic panel-bias is not a big issue.

  11. The vars package also provides convenient extractor functions for some quantities, but get_elem of course works in a much broader range of contexts.