collapse and dplyr

Fast (weighted) Aggregations, Transformations and Time-Series/Panel Computations in a dplyr Workflow

Sebastian Krantz

2020-03-12

collapse is a C/C++ based package for data manipulation in R. Its aims are

  1. to facilitate complex data transformation and exploration tasks and

  2. to help make R code fast, flexible, parsimonious and programmer friendly.

This vignette focuses on the integration of collapse and the popular dplyr package by Hadley Wickham. In particular it will demonstrate how using collapse’s fast functions can facilitate and speed up grouped and weighted aggregations and transformations, as well as panel-data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a dplyr workflow.


Note: This vignette is targeted at dplyr users. collapse is a standalone package and delivers even faster performance using its own grouping mechanism (based on data.table internals) and its own set of functions to efficiently select and replace variables. The ‘Introduction to collapse’ vignette provides a thorough introduction to the package and a built-in structured documentation is available under help("collapse-documentation") after installing the package. In addition, help("collapse-package") provides a compact set of examples for a quick start.


1. Fast Aggregations

A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct) which are able to dramatically speed up column-wise, grouped and weighted computations on vectors, matrices or data.frames. The functions are S3 generic, with a default (vector), matrix and data.frame method, as well as a grouped_df method for grouped tibbles used by dplyr. The grouped tibble method has the following arguments:

FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
               use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...)

where w is a weight variable (available only to fmean, fmode, fvar and fsd), and TRA can be used to transform x using the computed statistics and one of 8 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*"). These transformations perform grouped replacing or sweeping out of the statistics computed by the function (discussed in section 2). na.rm efficiently removes missing values and is TRUE by default. use.g.names generates new row-names from the unique combinations of groups (default: disabled), whereas keep.group_vars (default: enabled) keeps the grouping columns, as is customary in the native data %>% group_by(...) %>% summarize(...) workflow in dplyr. Finally, keep.w regulates whether a weighting variable used is also aggregated and saved in a column. For fmean, fvar and fsd this will compute the sum of the weights in each group, whereas fmode will return the maximum weight (corresponding to the mode) in each group.

With that in mind, let’s consider some straightforward applications:

1.1 Simple Aggregations

Consider the Groningen Growth and Development Center 10-Sector Database included in collapse:
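(The examples in this vignette assume that both packages are attached; a quick look at the data, output omitted here:)

library(collapse)
library(dplyr)

head(GGDC10S, 3)  # sectoral value added (VA) and employment (EMP) data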

Simple column-wise computations using the fast functions and pipe operators are performed as follows:
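(A minimal illustration; output is omitted here.)

GGDC10S %>% select_at(6:16) %>% fmean    # overall column means
GGDC10S %>% select_at(6:16) %>% fNobs    # number of non-missing observations per column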

Moving on to grouped statistics, we can compute the average value added and employment by sector and country using:
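(Output omitted here.)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fmean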

Similarly we can obtain the median or the standard deviation:

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fmedian
# # A tibble: 85 x 13
#    Variable Country     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH    SUM
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#  1 EMP      ARG       1325.   47.4   1988.  1.05e2 7.82e2 1.85e3 5.80e2  464.   1739.   866.  9.74e3
#  2 EMP      BOL        943.   53.5    167.  4.46e0 6.60e1 1.32e2 9.70e1   15.3    NA    384.  1.84e3
#  3 EMP      BRA      17481.  225.    7208.  3.76e2 4.05e3 6.45e3 1.58e3 4355.   4450.  4479.  5.19e4
#  4 EMP      BWA        175.   12.2     13.1 3.71e0 1.90e1 2.11e1 6.75e0   10.4    53.8   31.2 3.61e2
#  5 EMP      CHL        690.   93.9    607.  2.58e1 2.30e2 4.84e2 2.05e2  106.     NA    900.  3.31e3
#  6 EMP      CHN     293915  8150.   61761.  1.14e3 1.06e4 1.70e4 9.56e3 4328.  19468.  9954.  4.45e5
#  7 EMP      COL       3006.   84.0   1033.  3.71e1 4.19e2 1.55e3 3.91e2  655.     NA   1430.  8.63e3
#  8 EMP      CRI        216.    1.49   114.  7.92e0 5.50e1 8.98e1 2.55e1   19.6   122.    60.6 7.19e2
#  9 EMP      DEW       2178   320.    8459.  2.47e2 2.10e3 4.45e3 1.53e3 1656    3700    900   2.65e4
# 10 EMP      DNK        187.    3.75   508.  1.36e1 1.65e2 4.61e2 1.61e2  169.    642.   104.  2.42e3
# # ... with 75 more rows

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fsd
# # A tibble: 85 x 13
#    Variable Country     AGR      MIN    MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH    SUM
#    <chr>    <chr>     <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#  1 EMP      ARG       242.    19.1   1.73e2 2.23e1 2.78e2 8.33e2 1.81e2  453.   1073.  4.89e2 3.05e3
#  2 EMP      BOL        92.6   18.2   1.50e2 4.48e0 1.08e2 2.88e2 8.96e1   58.7    NA   2.49e2 9.47e2
#  3 EMP      BRA      1975.    83.1   3.28e3 1.14e2 2.04e3 6.35e3 1.28e3 3144.   3787.  4.33e3 2.52e4
#  4 EMP      BWA        31.3    4.70  1.52e1 1.91e0 1.90e1 3.69e1 6.09e0   13.4    42.2 1.14e1 1.67e2
#  5 EMP      CHL        68.6   32.3   1.38e2 1.12e1 1.81e2 5.09e2 1.30e2  286.     NA   4.18e2 1.61e3
#  6 EMP      CHN     64477.  3450.    4.23e4 1.27e3 1.90e4 2.41e4 9.40e3 2910.  11973.  3.54e4 1.95e5
#  7 EMP      COL       710.   127.    6.00e2 1.68e1 3.76e2 1.82e3 3.30e2  409.     NA   9.96e2 5.27e3
#  8 EMP      CRI        48.7    0.876 8.78e1 1.30e1 3.54e1 1.53e2 3.93e1   68.8    82.4 4.29e1 5.46e2
#  9 EMP      DEW      1220.   188.    7.93e2 5.39e1 2.07e2 6.34e2 2.21e2  704.   1241.  3.55e2 2.23e3
# 10 EMP      DNK       144.     7.96  7.67e1 1.96e0 2.63e1 5.43e1 1.61e1   91.1   255.  1.91e1 2.23e2
# # ... with 75 more rows

It is important not to use dplyr’s summarize together with the fast functions, since that would eliminate their speed gain entirely. These functions are fast because they are executed only once and carry out the grouped computations in C++, whereas summarize will split the data and then apply the function to each group in the grouped tibble. (summarize will also work with the fast functions, but it is then slower than using primitive base functions, since the fast functions are S3 generic.)


Excursus: What is Happening Behind the Scenes?

To drive this point home it is perhaps good to shed some light on what is happening behind the scenes of dplyr and collapse. Fundamentally both packages follow different computing paradigms:

dplyr is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of dplyr. When a data.frame is passed through group_by, a ‘groups’ attribute is attached:
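(A minimal sketch to inspect this attribute; output omitted:)

GGDC10S %>% group_by(Variable,Country) %>% attr("groups") %>% str(give.attr = FALSE)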

This object is a data.frame giving the unique groups and in the third (last) column vectors containing the indices of the rows belonging to that group. A command like summarize uses this information to split the data.frame into groups which are then passed sequentially to the function used and later recombined.

Now collapse is based around one-pass grouped computations at the C++ level, in other words the data is not split and recombined but the entire computation is performed in a single C++ loop running through that data and completing the computations for each group simultaneously. This modus operandi is also evident in collapse grouping objects. The method GRP.grouped_df takes a dplyr grouping object from a grouped tibble and efficiently converts it to a collapse grouping object:
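(A minimal sketch; output omitted:)

GGDC10S %>% group_by(Variable,Country) %>% GRP %>% str(give.attr = FALSE)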

This object is a list where the first three elements give the number of groups, the group-id to which each row belongs, and a vector of group-sizes. A function like fsum uses this information to (for each column) create a result vector of size ‘N.groups’ and then run through the column, using the ‘group.id’ vector to add the i’th data point to the ’group.id[i]’th element of the result vector. When the loop is finished, the grouped computation is also finished.
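The following plain-R sketch only illustrates this principle for a single numeric column and a single statistic; the actual implementation is a single C++ loop per column that additionally handles missing values and different data types:

# Illustrative only: one-pass grouped sum of a numeric vector x,
# given integer group ids g (1..ng), as in the 'group.id' element of a GRP object
group_sum <- function(x, g, ng) {
  out <- numeric(ng)                                     # one result slot per group
  for (i in seq_along(x)) out[g[i]] <- out[g[i]] + x[i]  # add the i'th point to its group's slot
  out
}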

It is clear that collapse is prone to be faster than dplyr since its method of computing involves fewer (and computationally less intensive) steps. A slight qualifier to this is the additional conversion cost incurred by GRP.grouped_df when using the fast functions on grouped tibbles. This cost is however quite low, as the benchmarks at the end show (since GRP.grouped_df is also implemented in C++).


1.2 Multi-Function Aggregations

One can also aggregate with multiple functions at the same time, but then the programming becomes a bit verbose. In particular, one needs to use curly braces { to prevent first argument injection so that %>% cbind(FUN1(.), FUN2(.)) does not evaluate as %>% cbind(., FUN1(.), FUN2(.)):
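(A sketch of such a call; output omitted:)

GGDC10S %>%
  group_by(Variable,Country) %>%
  select_at(6:16) %>% {
    cbind(fmean(.),
          add_stub(fsd(., keep.group_vars = FALSE), "sd_"))
  } %>% head(3)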

The function add_stub used above is a collapse function adding a prefix (default) or suffix to variable names.

A slightly more elegant solution to such multi-function aggregations can be found using get_vars, a collapse predicate to efficiently select variables. In contrast to select_at, get_vars does not automatically add the grouping columns to the selection. Next to get_vars, collapse also introduces the predicates num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to select columns by type. Finally, the predicate add_vars provides a more efficient alternative to cbind.data.frame. The idea here is ‘adding’ variables to the data.frame in the first argument i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame:
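(A sketch of such an expression; output omitted:)

GGDC10S %>%
  group_by(Variable,Country) %>% {
    add_vars(group_keys(.),
             add_stub(fmean(get_vars(., 6:16)), "mean_"),
             add_stub(fsd(get_vars(., 6:16)), "sd_"))
  } %>% head(3)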

Another nice feature of add_vars is that it can also very efficiently reorder columns i.e. bind columns in a different order than they are passed. This can be done by simply specifying the positions the added columns should have in the final data.frame, and then add_vars shifts the first argument columns (the group_keys in the example below) to the right to fill in the gaps.
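A minimal sketch of this idea (output omitted; the pos argument and the positions passed to it below are illustrative assumptions about how the placement of the added columns is specified):

GGDC10S %>%
  group_by(Variable,Country) %>% {
    add_vars(group_keys(.),
             add_stub(fmean(get_vars(., "SUM")), "mean_"),
             add_stub(fsd(get_vars(., "SUM")), "sd_"),
             pos = c(2, 4))   # assumed: positions of the added columns in the final data.frame
  } %>% head(3)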

A much more compact solution to multi-function and multi-type aggregation with dplyr is offered by the function collapg:
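(A minimal illustration; output omitted:)

GGDC10S %>% group_by(Variable,Country) %>% collapg %>% head(3)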

By default it aggregates numeric columns using the mean and categorical columns using the mode, and preserves the order of all columns. Changing these defaults is very easy:
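(A sketch, assuming the FUN and catFUN arguments of collapg; output omitted:)

GGDC10S %>% group_by(Variable,Country) %>% collapg(FUN = fmedian, catFUN = flast) %>% head(3)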

One can apply multiple functions to both numeric and/or categorical data:
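(A sketch, assuming lists of functions can be passed to FUN and catFUN; output omitted:)

GGDC10S %>% group_by(Variable,Country) %>%
  collapg(FUN = list(fmean, fsd), catFUN = list(fmode, fNdistinct)) %>% head(3)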

Applying multiple functions to only numeric (or only categorical) data allows return in a long format:
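(A sketch; the cols and return arguments used below are assumptions about the collapg interface; output omitted:)

GGDC10S %>% group_by(Variable,Country) %>%
  collapg(FUN = list(fmean, fsd), cols = 6:16, return = "long") %>% head(3)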

Finally, collapg also makes it very easy to apply aggregator functions to certain columns only:
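(A sketch using an assumed custom argument mapping functions to column indices of GGDC10S; the chosen columns are only illustrative, output omitted:)

GGDC10S %>% group_by(Variable,Country) %>%
  collapg(custom = list(fmean = 6:8, fmedian = 9:12, fmode = 2:3)) %>% head(3)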

To understand more about collapg, look it up in the documentation (?collapg).

1.3 Weighted Aggregations

Weighted aggregations are possible with the functions fmean, fmode, fvar and fsd. The implementation is such that by default (option keep.w = TRUE) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. fmean, fsd and fvar compute a grouped sum of the weight column and place it next to the group-identifiers, whereas fmode computes the maximum weight (corresponding to the mode).

# This computes a frequency-weighted grouped standard-deviation, taking the total EMP / VA as weight
GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fsd(SUM)
# # A tibble: 85 x 13
#    Variable Country  sum.SUM     AGR    MIN    MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH
#    <chr>    <chr>      <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
#  1 EMP      ARG       6.54e5   225.  2.22e1 1.76e2 2.05e1 2.85e2 8.56e2 1.95e2  493.   1123.  5.06e2
#  2 EMP      BOL       1.35e5    99.7 1.71e1 1.68e2 4.87e0 1.23e2 3.24e2 9.81e1   69.8    NA   2.58e2
#  3 EMP      BRA       3.36e6  1587.  7.38e1 2.95e3 9.38e1 1.86e3 6.28e3 1.31e3 3003.   3621.  4.26e3
#  4 EMP      BWA       1.85e4    32.2 3.72e0 1.48e1 1.59e0 1.80e1 3.87e1 6.02e0   13.5    39.8 8.94e0
#  5 EMP      CHL       2.51e5    71.0 3.99e1 1.29e2 1.24e1 1.88e2 5.51e2 1.34e2  313.     NA   4.26e2
#  6 EMP      CHN       2.91e7 56281.  3.09e3 4.04e4 1.27e3 1.92e4 2.45e4 9.26e3 2853.  11541.  3.74e4
#  7 EMP      COL       6.03e5   637.  1.48e2 5.94e2 1.52e1 3.97e2 1.89e3 3.62e2  435.     NA   1.01e3
#  8 EMP      CRI       5.50e4    40.4 1.04e0 7.93e1 1.37e1 3.44e1 1.68e2 4.53e1   79.8    80.7 4.34e1
#  9 EMP      DEW       1.10e6  1175.  1.83e2 7.42e2 5.32e1 1.94e2 6.06e2 2.12e2  699.   1225.  3.55e2
# 10 EMP      DNK       1.53e5   139.  7.45e0 7.73e1 1.92e0 2.56e1 5.33e1 1.57e1   91.6   248.  1.95e1
# # ... with 75 more rows

# This computes a weighted grouped mode, taking the total EMP / VA as weight
GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fmode(SUM)
# # A tibble: 85 x 13
#    Variable Country max.SUM     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE    GOV    OTH
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 EMP      ARG      17929.  1.16e3  127.    2.16e3 1.52e2 1.41e3  3768. 1.06e3 1.75e3  4336. 2.00e3
#  2 EMP      BOL       4508.  8.19e2   37.6   6.04e2 1.08e1 4.33e2   893. 3.33e2 3.21e2    NA  1.06e3
#  3 EMP      BRA     102572.  1.65e4  313.    1.18e4 3.88e2 8.15e3 21860. 5.17e3 1.20e4 12149. 1.42e4
#  4 EMP      BWA        668.  1.71e2   13.1   4.33e1 3.93e0 1.81e1   129. 2.10e1 4.67e1   113. 2.62e1
#  5 EMP      CHL       7559.  6.30e2  249.    7.42e2 6.07e1 6.71e2  1989. 4.81e2 8.54e2    NA  1.88e3
#  6 EMP      CHN     764200   2.66e5 9247.    1.43e5 3.53e3 6.99e4 84165. 3.12e4 1.08e4 43240. 1.03e5
#  7 EMP      COL      21114.  3.93e3  513.    2.37e3 5.89e1 1.41e3  6069. 1.36e3 1.82e3    NA  3.57e3
#  8 EMP      CRI       2058.  2.83e2    2.42  2.49e2 4.38e1 1.20e2   489. 1.44e2 2.25e2   328. 1.75e2
#  9 EMP      DEW      31261   1.03e3  260     8.73e3 2.91e2 2.06e3  4398  1.63e3 3.26e3  6129  1.79e3
# 10 EMP      DNK       2823.  7.85e1    3.12  3.99e2 1.14e1 1.95e2   579. 1.87e2 3.82e2   835. 1.50e2
# # ... with 75 more rows

The weighted variance / standard deviation is currently only implemented with frequency weights. Reliability weights may be implemented in a further update of collapse, if this is a strongly requested feature.

Weighted aggregations may also be performed with collapg, although this does not aggregate and save the weights.
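For example, a sketch assuming the weight column can be passed unquoted to collapg as with the fast functions (output omitted):

GGDC10S %>% group_by(Variable,Country) %>% collapg(w = SUM) %>% head(3)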

Thus to aggregate the entire data and save the weights one would need to opt for a manual solution:
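A minimal sketch of such a manual solution, restricted to the numeric columns (output omitted; keep.w = FALSE avoids saving the aggregated weights twice):

GGDC10S %>%
  group_by(Variable,Country) %>%
  select_at(6:16) %>% {
    add_vars(fmean(., SUM, keep.w = FALSE),                       # group keys + weighted means
             fsum(get_vars(., "SUM"), keep.group_vars = FALSE))   # aggregated (summed) weights
  } %>% head(3)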

1.4 Benchmarks

Below I provide a set of benchmarks for the standard set of functions commonly used in aggregations. For this purpose I duplicate and row-bind the GGDC10S dataset used so far 200 times to yield a dataset of approx. 1 million observations, while keeping the groups unique. The Windows laptop on which these benchmarks were run has a 2x 2.2 GHz Intel i5 processor, 8GB DDR3 RAM and a Samsung SSD hard drive (so a decent laptop but nothing fancy).

# This replicates the data 200 times while keeping Country and Variable (columns 1 and 4) unique
data <- replicate(200, GGDC10S, simplify = FALSE)
# gv and gv<- are shortcuts for get_vars and get_vars<-
uniquify <- function(x, i) `gv<-`(x, c(1,4), value = lapply(gv(x, c(1,4)), paste0, i))
data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE)

dim(data)
# [1] 1005400      16
GRP(data, c(1,4))$N.groups # This shows the number of groups. 
# [1] 17000

# Grouping: This is still a key bottleneck of dplyr compared to data.table and collapse
system.time(group_by(data,Variable,Country))
#    user  system elapsed 
#    0.14    0.00    0.14
system.time(GRP(data, c(1,4)))               
#    user  system elapsed 
#    0.04    0.00    0.05

library(microbenchmark)

# Selection 
microbenchmark(select_at(data, 6:16))
# Unit: milliseconds
#                   expr     min       lq     mean   median       uq      max neval
#  select_at(data, 6:16) 11.5846 11.74948 12.32206 11.99961 12.48735 15.25186   100
microbenchmark(get_vars(data, 6:16))
# Unit: microseconds
#                  expr   min    lq    mean median    uq    max neval
#  get_vars(data, 6:16) 7.586 8.479 9.07241  8.479 8.925 44.178   100

data <- data %>% group_by(Variable,Country) %>% select_at(6:16)

# Conversion of Grouping object: This time is also required in all computations below using collapse fast functions
microbenchmark(GRP(data)) 
# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval
#  GRP(data) 2.947021 4.238463 4.817924 4.545704 4.588767 25.67264   100

# Sum 
system.time(fsum(data))
#    user  system elapsed 
#    0.04    0.00    0.04
system.time(summarise_all(data, sum, na.rm = TRUE))
#    user  system elapsed 
#     0.1     0.0     0.1

# Product
system.time(fprod(data))
#    user  system elapsed 
#    0.05    0.00    0.04
system.time(summarise_all(data, prod, na.rm = TRUE))
#    user  system elapsed 
#    0.45    0.00    0.45

# Mean
system.time(fmean(data))
#    user  system elapsed 
#    0.05    0.00    0.04
system.time(summarise_all(data, mean, na.rm = TRUE))
#    user  system elapsed 
#    1.92    0.01    1.94

# Weighted Mean
system.time(fmean(data, SUM)) # This cannot easily be performed in dplyr
#    user  system elapsed 
#    0.07    0.00    0.06

# Median
system.time(fmedian(data))
#    user  system elapsed 
#    0.08    0.00    0.08
system.time(summarise_all(data, median, na.rm = TRUE))
#    user  system elapsed 
#    8.72    0.00    8.72

# Standard-Deviation
system.time(fsd(data))
#    user  system elapsed 
#    0.08    0.01    0.09
system.time(summarise_all(data, sd, na.rm = TRUE))
#    user  system elapsed 
#    3.18    0.00    3.17

# Weighted Standard-Deviation
system.time(fsd(data, SUM))
#    user  system elapsed 
#    0.08    0.00    0.07

# Maximum
system.time(fmax(data))
#    user  system elapsed 
#    0.03    0.00    0.03
system.time(summarise_all(data, max, na.rm = TRUE))
#    user  system elapsed 
#    0.04    0.00    0.05

# First Value
system.time(ffirst(data, na.rm = FALSE))
#    user  system elapsed 
#    0.03    0.00    0.03
system.time(summarise_all(data, first))
#    user  system elapsed 
#    0.60    0.00    0.59

# Distinct Values
system.time(fNdistinct(data))
#    user  system elapsed 
#    0.25    0.08    0.33
system.time(summarise_all(data, n_distinct, na.rm = TRUE))
#    user  system elapsed 
#    2.33    0.00    2.33

# Mode
system.time(fmode(data))
#    user  system elapsed 
#    0.23    0.11    0.34

# Weighted Mode
system.time(fmode(data, SUM))
#    user  system elapsed 
#    0.36    0.11    0.47

The benchmarks show that at this data size efficient primitives like base::sum or base::max can still deliver very decent performance with summarize. Less optimized base functions like mean, median and sd however take multiple seconds to compute, and here collapse fast functions really prove to be very useful complements to the dplyr system.

Weighted statistics are also computed extremely fast by collapse functions. I am not aware of a straightforward way to compute weighted statistics by groups in dplyr, as it would require the weighting variable to be split alongside the data columns, which does not seem possible with summarise_all in native dplyr.

A further highlight of collapse is the extremely fast statistical mode function, which can also compute a weighted mode. Fast categorical aggregation has long been an issue in R: defining a mode function in base R and applying it to 17000 groups would likely take at least a minute to run. fmode reduces this time to half a second.

Thus in terms of data aggregation collapse fast functions are able to speed up dplyr to a level that makes it attractive again to R users working on medium-sized or larger data, and everyone programming with dplyr. I however strongly recommend collapse itself for easy and speedy programming as it does not rely on non-standard evaluation and has less R-overhead than dplyr.

In all of this, the grouping system of dplyr remains the central bottleneck. For example, grouping 10 million observations in 1 million groups takes around 10 seconds with group_by, whereas GRP takes around 1.5 seconds, and this difference continues to grow as data get larger. Rewriting group_by using GRP / data.table’s forderv and then writing a simple C++ conversion program for the grouping object could be a quick remedy for this issue, but that is at the discretion of Hadley Wickham and coauthors.

2. Fast Transformations

Fast aggregation’s are just the tip of the iceberg compared to what collapse can bring to dplyr in terms of grouped transformations.

2.1 Replacing and Sweeping out Statistics

All statistical (scalar-valued) functions in the collapse package (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct) have a TRA argument which can be used to efficiently transform data by either (column-wise) replacing data values with supplied statistics or sweeping the statistics out of the data. Operations can be specified using either an integer or a quoted operator / string. The 8 operations supported by TRA are:

  1. "replace_fill" - replace data, including missing values, with the computed statistics

  2. "replace" - replace data but preserve missing values

  3. "-" - subtract (i.e. center)

  4. "-+" - subtract group statistics and add back the overall average statistic (i.e. center on the overall statistic)

  5. "/" - divide (i.e. scale)

  6. "%" - compute percentages (i.e. divide and multiply by 100)

  7. "+" - add

  8. "*" - multiply

For functions supporting weights (fmean, fmode, fvar and fsd) the TRA argument is in the third position following the data and weight vector (in the grouped_df method), whereas functions not supporting weights have the argument in the second position.

Simple transformations are again straightforward to specify:
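(For example, subtracting the overall (non-grouped) median, or converting the data to percentages of the overall column totals; output omitted:)

GGDC10S %>% select_at(6:16) %>% fmedian(TRA = "-") %>% head(3)
GGDC10S %>% select_at(6:16) %>% fsum(TRA = "%") %>% head(3)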

We can also easily specify code to demean, scale or compute percentages1 by groups:

# Demeaning sectoral data by Variable and Country (within transformation)
GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fmean(TRA = "-")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country   AGR    MIN   MAN    PU   CON    WRT   TRA   FIRE    GOV   OTH     SUM
#  * <chr>    <chr>   <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>
#  1 VA       BWA       NA     NA    NA    NA    NA     NA    NA     NA     NA    NA      NA 
#  2 VA       BWA       NA     NA    NA    NA    NA     NA    NA     NA     NA    NA      NA 
#  3 VA       BWA       NA     NA    NA    NA    NA     NA    NA     NA     NA    NA      NA 
#  4 VA       BWA       NA     NA    NA    NA    NA     NA    NA     NA     NA    NA      NA 
#  5 VA       BWA     -446. -4505. -941. -216. -895. -1942. -634. -1358. -2368. -771. -14074.
#  6 VA       BWA     -446. -4506. -941. -216. -894. -1941. -633. -1357. -2367. -770. -14072.
#  7 VA       BWA     -444. -4507. -941. -216. -894. -1940. -633. -1357. -2366. -770. -14069.
#  8 VA       BWA     -443. -4506. -941. -216. -894. -1944. -634. -1357. -2366. -770. -14070.
#  9 VA       BWA     -441. -4507. -941. -216. -894. -1943. -633. -1358. -2368. -771. -14071.
# 10 VA       BWA     -440. -4503. -939. -216. -892. -1942. -633. -1357. -2367. -770. -14061.
# # ... with 5,017 more rows

# Scaling sectoral data by Variable and Country
GGDC10S %>% 
  group_by(Variable,Country) %>%
    select_at(6:16) %>% fsd(TRA = "/")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country     AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE      GOV
#  * <chr>    <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#  1 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  2 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  3 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  4 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  5 VA       BWA      0.0270  5.56e-4  5.23e-4  3.88e-4  5.11e-4  0.00194  0.00154  5.23e-4  0.00134
#  6 VA       BWA      0.0260  3.97e-4  7.23e-4  5.03e-4  1.04e-3  0.00220  0.00180  5.83e-4  0.00158
#  7 VA       BWA      0.0293  3.13e-4  5.71e-4  7.54e-4  1.04e-3  0.00257  0.00200  6.35e-4  0.00176
#  8 VA       BWA      0.0317  3.66e-4  6.66e-4  7.54e-4  6.94e-4  0.00134  0.00160  7.19e-4  0.00195
#  9 VA       BWA      0.0349  2.93e-4  5.33e-4  7.54e-4  9.42e-4  0.00161  0.00227  4.83e-4  0.00139
# 10 VA       BWA      0.0362  8.34e-4  1.52e-3  2.15e-3  2.69e-3  0.00179  0.00253  5.77e-4  0.00155
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>

# Computing percentages of sectoral data by Variable and Country
GGDC10S %>% 
  group_by(Variable,Country) %>%
    select_at(6:16) %>% fsum("%")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country     AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE      GOV
#  * <chr>    <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#  1 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  2 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  3 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  4 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
#  5 VA       BWA      0.0750  1.65e-3  0.00166  0.00103  0.00157  0.00682  0.00556  0.00175  0.00432
#  6 VA       BWA      0.0724  1.18e-3  0.00230  0.00133  0.00320  0.00772  0.00649  0.00195  0.00511
#  7 VA       BWA      0.0814  9.30e-4  0.00182  0.00199  0.00320  0.00903  0.00722  0.00213  0.00571
#  8 VA       BWA      0.0881  1.08e-3  0.00212  0.00199  0.00213  0.00471  0.00577  0.00241  0.00631
#  9 VA       BWA      0.0971  8.68e-4  0.00170  0.00199  0.00289  0.00565  0.00818  0.00162  0.00451
# 10 VA       BWA      0.101   2.47e-3  0.00483  0.00568  0.00825  0.00628  0.00910  0.00193  0.00501
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>

Weighted demeaning and scaling can be computed using:

# Weighted demeaning (within transformation)
GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fmean(SUM, "-")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country   SUM    AGR     MIN    MAN    PU    CON    WRT    TRA   FIRE    GOV    OTH
#  * <chr>    <chr>   <dbl>  <dbl>   <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
#  2 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
#  3 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
#  4 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
#  5 VA       BWA      37.5 -1301. -13317. -2965. -529. -2746. -6540. -2157. -4431. -7551. -2613.
#  6 VA       BWA      39.3 -1302. -13318. -2964. -529. -2745. -6540. -2156. -4431. -7550. -2613.
#  7 VA       BWA      43.1 -1300. -13319. -2965. -528. -2745. -6538. -2156. -4431. -7550. -2612.
#  8 VA       BWA      41.4 -1298. -13318. -2964. -528. -2746. -6542. -2156. -4431. -7549. -2612.
#  9 VA       BWA      41.1 -1296. -13319. -2965. -528. -2745. -6541. -2156. -4431. -7551. -2613.
# 10 VA       BWA      51.2 -1296. -13315. -2963. -528. -2743. -6541. -2155. -4431. -7550. -2613.
# # ... with 5,017 more rows

# Weighted scaling
GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fsd(SUM, "/")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country   SUM     AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE
#  * <chr>    <chr>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#  1 VA       BWA      NA   NA      NA       NA       NA       NA       NA       NA       NA      
#  2 VA       BWA      NA   NA      NA       NA       NA       NA       NA       NA       NA      
#  3 VA       BWA      NA   NA      NA       NA       NA       NA       NA       NA       NA      
#  4 VA       BWA      NA   NA      NA       NA       NA       NA       NA       NA       NA      
#  5 VA       BWA      37.5  0.0221  5.29e-4  4.49e-4  4.71e-4  4.56e-4  0.00155  0.00117  4.63e-4
#  6 VA       BWA      39.3  0.0214  3.78e-4  6.21e-4  6.10e-4  9.30e-4  0.00175  0.00137  5.15e-4
#  7 VA       BWA      43.1  0.0240  2.98e-4  4.90e-4  9.15e-4  9.30e-4  0.00205  0.00152  5.62e-4
#  8 VA       BWA      41.4  0.0260  3.48e-4  5.72e-4  9.15e-4  6.20e-4  0.00107  0.00122  6.35e-4
#  9 VA       BWA      41.1  0.0287  2.78e-4  4.57e-4  9.15e-4  8.41e-4  0.00128  0.00173  4.27e-4
# 10 VA       BWA      51.2  0.0297  7.93e-4  1.30e-3  2.61e-3  2.40e-3  0.00143  0.00192  5.10e-4
# # ... with 5,017 more rows, and 2 more variables: GOV <dbl>, OTH <dbl>

Alternatively we could also replace data points with their groupwise weighted mean or standard deviation:

# This conducts a weighted between transformation (replacing with weighted mean)
GGDC10S %>% 
  group_by(Variable,Country) %>%
    select_at(6:16) %>% fmean(SUM, "replace")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country   SUM   AGR    MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH
#  * <chr>    <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  2 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  3 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  4 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  5 VA       BWA      37.5 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  6 VA       BWA      39.3 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  7 VA       BWA      43.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  8 VA       BWA      41.4 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  9 VA       BWA      41.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA       BWA      51.2 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows

# This also replaces missing values in each group
GGDC10S %>% 
  group_by(Variable,Country) %>%
    select_at(6:16) %>% fmean(SUM, "replace_fill")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country   SUM   AGR    MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH
#  * <chr>    <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  2 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  3 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  4 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  5 VA       BWA      37.5 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  6 VA       BWA      39.3 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  7 VA       BWA      43.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  8 VA       BWA      41.4 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  9 VA       BWA      41.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA       BWA      51.2 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows

It is also possible to center data points on the global mean, which is achieved by subtracting out group means and adding the overall mean of the data again:
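(Output omitted here.)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fmean(TRA = "-+") %>% head(3)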

Sequential operations such as scaling and then centering are also easily performed:
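(For example, scaling and then centering by groups; output omitted:)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fsd(TRA = "/") %>% fmean(TRA = "-") %>% head(3)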

Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data, as shown below:
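A sketch of adding group-demeaned versions of two sectors to the original data (output omitted; the chosen columns and the "demeaned_" prefix are only illustrative):

GGDC10S %>% 
  group_by(Variable,Country) %>% {
    add_vars(., add_stub(fmean(get_vars(., c("AGR","MAN")), TRA = "-"), "demeaned_"))
  } %>% head(3)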

Certainly there are lots of other examples one could construct using the 8 operations and 13 functions listed above; the examples provided just outline the suggested programming basics.

2.2 More Control using the TRA Function

Behind the scenes of the TRA = ... argument, the fast functions first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to either replace data points or sweep them out of the data in one of the 8 ways described above. This function can however also be called directly under the name TRA (shorthand for ‘transforming’ data by replacing or sweeping out statistics). Fundamentally, TRA is a generalization of base::sweep for column-wise grouped operations2. Direct calls to TRA enable more control over inputs and outputs.

The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs:

# This divides by the product
GGDC10S %>% 
  group_by(Variable,Country) %>%
    select_at(6:16) %>% fprod(TRA = "/")
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country        AGR        MIN        MAN        PU        CON        WRT       TRA
#  * <chr>    <chr>        <dbl>      <dbl>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>
#  1 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  2 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  3 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  4 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  5 VA       BWA      1.29e-105  2.81e-127  1.40e-101  4.44e-74  4.19e-102  3.97e-113  6.91e-92
#  6 VA       BWA      1.24e-105  2.00e-127  1.94e-101  5.75e-74  8.55e-102  4.49e-113  8.08e-92
#  7 VA       BWA      1.39e-105  1.58e-127  1.53e-101  8.62e-74  8.55e-102  5.26e-113  8.98e-92
#  8 VA       BWA      1.51e-105  1.85e-127  1.78e-101  8.62e-74  5.70e-102  2.74e-113  7.18e-92
#  9 VA       BWA      1.66e-105  1.48e-127  1.43e-101  8.62e-74  7.74e-102  3.29e-113  1.02e-91
# 10 VA       BWA      1.72e-105  4.21e-127  4.07e-101  2.46e-73  2.21e-101  3.66e-113  1.13e-91
# # ... with 5,017 more rows, and 4 more variables: FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>

# Same thing 
GGDC10S %>% 
  group_by(Variable,Country) %>%
    select_at(6:16) %>% TRA(fprod(.),"/") # [same as TRA(.,fprod(.),"/")]
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country        AGR        MIN        MAN        PU        CON        WRT       TRA
#  * <chr>    <chr>        <dbl>      <dbl>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>
#  1 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  2 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  3 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  4 VA       BWA     NA         NA         NA         NA        NA         NA         NA       
#  5 VA       BWA      1.29e-105  2.81e-127  1.40e-101  4.44e-74  4.19e-102  3.97e-113  6.91e-92
#  6 VA       BWA      1.24e-105  2.00e-127  1.94e-101  5.75e-74  8.55e-102  4.49e-113  8.08e-92
#  7 VA       BWA      1.39e-105  1.58e-127  1.53e-101  8.62e-74  8.55e-102  5.26e-113  8.98e-92
#  8 VA       BWA      1.51e-105  1.85e-127  1.78e-101  8.62e-74  5.70e-102  2.74e-113  7.18e-92
#  9 VA       BWA      1.66e-105  1.48e-127  1.43e-101  8.62e-74  7.74e-102  3.29e-113  1.02e-91
# 10 VA       BWA      1.72e-105  4.21e-127  4.07e-101  2.46e-73  2.21e-101  3.66e-113  1.13e-91
# # ... with 5,017 more rows, and 4 more variables: FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>

TRA.grouped_df was designed such that it matches the columns of statistics (aggregated columns) to those of the original data, and only transforms matching columns while returning the whole data.frame. Thus it is easily possible to only apply a transformation to the first two sectors:
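(For example, demeaning only the first two sectors; output omitted:)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% TRA(fmean(get_vars(., c("AGR","MIN"))), "-") %>% head(3)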

Another potential use of TRA is to perform computations in two or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations between the aggregating and the sweeping part:

# Get grouped tibble
gGGDC <- GGDC10S %>% group_by(Variable,Country)

# Get aggregated data
gsumGGDC <- gGGDC %>% select_at(6:16) %>% fsum
gsumGGDC
# # A tibble: 85 x 13
#    Variable Country     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH    SUM
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#  1 EMP      ARG      8.80e4   3230.  1.20e5  6307. 4.60e4 1.23e5 4.02e4 3.89e4  1.27e5 6.15e4 6.54e5
#  2 EMP      BOL      5.88e4   3418.  1.43e4   326. 7.49e3 1.72e4 7.04e3 2.72e3 NA      2.41e4 1.35e5
#  3 EMP      BRA      1.07e6  12773.  4.33e5 22604. 2.19e5 5.28e5 1.27e5 2.74e5  3.29e5 3.54e5 3.36e6
#  4 EMP      BWA      8.84e3    493.  8.49e2   145. 1.19e3 1.71e3 3.93e2 7.21e2  2.87e3 1.30e3 1.85e4
#  5 EMP      CHL      4.42e4   6389.  3.94e4  1850. 1.86e4 4.38e4 1.63e4 1.72e4 NA      6.32e4 2.51e5
#  6 EMP      CHN      1.73e7 422972.  4.03e6 96364. 1.25e6 1.73e6 8.36e5 2.96e5  1.36e6 1.86e6 2.91e7
#  7 EMP      COL      1.89e5   8843.  7.17e4  2068. 3.20e4 1.26e5 2.86e4 3.96e4 NA      1.06e5 6.03e5
#  8 EMP      CRI      1.43e4    106.  8.44e3   884. 3.57e3 9.71e3 2.63e3 3.40e3  7.94e3 4.04e3 5.50e4
#  9 EMP      DEW      1.05e5  17083.  3.56e5  9499. 8.79e4 1.87e5 6.23e4 7.09e4  1.66e5 4.20e4 1.10e6
# 10 EMP      DNK      1.51e4    514.  3.25e4   881. 1.10e4 2.91e4 1.03e4 1.16e4  3.51e4 7.13e3 1.53e5
# # ... with 75 more rows

# Get transformed (scaled) data 
TRA(gGGDC, gsumGGDC, "/")
# # A tibble: 5,027 x 16
# # Groups:   Variable, Country [85]
#    Country Regioncode Region Variable  Year      AGR      MIN      MAN       PU      CON      WRT
#  * <chr>   <chr>      <chr>  <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#  1 BWA     SSA        Sub-s~ VA        1960 NA       NA       NA       NA       NA       NA      
#  2 BWA     SSA        Sub-s~ VA        1961 NA       NA       NA       NA       NA       NA      
#  3 BWA     SSA        Sub-s~ VA        1962 NA       NA       NA       NA       NA       NA      
#  4 BWA     SSA        Sub-s~ VA        1963 NA       NA       NA       NA       NA       NA      
#  5 BWA     SSA        Sub-s~ VA        1964  7.50e-4  1.65e-5  1.66e-5  1.03e-5  1.57e-5  6.82e-5
#  6 BWA     SSA        Sub-s~ VA        1965  7.24e-4  1.18e-5  2.30e-5  1.33e-5  3.20e-5  7.72e-5
#  7 BWA     SSA        Sub-s~ VA        1966  8.14e-4  9.30e-6  1.82e-5  1.99e-5  3.20e-5  9.03e-5
#  8 BWA     SSA        Sub-s~ VA        1967  8.81e-4  1.08e-5  2.12e-5  1.99e-5  2.13e-5  4.71e-5
#  9 BWA     SSA        Sub-s~ VA        1968  9.71e-4  8.68e-6  1.70e-5  1.99e-5  2.89e-5  5.65e-5
# 10 BWA     SSA        Sub-s~ VA        1969  1.01e-3  2.47e-5  4.83e-5  5.68e-5  8.25e-5  6.28e-5
# # ... with 5,017 more rows, and 5 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>,
# #   SUM <dbl>

I have already noted above that, whether using the TRA argument to the fast statistical functions or calling TRA directly, these data transformations are essentially a two-step process: statistics are first computed and then used to transform the original data. This process is already very efficient since all functions are written in C++, and programmatically separating the computation of statistics and the data transformation tasks allows for unlimited combinations and drastically simplifies the code base of this package.

Nonetheless there are of course more memory efficient and faster ways to program such data transformations, which principally involve doing them column-by-column with a single C++ function. To ensure that this package lives up to the highest standards of performance for common uses, I have implemented such slightly more efficient algorithms for the very commonly applied tasks of centering and averaging data by groups (widely known as ‘between’-group and ‘within’-group transformations), and scaling and centering data by groups (also known as ‘standardizing’ data).

2.3 Faster Centering, Averaging and Standardizing

The functions fbetween and fwithin are faster implementations of fmean invoked with different TRA options:

GGDC10S %>% # Same as ... %>% fmean(TRA = "replace")
  group_by(Variable,Country) %>% select_at(6:16) %>% fbetween %>% head(2)
# # A tibble: 2 x 13
# # Groups:   Variable, Country [1]
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

GGDC10S %>% # Same as ... %>% fmean(TRA = "replace_fill")
  group_by(Variable,Country) %>% select_at(6:16) %>% fbetween(fill = TRUE) %>% head(2)
# # A tibble: 2 x 13
# # Groups:   Variable, Country [1]
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH    SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
# 1 VA       BWA      462. 4509.  942.  216.  895. 1948.  635. 1359. 2373.  773. 14112.
# 2 VA       BWA      462. 4509.  942.  216.  895. 1948.  635. 1359. 2373.  773. 14112.

GGDC10S %>% # Same as ... %>% fmean(TRA = "-")
  group_by(Variable,Country) %>% select_at(6:16) %>% fwithin %>% head(2)
# # A tibble: 2 x 13
# # Groups:   Variable, Country [1]
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

GGDC10S %>% # Same as ... %>% fmean(TRA = "-+")
  group_by(Variable,Country) %>% select_at(6:16) %>% fwithin(add.global.mean = TRUE) %>% head(2)
# # A tibble: 2 x 13
# # Groups:   Variable, Country [1]
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Apart from higher speed, there is one additional advantage of using fwithin in particular, which concerns the joint use of weights and the add.global.mean option: ... %>% fmean(w = SUM, TRA = "-+") will not properly group-center the data on the overall weighted mean. Instead, it will group-center the data on a frequency-weighted average of the weighted group-means, thus not taking into account the different aggregated weights attached to those weighted group-means themselves. The reason for this shortcoming is simply that TRA was not designed to take a separate weight vector as input. fwithin(w = SUM, add.global.mean = TRUE) does a better job and properly centers the data on the weighted overall mean after subtracting out weighted group means:

GGDC10S %>% # This does not center data on a properly computed weighted overall mean
  group_by(Variable,Country) %>% select_at(6:16) %>% fmean(SUM, TRA = "-+") 
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country   SUM     AGR     MIN     MAN      PU     CON     WRT     TRA    FIRE     GOV
#  * <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#  1 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  2 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  3 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  4 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  5 VA       BWA      37.5  8.72e6  7.25e6  1.74e7  1.01e6  6.43e6  1.05e7  4.86e6  4.85e6  4.99e6
#  6 VA       BWA      39.3  8.72e6  7.25e6  1.74e7  1.01e6  6.43e6  1.05e7  4.86e6  4.85e6  4.99e6
#  7 VA       BWA      43.1  8.72e6  7.25e6  1.74e7  1.01e6  6.43e6  1.05e7  4.86e6  4.85e6  4.99e6
#  8 VA       BWA      41.4  8.72e6  7.25e6  1.74e7  1.01e6  6.43e6  1.05e7  4.86e6  4.85e6  4.99e6
#  9 VA       BWA      41.1  8.72e6  7.25e6  1.74e7  1.01e6  6.43e6  1.05e7  4.86e6  4.85e6  4.99e6
# 10 VA       BWA      51.2  8.72e6  7.25e6  1.74e7  1.01e6  6.43e6  1.05e7  4.86e6  4.85e6  4.99e6
# # ... with 5,017 more rows, and 1 more variable: OTH <dbl>

GGDC10S %>% # This does a proper job by both subtracting weighted group-means and adding a weighted overall mean
  group_by(Variable,Country) %>% select_at(6:16) %>% fwithin(SUM, add.global.mean = TRUE) 
# # A tibble: 5,027 x 13
# # Groups:   Variable, Country [85]
#    Variable Country   SUM     AGR     MIN     MAN      PU     CON     WRT     TRA    FIRE     GOV
#  * <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#  1 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  2 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  3 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  4 VA       BWA      NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  5 VA       BWA      37.5  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  6 VA       BWA      39.3  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  7 VA       BWA      43.1  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  8 VA       BWA      41.4  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  9 VA       BWA      41.1  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
# 10 VA       BWA      51.2  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
# # ... with 5,017 more rows, and 1 more variable: OTH <dbl>

The sequential scaling and centering ... %>% fsd(TRA = "/") %>% fmean(TRA = "-") shown in an earlier example is also not the best way of doing things. The function fscale does this much quicker in a single step:
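(Output omitted here.)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fscale %>% head(3)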

2.4 Lags / Leads, Differences and Growth Rates

It was suggested some time ago that leaving the best wine for the end is not the best strategy when giving a feast. As far as the marriage of collapse and dplyr is concerned, the 3 functions for time computations introduced in this section combine great flexibility with precision and computing power, and are amongst the highlights of collapse.

The first function, flag, computes sequences of lags and leads on time-series and panel data. fdiff computes sequences of lagged / leaded and iterated differences, and fgrowth computes lagged / leaded and iterated growth rates, obtained either through the exact computation method or through log-differencing. In addition, none of these functions require the data to be sorted: they carry out fast computations on completely unordered data as long as a time variable is supplied that uniquely identifies the data.

Beginning with flag, the following code computes 1 fully-identified panel-lag and 1 fully identified panel-lead of each variable in the data:
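(A minimal illustration; output omitted:)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(5:16) %>% flag(c(1, -1), Year) %>% head(3)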

If the time-variable passed does not exactly identify the data (i.e. because of gaps or repeated values in each group), all 3 functions will issue appropriate error messages. It is also possible to omit the time-variable if one is certain that the data is sorted:
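(Output omitted here.)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% flag %>% head(3)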

fdiff can compute continuous sequences of lagged, leaded and iterated differences. The code below computes the 1 and 10 year first and second differences of each variable in the data:
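(Output omitted here.)

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(5:16) %>% fdiff(c(1, 10), 1:2, Year) %>% head(3)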

Finally, fgrowth computes growth rates in the same way. By default exact growth rates are computed, but the user can also request growth rates obtained by log-differencing:

# Exact growth rates, computed as: (x - lag(x)) / lag(x) * 100
GGDC10S %>% 
  group_by(Variable,Country) %>%
     select_at(5:16) %>% fgrowth(c(1, 10), 1:2, Year)
# # A tibble: 5,027 x 47
# # Groups:   Variable, Country [85]
#    Variable Country  Year G1.AGR G2.AGR L10G1.AGR L10G2.AGR G1.MIN  G2.MIN L10G1.MIN L10G2.MIN
#  * <chr>    <chr>   <dbl>  <dbl>  <dbl>     <dbl>     <dbl>  <dbl>   <dbl>     <dbl>     <dbl>
#  1 VA       BWA      1960  NA      NA          NA        NA   NA      NA          NA        NA
#  2 VA       BWA      1961  NA      NA          NA        NA   NA      NA          NA        NA
#  3 VA       BWA      1962  NA      NA          NA        NA   NA      NA          NA        NA
#  4 VA       BWA      1963  NA      NA          NA        NA   NA      NA          NA        NA
#  5 VA       BWA      1964  NA      NA          NA        NA   NA      NA          NA        NA
#  6 VA       BWA      1965  -3.52   NA          NA        NA  -28.6    NA          NA        NA
#  7 VA       BWA      1966  12.4  -452.         NA        NA  -21.1   -26.3        NA        NA
#  8 VA       BWA      1967   8.29  -33.3        NA        NA   16.7  -179.         NA        NA
#  9 VA       BWA      1968  10.2    23.1        NA        NA  -20    -220.         NA        NA
# 10 VA       BWA      1969   3.61  -64.6        NA        NA  185.  -1026.         NA        NA
# # ... with 5,017 more rows, and 36 more variables: G1.MAN <dbl>, G2.MAN <dbl>, L10G1.MAN <dbl>,
# #   L10G2.MAN <dbl>, G1.PU <dbl>, G2.PU <dbl>, L10G1.PU <dbl>, L10G2.PU <dbl>, G1.CON <dbl>,
# #   G2.CON <dbl>, L10G1.CON <dbl>, L10G2.CON <dbl>, G1.WRT <dbl>, G2.WRT <dbl>, L10G1.WRT <dbl>,
# #   L10G2.WRT <dbl>, G1.TRA <dbl>, G2.TRA <dbl>, L10G1.TRA <dbl>, L10G2.TRA <dbl>, G1.FIRE <dbl>,
# #   G2.FIRE <dbl>, L10G1.FIRE <dbl>, L10G2.FIRE <dbl>, G1.GOV <dbl>, G2.GOV <dbl>, L10G1.GOV <dbl>,
# #   L10G2.GOV <dbl>, G1.OTH <dbl>, G2.OTH <dbl>, L10G1.OTH <dbl>, L10G2.OTH <dbl>, G1.SUM <dbl>,
# #   G2.SUM <dbl>, L10G1.SUM <dbl>, L10G2.SUM <dbl>

# Log-difference growth rates, computed as: log(x / lag(x)) * 100
GGDC10S %>% 
  group_by(Variable,Country) %>%
     select_at(5:16) %>% fgrowth(c(1, 10), 1:2, Year, logdiff = TRUE)
# # A tibble: 5,027 x 47
# # Groups:   Variable, Country [85]
#    Variable Country  Year Dlog1.AGR Dlog2.AGR L10Dlog1.AGR L10Dlog2.AGR Dlog1.MIN Dlog2.MIN
#  * <chr>    <chr>   <dbl>     <dbl>     <dbl>        <dbl>        <dbl>     <dbl>     <dbl>
#  1 VA       BWA      1960     NA         NA             NA           NA      NA          NA
#  2 VA       BWA      1961    NaN         NA             NA           NA     NaN          NA
#  3 VA       BWA      1962    NaN        NaN             NA           NA     NaN         NaN
#  4 VA       BWA      1963    NaN        NaN             NA           NA     NaN         NaN
#  5 VA       BWA      1964    NaN        NaN             NA           NA     NaN         NaN
#  6 VA       BWA      1965     -3.59     NaN             NA           NA     -33.6       NaN
#  7 VA       BWA      1966     11.7      NaN             NA           NA     -23.6       NaN
#  8 VA       BWA      1967      7.96     -38.6           NA           NA      15.4       NaN
#  9 VA       BWA      1968      9.72      19.9           NA           NA     -22.3       NaN
# 10 VA       BWA      1969      3.55    -101.            NA           NA     105.        NaN
# # ... with 5,017 more rows, and 38 more variables: L10Dlog1.MIN <dbl>, L10Dlog2.MIN <dbl>,
# #   Dlog1.MAN <dbl>, Dlog2.MAN <dbl>, L10Dlog1.MAN <dbl>, L10Dlog2.MAN <dbl>, Dlog1.PU <dbl>,
# #   Dlog2.PU <dbl>, L10Dlog1.PU <dbl>, L10Dlog2.PU <dbl>, Dlog1.CON <dbl>, Dlog2.CON <dbl>,
# #   L10Dlog1.CON <dbl>, L10Dlog2.CON <dbl>, Dlog1.WRT <dbl>, Dlog2.WRT <dbl>, L10Dlog1.WRT <dbl>,
# #   L10Dlog2.WRT <dbl>, Dlog1.TRA <dbl>, Dlog2.TRA <dbl>, L10Dlog1.TRA <dbl>, L10Dlog2.TRA <dbl>,
# #   Dlog1.FIRE <dbl>, Dlog2.FIRE <dbl>, L10Dlog1.FIRE <dbl>, L10Dlog2.FIRE <dbl>, Dlog1.GOV <dbl>,
# #   Dlog2.GOV <dbl>, L10Dlog1.GOV <dbl>, L10Dlog2.GOV <dbl>, Dlog1.OTH <dbl>, Dlog2.OTH <dbl>,
# #   L10Dlog1.OTH <dbl>, L10Dlog2.OTH <dbl>, Dlog1.SUM <dbl>, Dlog2.SUM <dbl>, L10Dlog1.SUM <dbl>,
# #   L10Dlog2.SUM <dbl>

fdiff and fgrowth can also perform leaded (forward) differences and growth rates, although I have never come to employ these in my personal work (i.e. ... %>% fgrowth(-c(1, 10), 1:2, Year) would compute one and 10-year leaded first and second growth rates). Again it is possible to perform sequential operations:

# This computes the 1 and 10-year growth rates, for the current period and lagged by one period
GGDC10S %>% 
  group_by(Variable,Country) %>%
     select_at(5:16) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year)
# # A tibble: 5,027 x 47
# # Groups:   Variable, Country [85]
#    Variable Country  Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN
#  * <chr>    <chr>   <dbl>  <dbl>     <dbl>     <dbl>        <dbl>  <dbl>     <dbl>     <dbl>
#  1 VA       BWA      1960  NA        NA           NA           NA   NA        NA          NA
#  2 VA       BWA      1961  NA        NA           NA           NA   NA        NA          NA
#  3 VA       BWA      1962  NA        NA           NA           NA   NA        NA          NA
#  4 VA       BWA      1963  NA        NA           NA           NA   NA        NA          NA
#  5 VA       BWA      1964  NA        NA           NA           NA   NA        NA          NA
#  6 VA       BWA      1965  -3.52     NA           NA           NA  -28.6      NA          NA
#  7 VA       BWA      1966  12.4      -3.52        NA           NA  -21.1     -28.6        NA
#  8 VA       BWA      1967   8.29     12.4         NA           NA   16.7     -21.1        NA
#  9 VA       BWA      1968  10.2       8.29        NA           NA  -20        16.7        NA
# 10 VA       BWA      1969   3.61     10.2         NA           NA  185.      -20          NA
# # ... with 5,017 more rows, and 37 more variables: L1.L10G1.MIN <dbl>, G1.MAN <dbl>,
# #   L1.G1.MAN <dbl>, L10G1.MAN <dbl>, L1.L10G1.MAN <dbl>, G1.PU <dbl>, L1.G1.PU <dbl>,
# #   L10G1.PU <dbl>, L1.L10G1.PU <dbl>, G1.CON <dbl>, L1.G1.CON <dbl>, L10G1.CON <dbl>,
# #   L1.L10G1.CON <dbl>, G1.WRT <dbl>, L1.G1.WRT <dbl>, L10G1.WRT <dbl>, L1.L10G1.WRT <dbl>,
# #   G1.TRA <dbl>, L1.G1.TRA <dbl>, L10G1.TRA <dbl>, L1.L10G1.TRA <dbl>, G1.FIRE <dbl>,
# #   L1.G1.FIRE <dbl>, L10G1.FIRE <dbl>, L1.L10G1.FIRE <dbl>, G1.GOV <dbl>, L1.G1.GOV <dbl>,
# #   L10G1.GOV <dbl>, L1.L10G1.GOV <dbl>, G1.OTH <dbl>, L1.G1.OTH <dbl>, L10G1.OTH <dbl>,
# #   L1.L10G1.OTH <dbl>, G1.SUM <dbl>, L1.G1.SUM <dbl>, L10G1.SUM <dbl>, L1.L10G1.SUM <dbl>

2.5 Benchmarks

Using the same data as in section 1.4 (1 million obs in 17000 groups), I run benchmarks of collapse functions against native dplyr solutions:

dim(data)
# [1] 1005400      13
GRP(data)
# collapse grouping object of length 1005400 with 17000 ordered groups
# 
# Call: GRP.grouped_df(X = data), ordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    4.00   53.00   62.00   59.14   63.00   65.00 
# 
# Groups with sizes: 
# EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1 
#        62        61        62        52        63        62 
#   ---
# VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99 
#         63         52         65         63         52         52

# Grouped Sum (mutate does not have an option to preserve missing values as given by "replace")
system.time(fsum(data, TRA = "replace_fill"))
#    user  system elapsed 
#    0.08    0.00    0.08
system.time(mutate_all(data, sum, na.rm = TRUE))
#    user  system elapsed 
#    0.21    0.05    0.25

# Dividing by the grouped sum
system.time(fsum(data, TRA = "/"))
#    user  system elapsed 
#    0.13    0.02    0.14
system.time(mutate_all(data, function(x) x/sum(x, na.rm = TRUE)))
#    user  system elapsed 
#    0.86    0.05    0.91

# Mean (between transformation)
system.time(fmean(data, TRA = "replace_fill"))
#    user  system elapsed 
#    0.05    0.05    0.09
system.time(fbetween(data, fill = TRUE))
#    user  system elapsed 
#    0.05    0.04    0.10
system.time(mutate_all(data, mean, na.rm = TRUE))
#    user  system elapsed 
#    2.75    0.03    2.78

# De-Mean (within transformation)
system.time(fmean(data, TRA = "-"))
#    user  system elapsed 
#    0.08    0.01    0.10
system.time(fwithin(data))
#    user  system elapsed 
#    0.06    0.03    0.10
system.time(mutate_all(data, function(x) x - mean(x, na.rm = TRUE)))
#    user  system elapsed 
#    2.31    0.08    2.39

# Centering on global mean
system.time(fwithin(data, add.global.mean = TRUE))
#    user  system elapsed 
#    0.08    0.00    0.08

# Weighted Demeaning
system.time(fwithin(data, SUM))
#    user  system elapsed 
#    0.08    0.00    0.08
system.time(fwithin(data, SUM, add.global.mean = TRUE))
#    user  system elapsed 
#    0.06    0.01    0.08

# Scaling
system.time(fsd(data, TRA = "/"))
#    user  system elapsed 
#    0.12    0.03    0.15
system.time(mutate_all(data, function(x) x/sd(x, na.rm = TRUE)))
#    user  system elapsed 
#    3.72    0.02    3.75

# Standardizing
system.time(fscale(data))
#    user  system elapsed 
#    0.10    0.02    0.12
# system.time(mutate_all(data, scale)) This takes 32 seconds to compute.. 

# Weighted Scaling and standardizing
system.time(fsd(data, SUM, TRA = "/"))
#    user  system elapsed 
#    0.12    0.02    0.14
system.time(fscale(data, SUM))
#    user  system elapsed 
#    0.07    0.03    0.11

# Lags and Leads
system.time(flag(data))
#    user  system elapsed 
#    0.01    0.04    0.04
system.time(mutate_all(data, lag))
#    user  system elapsed 
#    0.18    0.01    0.19
system.time(flag(data, -1))
#    user  system elapsed 
#    0.02    0.03    0.04
system.time(mutate_all(data, lead))
#    user  system elapsed 
#    0.17    0.01    0.19
system.time(flag(data, -1:1))
#    user  system elapsed 
#    0.07    0.04    0.10

# Differences
system.time(fdiff(data))
#    user  system elapsed 
#    0.04    0.02    0.06
system.time(fdiff(data,1,1:2))
#    user  system elapsed 
#    0.11    0.04    0.16
system.time(fdiff(data, c(1,10)))
#    user  system elapsed 
#    0.04    0.06    0.11
system.time(fdiff(data, c(1,10), 1:2))
#    user  system elapsed 
#    0.38    0.08    0.46

# Growth Rates
system.time(fgrowth(data))
#    user  system elapsed 
#    0.06    0.03    0.09
system.time(fgrowth(data,1,1:2))
#    user  system elapsed 
#    0.17    0.05    0.22
system.time(fgrowth(data, c(1,10)))
#    user  system elapsed 
#    0.12    0.04    0.17
system.time(fgrowth(data, c(1,10), 1:2))
#    user  system elapsed 
#    0.36    0.20    0.57

Again the benchmarks show stunning performance gains using collapse functions.

References

Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65-83). Routledge.


  1. 100% being the sum of all VA/EMP in a given sector and country across all years, not the sectoral output share which would have to be obtained using sweep(GGDC10S[6:16], 1, GGDC10S$SUM, "/")

  2. Row-wise operations are not supported by TRA.