collapse is a C/C++ based package for data manipulation in R. Its aims are to facilitate complex data transformation and exploration tasks and to help make R code fast, flexible, parsimonious and programmer friendly.
This vignette focuses on the integration of collapse and the popular dplyr package by Hadley Wickham. In particular it will demonstrate how using collapse’s fast functions can facilitate and speed up grouped and weighted aggregations and transformations, as well as panel-data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a dplyr workflow.
Note: This vignette is targeted at dplyr users. collapse is a standalone package and delivers even faster performance using its own grouping mechanism (based on data.table internals) and its own set of functions to efficiently select and replace variables. The ‘Introduction to collapse’ vignette provides a thorough introduction to the package, and a built-in structured documentation is available under help("collapse-documentation") after installing the package. In addition, help("collapse-package") provides a compact set of examples for a quick start.
A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct), which can dramatically speed up column-wise, grouped and weighted computations on vectors, matrices or data.frames. The functions are S3 generic, with a default (vector), matrix and data.frame method, as well as a grouped_df method for grouped tibbles used by dplyr. The grouped tibble method has the following arguments:
FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...)
where w is a weight variable (available only to fmean, fmode, fvar and fsd), and TRA can be used to transform x using the computed statistics and one of 8 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*"). These transformations perform grouped replacing or sweeping out of the statistics computed by the function (discussed in section 2). na.rm efficiently removes missing values and is TRUE by default. use.g.names generates new row-names from the unique combinations of groups (default: disabled), whereas keep.group_vars (default: enabled) keeps the grouping columns, as is custom in the native data %>% group_by(...) %>% summarize(...) workflow in dplyr. Finally, keep.w regulates whether a weighting variable used is also aggregated and saved in a column. For fmean, fvar and fsd this computes the sum of the weights in each group, whereas fmode returns the maximum weight (corresponding to the mode) in each group.
With that in mind, let’s consider some straightforward applications:
Consider the Groningen Growth and Development Center 10-Sector Database included in collapse:
library(collapse)
head(GGDC10S)
# # A tibble: 6 x 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s~ VA 1965 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
# Summarize the Data:
# descr(GGDC10S, cols = is.categorical)
# aperm(qsu(GGDC10S, ~Variable, cols = is.numeric))
Simple column-wise computations using the fast functions and pipe operators are performed as follows:
library(dplyr)
GGDC10S %>% fNobs # Number of Observations
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 5027 5027 5027 5027 5027 4364 4355 4355 4354
# CON WRT TRA FIRE GOV OTH SUM
# 4355 4355 4355 4355 3482 4248 4364
GGDC10S %>% fNdistinct # Number of distinct values
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 43 6 6 2 67 4353 4224 4353 4237
# CON WRT TRA FIRE GOV OTH SUM
# 4339 4344 4334 4349 3470 4238 4364
GGDC10S %>% select_at(6:16) %>% fmedian # Median
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# 4394.5194 173.2234 3718.0981 167.9500 1473.4470 3773.6430 1174.8000 960.1251 3928.5127
# OTH SUM
# 1433.1722 23186.1936
GGDC10S %>% fmode # Mode
# Country Regioncode Region Variable Year
# "USA" "ASI" "Asia" "EMP" "2010"
# AGR MIN MAN PU CON
# "171.315882316326" "0" "4645.12507642586" "0" "1.34623115930777"
# WRT TRA FIRE GOV OTH
# "21.8380052682527" "8.97743416914571" "40.0701608636442" "0" "3626.84423577048"
# SUM
# "37.4822945751317"
GGDC10S %>% fmode(drop = FALSE) # Keep data structure intact
# # A tibble: 1 x 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 USA ASI Asia EMP 2010 171. 0 4645. 0 1.35 21.8 8.98 40.1 0
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
Moving on to grouped statistics, we can compute the average value added and employment by sector and country using:
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmean
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 1.02e2 7.42e2 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35e0 1.23e2 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 3.65e2 3.52e3 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09e0 2.53e1 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 2.94e1 2.96e2 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1.61e3 2.09e4 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
# 7 EMP COL 3091. 145. 1175. 3.39e1 5.24e2 2.07e3 4.70e2 649. NA 1.73e3 9.89e3
# 8 EMP CRI 231. 1.70 136. 1.43e1 5.76e1 1.57e2 4.24e1 54.9 128. 6.51e1 8.87e2
# 9 EMP DEW 2490. 407. 8473. 2.26e2 2.09e3 4.44e3 1.48e3 1689. 3945. 9.99e2 2.62e4
# 10 EMP DNK 236. 8.03 507. 1.38e1 1.71e2 4.55e2 1.61e2 181. 549. 1.11e2 2.39e3
# # ... with 75 more rows
Similarly we can obtain the median or the standard deviation:
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmedian
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1325. 47.4 1988. 1.05e2 7.82e2 1.85e3 5.80e2 464. 1739. 866. 9.74e3
# 2 EMP BOL 943. 53.5 167. 4.46e0 6.60e1 1.32e2 9.70e1 15.3 NA 384. 1.84e3
# 3 EMP BRA 17481. 225. 7208. 3.76e2 4.05e3 6.45e3 1.58e3 4355. 4450. 4479. 5.19e4
# 4 EMP BWA 175. 12.2 13.1 3.71e0 1.90e1 2.11e1 6.75e0 10.4 53.8 31.2 3.61e2
# 5 EMP CHL 690. 93.9 607. 2.58e1 2.30e2 4.84e2 2.05e2 106. NA 900. 3.31e3
# 6 EMP CHN 293915 8150. 61761. 1.14e3 1.06e4 1.70e4 9.56e3 4328. 19468. 9954. 4.45e5
# 7 EMP COL 3006. 84.0 1033. 3.71e1 4.19e2 1.55e3 3.91e2 655. NA 1430. 8.63e3
# 8 EMP CRI 216. 1.49 114. 7.92e0 5.50e1 8.98e1 2.55e1 19.6 122. 60.6 7.19e2
# 9 EMP DEW 2178 320. 8459. 2.47e2 2.10e3 4.45e3 1.53e3 1656 3700 900 2.65e4
# 10 EMP DNK 187. 3.75 508. 1.36e1 1.65e2 4.61e2 1.61e2 169. 642. 104. 2.42e3
# # ... with 75 more rows
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fsd
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 242. 19.1 1.73e2 2.23e1 2.78e2 8.33e2 1.81e2 453. 1073. 4.89e2 3.05e3
# 2 EMP BOL 92.6 18.2 1.50e2 4.48e0 1.08e2 2.88e2 8.96e1 58.7 NA 2.49e2 9.47e2
# 3 EMP BRA 1975. 83.1 3.28e3 1.14e2 2.04e3 6.35e3 1.28e3 3144. 3787. 4.33e3 2.52e4
# 4 EMP BWA 31.3 4.70 1.52e1 1.91e0 1.90e1 3.69e1 6.09e0 13.4 42.2 1.14e1 1.67e2
# 5 EMP CHL 68.6 32.3 1.38e2 1.12e1 1.81e2 5.09e2 1.30e2 286. NA 4.18e2 1.61e3
# 6 EMP CHN 64477. 3450. 4.23e4 1.27e3 1.90e4 2.41e4 9.40e3 2910. 11973. 3.54e4 1.95e5
# 7 EMP COL 710. 127. 6.00e2 1.68e1 3.76e2 1.82e3 3.30e2 409. NA 9.96e2 5.27e3
# 8 EMP CRI 48.7 0.876 8.78e1 1.30e1 3.54e1 1.53e2 3.93e1 68.8 82.4 4.29e1 5.46e2
# 9 EMP DEW 1220. 188. 7.93e2 5.39e1 2.07e2 6.34e2 2.21e2 704. 1241. 3.55e2 2.23e3
# 10 EMP DNK 144. 7.96 7.67e1 1.96e0 2.63e1 5.43e1 1.61e1 91.1 255. 1.91e1 2.23e2
# # ... with 75 more rows
It is important not to use dplyr’s summarize together with the fast functions, since that would entirely eliminate their speed gain. These functions are fast because they are executed only once and carry out the grouped computations in C++, whereas summarize splits the data and then applies the function to each group in the grouped tibble (this also works with the fast functions, but is then slower than using primitive base functions, since the fast functions are S3 generic and dispatch adds overhead for every group).
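To make this concrete, here is a minimal sketch contrasting the two patterns, using fsum as the example (the benchmarks further below quantify the difference):
# Recommended: the grouped_df method of fsum is dispatched once and computes all groups in a single C++ pass
GGDC10S %>% group_by(Variable,Country) %>% select_at(6:16) %>% fsum
# Works, but forfeits the speed gain: summarise_all calls fsum separately for every group and column
GGDC10S %>% group_by(Variable,Country) %>% select_at(6:16) %>% summarise_all(fsum)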
To drive this point home it is perhaps good to shed some light on what is happening behind the scenes of dplyr and collapse. Fundamentally both packages follow different computing paradigms:
dplyr is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of dplyr. When a data.frame is passed through group_by, a ‘groups’ attribute is attached:
GGDC10S %>% group_by(Variable,Country) %>% attr("groups")
# # A tibble: 85 x 3
# Variable Country .rows
# <chr> <chr> <list>
# 1 EMP ARG <int [62]>
# 2 EMP BOL <int [61]>
# 3 EMP BRA <int [62]>
# 4 EMP BWA <int [52]>
# 5 EMP CHL <int [63]>
# 6 EMP CHN <int [62]>
# 7 EMP COL <int [61]>
# 8 EMP CRI <int [62]>
# 9 EMP DEW <int [61]>
# 10 EMP DNK <int [64]>
# # ... with 75 more rows
This object is a data.frame giving the unique groups and, in the third (last) column, vectors containing the indices of the rows belonging to each group. A command like summarize uses this information to split the data.frame into groups, which are then passed sequentially to the function used and later recombined.
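As a rough base-R sketch (not dplyr’s actual internals), the split-apply-combine idea using this ‘groups’ attribute amounts to something like the following, where g, chunks and res are hypothetical names introduced only for illustration:
g <- GGDC10S %>% group_by(Variable,Country) %>% attr("groups")
chunks <- lapply(g$.rows, function(idx) GGDC10S[idx, 6:16])                  # split: one data chunk per group
res <- lapply(chunks, function(d) vapply(d, sum, numeric(1), na.rm = TRUE))  # apply: compute per chunk
# ... the results are then recombined with the group identifiers into the aggregated data.frame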
Now collapse is based around one-pass grouped computations at the C++ level: the data is not split and recombined; instead, the entire computation is performed in a single C++ loop that runs through the data and completes the computations for each group simultaneously. This modus operandi is also evident in collapse grouping objects. The method GRP.grouped_df takes a dplyr grouping object from a grouped tibble and efficiently converts it to a collapse grouping object:
GGDC10S %>% group_by(Variable,Country) %>% GRP %>% str
# List of 8
# $ N.groups : int 85
# $ group.id : int [1:5027] 46 46 46 46 46 46 46 46 46 46 ...
# $ group.sizes: int [1:85] 62 61 62 52 63 62 61 62 61 64 ...
# $ groups :List of 2
# ..$ Variable: chr [1:85] "EMP" "EMP" "EMP" "EMP" ...
# .. ..- attr(*, "label")= chr "Variable"
# .. ..- attr(*, "format.stata")= chr "%9s"
# ..$ Country : chr [1:85] "ARG" "BOL" "BRA" "BWA" ...
# .. ..- attr(*, "label")= chr "Country"
# .. ..- attr(*, "format.stata")= chr "%9s"
# $ group.vars : chr [1:2] "Variable" "Country"
# $ ordered : logi [1:2] TRUE TRUE
# $ order : NULL
# $ call : language GRP.grouped_df(X = .)
# - attr(*, "class")= chr "GRP"
This object is a list where the first three elements give the number of groups, the group-id to which each row belongs, and a vector of group-sizes. A function like fsum uses this information to (for each column) create a result vector of size ‘N.groups’ and then run through the column, using the ‘group.id’ vector to add the i’th data point to the ‘group.id[i]’th element of the result vector. When the loop is finished, the grouped computation is also finished.
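The following is a rough R sketch of that single-pass idea for one column (the actual implementation is a compiled C++ loop; single_pass_sum is a hypothetical illustration, with g a GRP() object as shown above):
single_pass_sum <- function(x, g) {
  out <- numeric(g$N.groups)                  # one result slot per group
  for (i in seq_along(x)) {                   # a single run through the data
    if (!is.na(x[i])) out[g$group.id[i]] <- out[g$group.id[i]] + x[i]
  }
  out
}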
It is clear that collapse is prone to be faster than dplyr, since its method of computing involves fewer (and computationally less intensive) steps. A slight qualifier to this is the additional conversion cost incurred by GRP.grouped_df when using the fast functions on grouped tibbles. This cost is however quite low, as the benchmarks at the end show (since GRP.grouped_df is also implemented in C++).
One can also aggregate with multiple functions at the same time, but then the programming becomes a bit verbose. In particular, one needs to use curly braces { } to prevent first-argument injection, so that %>% cbind(FUN1(.), FUN2(.)) does not evaluate as %>% cbind(., FUN1(.), FUN2(.)):
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% {
cbind(fmedian(.),
add_stub(fmean(., keep.group_vars = FALSE), "mean_"))
} %>% head(3)
# Variable Country AGR MIN MAN PU CON WRT TRA
# 1 EMP ARG 1324.5255 47.35255 1987.5912 104.738825 782.40283 1854.612 579.93982
# 2 EMP BOL 943.1612 53.53538 167.1502 4.457895 65.97904 132.225 96.96828
# 3 EMP BRA 17480.9810 225.43693 7207.7915 375.851832 4054.66103 6454.523 1580.81120
# FIRE GOV OTH SUM mean_AGR mean_MIN mean_MAN mean_PU mean_CON
# 1 464.39920 1738.836 866.1119 9743.223 1419.8013 52.08903 1931.7602 101.720936 742.4044
# 2 15.34259 NA 384.0678 1842.055 964.2103 56.03295 235.0332 5.346433 122.7827
# 3 4354.86210 4449.942 4478.6927 51881.110 17191.3529 206.02389 6991.3710 364.573404 3524.7384
# mean_WRT mean_TRA mean_FIRE mean_GOV mean_OTH mean_SUM
# 1 1982.1775 648.5119 627.79291 2043.471 992.4475 10542.177
# 2 281.5164 115.4728 44.56442 NA 395.5650 2220.524
# 3 8509.4612 2054.3731 4413.54448 5307.280 5710.2665 54272.985
The function add_stub used above is a collapse function that adds a prefix (default) or suffix to variable names.
A slightly more elegant solution to such multi-function aggregations can be found using get_vars, a collapse function to efficiently select variables. In contrast to select_at, get_vars does not automatically add the grouping columns to the selection. Next to get_vars, collapse also introduces the predicates num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to select columns by type. Finally, the function add_vars provides a more efficient alternative to cbind.data.frame. The idea here is ‘adding’ variables to the data.frame in the first argument, i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame:
GGDC10S %>%
group_by(Variable,Country) %>% {
add_vars(group_keys(.),
ffirst(get_vars(., "Reg", regex = TRUE)), # Regular expression matching column names
add_stub(fmean(num_vars(.)), "mean_"), # num_vars selects all numeric variables
add_stub(fmedian(get_vars(., 9:12)), "median_"), # columns 9-12
add_stub(fmin(get_vars(., 9:10)), "min_")) # columns 9:10
}
# # A tibble: 85 x 22
# Variable Country Regioncode Region mean_Year mean_AGR mean_MIN mean_MAN mean_PU mean_CON mean_WRT
# * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1420. 52.1 1932. 102. 742. 1982.
# 2 EMP BOL LAM Latin~ 1980 964. 56.0 235. 5.35 123. 282.
# 3 EMP BRA LAM Latin~ 1980. 17191. 206. 6991. 365. 3525. 8509.
# 4 EMP BWA SSA Sub-s~ 1986. 188. 10.5 18.1 3.09 25.3 36.3
# 5 EMP CHL LAM Latin~ 1981 702. 101. 625. 29.4 296. 695.
# 6 EMP CHN ASI Asia 1980. 287744. 7050. 67144. 1606. 20852. 28908.
# 7 EMP COL LAM Latin~ 1980 3091. 145. 1175. 33.9 524. 2071.
# 8 EMP CRI LAM Latin~ 1980. 231. 1.70 136. 14.3 57.6 157.
# 9 EMP DEW EUR Europe 1980 2490. 407. 8473. 226. 2093. 4442.
# 10 EMP DNK EUR Europe 1980. 236. 8.03 507. 13.8 171. 455.
# # ... with 75 more rows, and 11 more variables: mean_TRA <dbl>, mean_FIRE <dbl>, mean_GOV <dbl>,
# # mean_OTH <dbl>, mean_SUM <dbl>, median_PU <dbl>, median_CON <dbl>, median_WRT <dbl>,
# # median_TRA <dbl>, min_PU <dbl>, min_CON <dbl>
Another nice feature of add_vars is that it can also very efficiently reorder columns, i.e. bind columns in a different order than they are passed. This can be done by simply specifying the positions the added columns should have in the final data.frame; add_vars then shifts the columns of the first argument (the group_keys in the example below) to the right to fill in the gaps.
GGDC10S %>%
group_by(Variable,Country) %>% {
add_vars(group_keys(.),
add_stub(fmean(get_vars(., c("AGR","SUM"))), "mean_"),
add_stub(fsd(get_vars(., c("AGR","SUM"))), "sd_"),
pos = c(2,4,3,5))
}
# # A tibble: 85 x 6
# Variable mean_AGR sd_AGR mean_SUM sd_SUM Country
# * <chr> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 EMP 1420. 242. 10542. 3048. ARG
# 2 EMP 964. 92.6 2221. 947. BOL
# 3 EMP 17191. 1975. 54273. 25237. BRA
# 4 EMP 188. 31.3 394. 167. BWA
# 5 EMP 702. 68.6 3982. 1608. CHL
# 6 EMP 287744. 64477. 485820. 195284. CHN
# 7 EMP 3091. 710. 9892. 5265. COL
# 8 EMP 231. 48.7 887. 546. CRI
# 9 EMP 2490. 1220. 26247. 2231. DEW
# 10 EMP 236. 144. 2394. 223. DNK
# # ... with 75 more rows
A much more compact solution to multi-function and multi-type aggregation with dplyr is offered by the function collapg:
# This aggregates numeric columns using the mean (fmean) and categorical columns with the mode (fmode)
GGDC10S %>% group_by(Variable,Country) %>% collapg
# # A tibble: 85 x 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1.42e3 5.21e1 1.93e3 1.02e2 7.42e2 1.98e3 6.49e2 628.
# 2 EMP BOL LAM Latin~ 1980 9.64e2 5.60e1 2.35e2 5.35e0 1.23e2 2.82e2 1.15e2 44.6
# 3 EMP BRA LAM Latin~ 1980. 1.72e4 2.06e2 6.99e3 3.65e2 3.52e3 8.51e3 2.05e3 4414.
# 4 EMP BWA SSA Sub-s~ 1986. 1.88e2 1.05e1 1.81e1 3.09e0 2.53e1 3.63e1 8.36e0 15.3
# 5 EMP CHL LAM Latin~ 1981 7.02e2 1.01e2 6.25e2 2.94e1 2.96e2 6.95e2 2.58e2 272.
# 6 EMP CHN ASI Asia 1980. 2.88e5 7.05e3 6.71e4 1.61e3 2.09e4 2.89e4 1.39e4 4929.
# 7 EMP COL LAM Latin~ 1980 3.09e3 1.45e2 1.18e3 3.39e1 5.24e2 2.07e3 4.70e2 649.
# 8 EMP CRI LAM Latin~ 1980. 2.31e2 1.70e0 1.36e2 1.43e1 5.76e1 1.57e2 4.24e1 54.9
# 9 EMP DEW EUR Europe 1980 2.49e3 4.07e2 8.47e3 2.26e2 2.09e3 4.44e3 1.48e3 1689.
# 10 EMP DNK EUR Europe 1980. 2.36e2 8.03e0 5.07e2 1.38e1 1.71e2 4.55e2 1.61e2 181.
# # ... with 75 more rows, and 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>
By default it aggregates numeric columns using the mean and categorical columns using the mode, and preserves the order of all columns. Changing these defaults is very easy:
# This aggregates numeric columns using the median and categorical columns using the last value
GGDC10S %>% group_by(Variable,Country) %>% collapg(fmedian, flast)
# # A tibble: 85 x 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1.32e3 4.74e1 1.99e3 1.05e2 7.82e2 1.85e3 5.80e2 464.
# 2 EMP BOL LAM Latin~ 1980 9.43e2 5.35e1 1.67e2 4.46e0 6.60e1 1.32e2 9.70e1 15.3
# 3 EMP BRA LAM Latin~ 1980. 1.75e4 2.25e2 7.21e3 3.76e2 4.05e3 6.45e3 1.58e3 4355.
# 4 EMP BWA SSA Sub-s~ 1986. 1.75e2 1.22e1 1.31e1 3.71e0 1.90e1 2.11e1 6.75e0 10.4
# 5 EMP CHL LAM Latin~ 1981 6.90e2 9.39e1 6.07e2 2.58e1 2.30e2 4.84e2 2.05e2 106.
# 6 EMP CHN ASI Asia 1980. 2.94e5 8.15e3 6.18e4 1.14e3 1.06e4 1.70e4 9.56e3 4328.
# 7 EMP COL LAM Latin~ 1980 3.01e3 8.40e1 1.03e3 3.71e1 4.19e2 1.55e3 3.91e2 655.
# 8 EMP CRI LAM Latin~ 1980. 2.16e2 1.49e0 1.14e2 7.92e0 5.50e1 8.98e1 2.55e1 19.6
# 9 EMP DEW EUR Europe 1980 2.18e3 3.20e2 8.46e3 2.47e2 2.10e3 4.45e3 1.53e3 1656
# 10 EMP DNK EUR Europe 1980. 1.87e2 3.75e0 5.08e2 1.36e1 1.65e2 4.61e2 1.61e2 169.
# # ... with 75 more rows, and 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>
One can apply multiple functions to both numeric and/or categorical data:
GGDC10S %>% group_by(Variable,Country) %>%
collapg(list(fmean, fmedian), list(first, fmode, flast))
# # A tibble: 85 x 32
# Variable Country first.Regioncode fmode.Regioncode flast.Regioncode first.Region fmode.Region
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 EMP ARG LAM LAM LAM Latin Ameri~ Latin Ameri~
# 2 EMP BOL LAM LAM LAM Latin Ameri~ Latin Ameri~
# 3 EMP BRA LAM LAM LAM Latin Ameri~ Latin Ameri~
# 4 EMP BWA SSA SSA SSA Sub-saharan~ Sub-saharan~
# 5 EMP CHL LAM LAM LAM Latin Ameri~ Latin Ameri~
# 6 EMP CHN ASI ASI ASI Asia Asia
# 7 EMP COL LAM LAM LAM Latin Ameri~ Latin Ameri~
# 8 EMP CRI LAM LAM LAM Latin Ameri~ Latin Ameri~
# 9 EMP DEW EUR EUR EUR Europe Europe
# 10 EMP DNK EUR EUR EUR Europe Europe
# # ... with 75 more rows, and 25 more variables: flast.Region <chr>, fmean.Year <dbl>,
# # fmedian.Year <dbl>, fmean.AGR <dbl>, fmedian.AGR <dbl>, fmean.MIN <dbl>, fmedian.MIN <dbl>,
# # fmean.MAN <dbl>, fmedian.MAN <dbl>, fmean.PU <dbl>, fmedian.PU <dbl>, fmean.CON <dbl>,
# # fmedian.CON <dbl>, fmean.WRT <dbl>, fmedian.WRT <dbl>, fmean.TRA <dbl>, fmedian.TRA <dbl>,
# # fmean.FIRE <dbl>, fmedian.FIRE <dbl>, fmean.GOV <dbl>, fmedian.GOV <dbl>, fmean.OTH <dbl>,
# # fmedian.OTH <dbl>, fmean.SUM <dbl>, fmedian.SUM <dbl>
Applying multiple functions to only numeric (or only categorical) data allows the result to be returned in a long format:
GGDC10S %>% group_by(Variable,Country) %>%
collapg(list(fmean, fmedian), cols = is.numeric, return = "long")
# # A tibble: 170 x 15
# Function Variable Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 fmean EMP ARG 1980. 1.42e3 5.21e1 1.93e3 1.02e2 7.42e2 1.98e3 6.49e2 628. 2043.
# 2 fmean EMP BOL 1980 9.64e2 5.60e1 2.35e2 5.35e0 1.23e2 2.82e2 1.15e2 44.6 NA
# 3 fmean EMP BRA 1980. 1.72e4 2.06e2 6.99e3 3.65e2 3.52e3 8.51e3 2.05e3 4414. 5307.
# 4 fmean EMP BWA 1986. 1.88e2 1.05e1 1.81e1 3.09e0 2.53e1 3.63e1 8.36e0 15.3 61.1
# 5 fmean EMP CHL 1981 7.02e2 1.01e2 6.25e2 2.94e1 2.96e2 6.95e2 2.58e2 272. NA
# 6 fmean EMP CHN 1980. 2.88e5 7.05e3 6.71e4 1.61e3 2.09e4 2.89e4 1.39e4 4929. 22669.
# 7 fmean EMP COL 1980 3.09e3 1.45e2 1.18e3 3.39e1 5.24e2 2.07e3 4.70e2 649. NA
# 8 fmean EMP CRI 1980. 2.31e2 1.70e0 1.36e2 1.43e1 5.76e1 1.57e2 4.24e1 54.9 128.
# 9 fmean EMP DEW 1980 2.49e3 4.07e2 8.47e3 2.26e2 2.09e3 4.44e3 1.48e3 1689. 3945.
# 10 fmean EMP DNK 1980. 2.36e2 8.03e0 5.07e2 1.38e1 1.71e2 4.55e2 1.61e2 181. 549.
# # ... with 160 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>
Finally, collapg also makes it very easy to apply aggregator functions to certain columns only:
GGDC10S %>% group_by(Variable,Country) %>%
collapg(custom = list(fmean = 6:8, fmedian = 10:12))
# # A tibble: 85 x 8
# Variable Country fmean.MAN fmean.PU fmean.CON fmedian.TRA fmedian.FIRE fmedian.GOV
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1932. 102. 742. 580. 464. 1739.
# 2 EMP BOL 235. 5.35 123. 97.0 15.3 NA
# 3 EMP BRA 6991. 365. 3525. 1581. 4355. 4450.
# 4 EMP BWA 18.1 3.09 25.3 6.75 10.4 53.8
# 5 EMP CHL 625. 29.4 296. 205. 106. NA
# 6 EMP CHN 67144. 1606. 20852. 9564. 4328. 19468.
# 7 EMP COL 1175. 33.9 524. 391. 655. NA
# 8 EMP CRI 136. 14.3 57.6 25.5 19.6 122.
# 9 EMP DEW 8473. 226. 2093. 1525. 1656 3700
# 10 EMP DNK 507. 13.8 171. 161. 169. 642.
# # ... with 75 more rows
To understand more about collapg, look it up in the documentation (?collapg).
Weighted aggregations are possible with the functions fmean, fmode, fvar and fsd. The implementation is such that by default (option keep.w = TRUE) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. fmean, fsd and fvar compute a grouped sum of the weight column and place it next to the group-identifiers, whereas fmode computes the maximum weight (corresponding to the mode).
# This computes a frequency-weighted grouped standard deviation, taking the total EMP / VA as weight
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fsd(SUM)
# # A tibble: 85 x 13
# Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 6.54e5 225. 2.22e1 1.76e2 2.05e1 2.85e2 8.56e2 1.95e2 493. 1123. 5.06e2
# 2 EMP BOL 1.35e5 99.7 1.71e1 1.68e2 4.87e0 1.23e2 3.24e2 9.81e1 69.8 NA 2.58e2
# 3 EMP BRA 3.36e6 1587. 7.38e1 2.95e3 9.38e1 1.86e3 6.28e3 1.31e3 3003. 3621. 4.26e3
# 4 EMP BWA 1.85e4 32.2 3.72e0 1.48e1 1.59e0 1.80e1 3.87e1 6.02e0 13.5 39.8 8.94e0
# 5 EMP CHL 2.51e5 71.0 3.99e1 1.29e2 1.24e1 1.88e2 5.51e2 1.34e2 313. NA 4.26e2
# 6 EMP CHN 2.91e7 56281. 3.09e3 4.04e4 1.27e3 1.92e4 2.45e4 9.26e3 2853. 11541. 3.74e4
# 7 EMP COL 6.03e5 637. 1.48e2 5.94e2 1.52e1 3.97e2 1.89e3 3.62e2 435. NA 1.01e3
# 8 EMP CRI 5.50e4 40.4 1.04e0 7.93e1 1.37e1 3.44e1 1.68e2 4.53e1 79.8 80.7 4.34e1
# 9 EMP DEW 1.10e6 1175. 1.83e2 7.42e2 5.32e1 1.94e2 6.06e2 2.12e2 699. 1225. 3.55e2
# 10 EMP DNK 1.53e5 139. 7.45e0 7.73e1 1.92e0 2.56e1 5.33e1 1.57e1 91.6 248. 1.95e1
# # ... with 75 more rows
# This computes a weighted grouped mode, taking the total EMP / VA as weight
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmode(SUM)
# # A tibble: 85 x 13
# Variable Country max.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 17929. 1.16e3 127. 2.16e3 1.52e2 1.41e3 3768. 1.06e3 1.75e3 4336. 2.00e3
# 2 EMP BOL 4508. 8.19e2 37.6 6.04e2 1.08e1 4.33e2 893. 3.33e2 3.21e2 NA 1.06e3
# 3 EMP BRA 102572. 1.65e4 313. 1.18e4 3.88e2 8.15e3 21860. 5.17e3 1.20e4 12149. 1.42e4
# 4 EMP BWA 668. 1.71e2 13.1 4.33e1 3.93e0 1.81e1 129. 2.10e1 4.67e1 113. 2.62e1
# 5 EMP CHL 7559. 6.30e2 249. 7.42e2 6.07e1 6.71e2 1989. 4.81e2 8.54e2 NA 1.88e3
# 6 EMP CHN 764200 2.66e5 9247. 1.43e5 3.53e3 6.99e4 84165. 3.12e4 1.08e4 43240. 1.03e5
# 7 EMP COL 21114. 3.93e3 513. 2.37e3 5.89e1 1.41e3 6069. 1.36e3 1.82e3 NA 3.57e3
# 8 EMP CRI 2058. 2.83e2 2.42 2.49e2 4.38e1 1.20e2 489. 1.44e2 2.25e2 328. 1.75e2
# 9 EMP DEW 31261 1.03e3 260 8.73e3 2.91e2 2.06e3 4398 1.63e3 3.26e3 6129 1.79e3
# 10 EMP DNK 2823. 7.85e1 3.12 3.99e2 1.14e1 1.95e2 579. 1.87e2 3.82e2 835. 1.50e2
# # ... with 75 more rows
The weighted variance / standard deviation is currently only implemented with frequency weights. Reliability weights may be implemented in a further update of collapse, if this is a strongly requested feature.
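As a reference point, the frequency-weighted estimator this corresponds to can be sketched in base R as follows (a plain illustration of the textbook formula with a hypothetical helper w_var; see ?fvar for the authoritative definition):
w_var <- function(x, w) {
  cc <- complete.cases(x, w)               # use pairwise complete observations
  x <- x[cc]; w <- w[cc]
  xbar <- sum(w * x) / sum(w)              # weighted mean
  sum(w * (x - xbar)^2) / (sum(w) - 1)     # frequency weights: divisor is sum(w) - 1
}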
Weighted aggregations may also be performed with collapg, although this does not aggregate and save the weights.
# This aggregates numeric columns using the weighted mean and categorical columns using the weighted mode
GGDC10S %>% group_by(Variable,Country) %>% collapg(w = .$SUM)
# # A tibble: 85 x 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1985. 1.36e3 5.65e1 1.93e3 1.05e2 8.11e2 2.22e3 6.95e2 754.
# 2 EMP BOL LAM Latin~ 1987. 9.77e2 5.79e1 2.96e2 7.07e0 1.67e2 4.00e2 1.52e2 67.3
# 3 EMP BRA LAM Latin~ 1989. 1.77e4 2.38e2 8.47e3 3.89e2 4.44e3 1.14e4 2.62e3 5841.
# 4 EMP BWA SSA Sub-s~ 1993. 2.00e2 1.21e1 2.43e1 3.70e0 3.14e1 5.08e1 1.08e1 20.8
# 5 EMP CHL LAM Latin~ 1988. 6.93e2 1.07e2 6.68e2 3.35e1 3.67e2 8.95e2 3.09e2 382.
# 6 EMP CHN ASI Asia 1988. 3.09e5 8.23e3 8.34e4 2.09e3 2.80e4 3.80e4 1.75e4 6048.
# 7 EMP COL LAM Latin~ 1989. 3.44e3 2.04e2 1.49e3 4.20e1 7.18e2 3.02e3 6.39e2 854.
# 8 EMP CRI LAM Latin~ 1991. 2.54e2 2.10e0 1.87e2 2.19e1 7.84e1 2.47e2 6.50e1 94.2
# 9 EMP DEW EUR Europe 1971. 2.40e3 3.95e2 8.51e3 2.29e2 2.10e3 4.49e3 1.50e3 1740.
# 10 EMP DNK EUR Europe 1981. 2.23e2 7.41e0 5.03e2 1.39e1 1.72e2 4.60e2 1.62e2 189.
# # ... with 75 more rows, and 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>
Thus to aggregate the entire data and save the weights one would need to opt for a manual solution:
GGDC10S %>%
group_by(Variable,Country) %>% {
add_vars(fmean(select_at(., 6:16), SUM), # Again select_at preserves grouping columns,
fmode(get_vars(., c(2:3,16)), SUM), # get_vars does not! Both preserve attributes
pos = c(5, 2:3))
}
# # A tibble: 85 x 16
# Variable Regioncode Region Country max.SUM sum.SUM AGR MIN MAN PU CON WRT
# * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP LAM Latin~ ARG 17929. 6.54e5 1.36e3 5.65e1 1.93e3 1.05e2 8.11e2 2.22e3
# 2 EMP LAM Latin~ BOL 4508. 1.35e5 9.77e2 5.79e1 2.96e2 7.07e0 1.67e2 4.00e2
# 3 EMP LAM Latin~ BRA 102572. 3.36e6 1.77e4 2.38e2 8.47e3 3.89e2 4.44e3 1.14e4
# 4 EMP SSA Sub-s~ BWA 668. 1.85e4 2.00e2 1.21e1 2.43e1 3.70e0 3.14e1 5.08e1
# 5 EMP LAM Latin~ CHL 7559. 2.51e5 6.93e2 1.07e2 6.68e2 3.35e1 3.67e2 8.95e2
# 6 EMP ASI Asia CHN 764200 2.91e7 3.09e5 8.23e3 8.34e4 2.09e3 2.80e4 3.80e4
# 7 EMP LAM Latin~ COL 21114. 6.03e5 3.44e3 2.04e2 1.49e3 4.20e1 7.18e2 3.02e3
# 8 EMP LAM Latin~ CRI 2058. 5.50e4 2.54e2 2.10e0 1.87e2 2.19e1 7.84e1 2.47e2
# 9 EMP EUR Europe DEW 31261 1.10e6 2.40e3 3.95e2 8.51e3 2.29e2 2.10e3 4.49e3
# 10 EMP EUR Europe DNK 2823. 1.53e5 2.23e2 7.41e0 5.03e2 1.39e1 1.72e2 4.60e2
# # ... with 75 more rows, and 4 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>
Below I provide a set of benchmarks for the standard set of functions commonly used in aggregations. For this purpose I duplicate and row-bind the GGDC10S dataset used so far 200 times, yielding a dataset of approx. 1 million observations while keeping the groups unique. My Windows laptop on which these benchmarks were run has a 2x 2.2 GHz Intel i5 processor, 8GB DDR3 RAM and a Samsung SSD hard drive (so a decent laptop, but nothing fancy).
# This replicates the data 200 times while keeping Country and Variable (columns 1 and 4) unique
data <- replicate(200, GGDC10S, simplify = FALSE) # gv and gv<- are shortcuts for get_vars and get_vars<-
uniquify <- function(x, i) `gv<-`(x, c(1,4), value = lapply(gv(x, c(1,4)), paste0, i))
data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE)
dim(data)
# [1] 1005400 16
GRP(data, c(1,4))$N.groups # This shows the number of groups.
# [1] 17000
# Grouping: This is still a key bottleneck of dplyr compared to data.table and collapse
system.time(group_by(data,Variable,Country))
# user system elapsed
# 0.14 0.00 0.14
system.time(GRP(data, c(1,4)))
# user system elapsed
# 0.04 0.00 0.05
library(microbenchmark)
# Selection
microbenchmark(select_at(data, 6:16))
# Unit: milliseconds
# expr min lq mean median uq max neval
# select_at(data, 6:16) 11.5846 11.74948 12.32206 11.99961 12.48735 15.25186 100
microbenchmark(get_vars(data, 6:16))
# Unit: microseconds
# expr min lq mean median uq max neval
# get_vars(data, 6:16) 7.586 8.479 9.07241 8.479 8.925 44.178 100
data <- data %>% group_by(Variable,Country) %>% select_at(6:16)
# Conversion of Grouping object: This time is also required in all computations below using collapse fast functions
microbenchmark(GRP(data))
# Unit: milliseconds
# expr min lq mean median uq max neval
# GRP(data) 2.947021 4.238463 4.817924 4.545704 4.588767 25.67264 100
# Sum
system.time(fsum(data))
# user system elapsed
# 0.04 0.00 0.04
system.time(summarise_all(data, sum, na.rm = TRUE))
# user system elapsed
# 0.1 0.0 0.1
# Product
system.time(fprod(data))
# user system elapsed
# 0.05 0.00 0.04
system.time(summarise_all(data, prod, na.rm = TRUE))
# user system elapsed
# 0.45 0.00 0.45
# Mean
system.time(fmean(data))
# user system elapsed
# 0.05 0.00 0.04
system.time(summarise_all(data, mean, na.rm = TRUE))
# user system elapsed
# 1.92 0.01 1.94
# Weighted Mean
system.time(fmean(data, SUM)) # This cannot easily be performed in dplyr
# user system elapsed
# 0.07 0.00 0.06
# Median
system.time(fmedian(data))
# user system elapsed
# 0.08 0.00 0.08
system.time(summarise_all(data, median, na.rm = TRUE))
# user system elapsed
# 8.72 0.00 8.72
# Standard-Deviation
system.time(fsd(data))
# user system elapsed
# 0.08 0.01 0.09
system.time(summarise_all(data, sd, na.rm = TRUE))
# user system elapsed
# 3.18 0.00 3.17
# Weighted Standard-Deviation
system.time(fsd(data, SUM))
# user system elapsed
# 0.08 0.00 0.07
# Maximum
system.time(fmax(data))
# user system elapsed
# 0.03 0.00 0.03
system.time(summarise_all(data, max, na.rm = TRUE))
# user system elapsed
# 0.04 0.00 0.05
# First Value
system.time(ffirst(data, na.rm = FALSE))
# user system elapsed
# 0.03 0.00 0.03
system.time(summarise_all(data, first))
# user system elapsed
# 0.60 0.00 0.59
# Distinct Values
system.time(fNdistinct(data))
# user system elapsed
# 0.25 0.08 0.33
system.time(summarise_all(data, n_distinct, na.rm = TRUE))
# user system elapsed
# 2.33 0.00 2.33
# Mode
system.time(fmode(data))
# user system elapsed
# 0.23 0.11 0.34
# Weighted Mode
system.time(fmode(data, SUM))
# user system elapsed
# 0.36 0.11 0.47
The benchmarks show that at this data size, efficient primitives like base::sum or base::max can still deliver very decent performance with summarize. Less optimized base functions like mean, median and sd, however, take multiple seconds to compute, and here the collapse fast functions really prove to be very useful complements to the dplyr system.
Weighted statistics are also performed extremely fast by collapse functions. I would not know how to compute weighted statistics by groups in dplyr, as it would require the weighting variable to be split as well, which seems impossible in native dplyr.
A further highlight of collapse is the extremely fast statistical mode function, which can also compute a weighted mode. Fast categorical aggregation has long been an issue in R: defining a mode function in base R and applying it to 17000 groups will likely take at least a minute to run, whereas fmode reduces this time to about half a second.
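For illustration, a hand-rolled base-R mode of the kind this comparison has in mind might look as follows (base_mode is a hypothetical helper, not part of collapse); applying it per group via summarise_all is the slow route that fmode replaces:
base_mode <- function(x) {
  x <- x[!is.na(x)]                         # drop missing values
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]     # most frequent value
}
# data %>% summarise_all(base_mode)   # one call per group and column: slow
# fmode(data)                         # single grouped C++ pass: fast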
Thus, in terms of data aggregation, collapse fast functions are able to speed up dplyr to a level that makes it attractive again to R users working on medium-sized or larger data, and to everyone programming with dplyr. However, I strongly recommend collapse itself for easy and speedy programming, as it does not rely on non-standard evaluation and has less R-level overhead than dplyr.
In all of this, the grouping system of dplyr remains the central bottleneck. For example, grouping 10 million observations in 1 million groups takes around 10 seconds with group_by, whereas GRP takes around 1.5 seconds, and this gap widens as the data grow larger. Rewriting group_by using GRP / data.table’s forderv and then writing a simple C++ conversion program for the grouping object could be a quick remedy for this issue, but that is at the discretion of Hadley Wickham and coauthors.
Fast aggregations are just the tip of the iceberg compared to what collapse can bring to dplyr in terms of grouped transformations.
All statistical (scalar-valued) functions in the collapse package (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct) have a TRA argument which can be used to efficiently transform data, either by (column-wise) replacing data values with supplied statistics or by sweeping the statistics out of the data. Operations can be specified using either an integer or a quoted operator / string. The 8 operations supported by TRA are:
1 - “replace_fill” : replace and overwrite missing values
2 - “replace” : replace but preserve missing values
3 - “-” : subtract (center)
4 - “-+” : subtract group-statistics but add average of group statistics
5 - “/” : divide (scale)
6 - “%” : compute percentages (divide and multiply by 100)
7 - “+” : add
8 - "*" : multiply
For functions supporting weights (fmean, fmode, fvar and fsd), the TRA argument is in the third position, following the data and the weight vector (in the grouped_df method), whereas functions not supporting weights have the argument in the second position.
Simple transformations are again straightforward to specify:
# This subtracts the median value from all data points i.e. centers on the median
GGDC10S %>% num_vars %>% fmedian(TRA = "-")
# # A tibble: 5,027 x 12
# Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 -22 NA NA NA NA NA NA NA NA NA NA NA
# 2 -21 NA NA NA NA NA NA NA NA NA NA NA
# 3 -20 NA NA NA NA NA NA NA NA NA NA NA
# 4 -19 NA NA NA NA NA NA NA NA NA NA NA
# 5 -18 -4378. -170. -3717. -168. -1473. -3767. -1173. -959. -3924. -1431. -23149.
# 6 -17 -4379. -171. -3717. -168. -1472. -3767. -1173. -959. -3923. -1430. -23147.
# 7 -16 -4377. -171. -3717. -168. -1472. -3765. -1173. -959. -3922. -1430. -23143.
# 8 -15 -4375. -171. -3717. -168. -1473. -3769. -1173. -959. -3921. -1430. -23145.
# 9 -14 -4373. -171. -3717. -168. -1472. -3768. -1172. -959. -3923. -1431. -23145.
# 10 -13 -4373. -168. -3716. -167. -1470. -3768. -1172. -959. -3923. -1431. -23135.
# # ... with 5,017 more rows
# This replaces all data points with the mode
GGDC10S %>% char_vars %>% fmode(TRA = "replace")
# # A tibble: 5,027 x 4
# Country Regioncode Region Variable
# * <chr> <chr> <chr> <chr>
# 1 USA ASI Asia EMP
# 2 USA ASI Asia EMP
# 3 USA ASI Asia EMP
# 4 USA ASI Asia EMP
# 5 USA ASI Asia EMP
# 6 USA ASI Asia EMP
# 7 USA ASI Asia EMP
# 8 USA ASI Asia EMP
# 9 USA ASI Asia EMP
# 10 USA ASI Asia EMP
# # ... with 5,017 more rows
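Since the operations are numbered as in the list above, the TRA argument also accepts the corresponding integer code; assuming that numbering, the following call should be equivalent to the median-centering example above:
# Integer code 3 corresponds to "-" (subtract), so this should match fmedian(TRA = "-")
GGDC10S %>% num_vars %>% fmedian(TRA = 3)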
We can also easily specify code to demean, scale or compute percentages1 by groups:
# Demeaning sectoral data by Variable and Country (within transformation)
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmean(TRA = "-")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -446. -4505. -941. -216. -895. -1942. -634. -1358. -2368. -771. -14074.
# 6 VA BWA -446. -4506. -941. -216. -894. -1941. -633. -1357. -2367. -770. -14072.
# 7 VA BWA -444. -4507. -941. -216. -894. -1940. -633. -1357. -2366. -770. -14069.
# 8 VA BWA -443. -4506. -941. -216. -894. -1944. -634. -1357. -2366. -770. -14070.
# 9 VA BWA -441. -4507. -941. -216. -894. -1943. -633. -1358. -2368. -771. -14071.
# 10 VA BWA -440. -4503. -939. -216. -892. -1942. -633. -1357. -2367. -770. -14061.
# # ... with 5,017 more rows
# Scaling sectoral data by Variable and Country
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fsd(TRA = "/")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 0.0270 5.56e-4 5.23e-4 3.88e-4 5.11e-4 0.00194 0.00154 5.23e-4 0.00134
# 6 VA BWA 0.0260 3.97e-4 7.23e-4 5.03e-4 1.04e-3 0.00220 0.00180 5.83e-4 0.00158
# 7 VA BWA 0.0293 3.13e-4 5.71e-4 7.54e-4 1.04e-3 0.00257 0.00200 6.35e-4 0.00176
# 8 VA BWA 0.0317 3.66e-4 6.66e-4 7.54e-4 6.94e-4 0.00134 0.00160 7.19e-4 0.00195
# 9 VA BWA 0.0349 2.93e-4 5.33e-4 7.54e-4 9.42e-4 0.00161 0.00227 4.83e-4 0.00139
# 10 VA BWA 0.0362 8.34e-4 1.52e-3 2.15e-3 2.69e-3 0.00179 0.00253 5.77e-4 0.00155
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>
# Computing percentages of sectoral data by Variable and Country
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fsum("%")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 0.0750 1.65e-3 0.00166 0.00103 0.00157 0.00682 0.00556 0.00175 0.00432
# 6 VA BWA 0.0724 1.18e-3 0.00230 0.00133 0.00320 0.00772 0.00649 0.00195 0.00511
# 7 VA BWA 0.0814 9.30e-4 0.00182 0.00199 0.00320 0.00903 0.00722 0.00213 0.00571
# 8 VA BWA 0.0881 1.08e-3 0.00212 0.00199 0.00213 0.00471 0.00577 0.00241 0.00631
# 9 VA BWA 0.0971 8.68e-4 0.00170 0.00199 0.00289 0.00565 0.00818 0.00162 0.00451
# 10 VA BWA 0.101 2.47e-3 0.00483 0.00568 0.00825 0.00628 0.00910 0.00193 0.00501
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>
Weighted demeaning and scaling can be computed using:
# Weighted demeaning (within transformation)
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmean(SUM, "-")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 37.5 -1301. -13317. -2965. -529. -2746. -6540. -2157. -4431. -7551. -2613.
# 6 VA BWA 39.3 -1302. -13318. -2964. -529. -2745. -6540. -2156. -4431. -7550. -2613.
# 7 VA BWA 43.1 -1300. -13319. -2965. -528. -2745. -6538. -2156. -4431. -7550. -2612.
# 8 VA BWA 41.4 -1298. -13318. -2964. -528. -2746. -6542. -2156. -4431. -7549. -2612.
# 9 VA BWA 41.1 -1296. -13319. -2965. -528. -2745. -6541. -2156. -4431. -7551. -2613.
# 10 VA BWA 51.2 -1296. -13315. -2963. -528. -2743. -6541. -2155. -4431. -7550. -2613.
# # ... with 5,017 more rows
# Weighted scaling
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fsd(SUM, "/")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 37.5 0.0221 5.29e-4 4.49e-4 4.71e-4 4.56e-4 0.00155 0.00117 4.63e-4
# 6 VA BWA 39.3 0.0214 3.78e-4 6.21e-4 6.10e-4 9.30e-4 0.00175 0.00137 5.15e-4
# 7 VA BWA 43.1 0.0240 2.98e-4 4.90e-4 9.15e-4 9.30e-4 0.00205 0.00152 5.62e-4
# 8 VA BWA 41.4 0.0260 3.48e-4 5.72e-4 9.15e-4 6.20e-4 0.00107 0.00122 6.35e-4
# 9 VA BWA 41.1 0.0287 2.78e-4 4.57e-4 9.15e-4 8.41e-4 0.00128 0.00173 4.27e-4
# 10 VA BWA 51.2 0.0297 7.93e-4 1.30e-3 2.61e-3 2.40e-3 0.00143 0.00192 5.10e-4
# # ... with 5,017 more rows, and 2 more variables: GOV <dbl>, OTH <dbl>
Alternatively we could also replace data points with their groupwise weighted mean or standard deviation:
# This conducts a weighted between transformation (replacing with weighted mean)
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmean(SUM, "replace")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 37.5 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 6 VA BWA 39.3 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 7 VA BWA 43.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 8 VA BWA 41.4 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 9 VA BWA 41.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA BWA 51.2 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows
# This also replaces missing values in each group
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmean(SUM, "replace_fill")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 2 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 3 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 4 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 5 VA BWA 37.5 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 6 VA BWA 39.3 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 7 VA BWA 43.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 8 VA BWA 41.4 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 9 VA BWA 41.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA BWA 51.2 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows
It is also possible to center data points on the global mean, which is achieved by subtracting out group means and adding the overall mean of the data again:
# This group-centers data on the overall mean of the data
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmean(TRA = "-+")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 6 VA BWA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 7 VA BWA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 8 VA BWA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 9 VA BWA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 10 VA BWA 2.53e6 1.86e6 5.54e6 335464. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# # ... with 5,017 more rows, and 1 more variable: SUM <dbl>
Sequential operations such as scaling and then centering are also easily performed:
# This scales and centers (i.e. standardizes) the data
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fsd(TRA = "/") %>% fmean(TRA = "-")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 6 VA BWA -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 7 VA BWA -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676
# 8 VA BWA -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676
# 9 VA BWA -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676
# 10 VA BWA -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675
# # ... with 5,017 more rows
Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data, as shown below:
# This group-centers data on the group-medians and adds the new variables right next to the original ones
add_vars(GGDC10S, seq(7,27,2)) <- GGDC10S %>%
group_by(Variable,Country) %>% get_vars(6:16) %>%
fmedian(TRA = "-") %>% add_stub("demean_")
GGDC10S
# # A tibble: 5,027 x 27
# Country Regioncode Region Variable Year AGR demean_AGR MIN demean_MIN MAN demean_MAN
# * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 16.3 -110. 3.49 -1476. 0.737 -258.
# 6 BWA SSA Sub-s~ VA 1965 15.7 -110. 2.50 -1477. 1.02 -258.
# 7 BWA SSA Sub-s~ VA 1966 17.7 -109. 1.97 -1477. 0.804 -258.
# 8 BWA SSA Sub-s~ VA 1967 19.1 -107. 2.30 -1477. 0.938 -258.
# 9 BWA SSA Sub-s~ VA 1968 21.1 -105. 1.84 -1478. 0.750 -258.
# 10 BWA SSA Sub-s~ VA 1969 21.9 -104. 5.24 -1474. 2.14 -257.
# # ... with 5,017 more rows, and 16 more variables: PU <dbl>, demean_PU <dbl>, CON <dbl>,
# # demean_CON <dbl>, WRT <dbl>, demean_WRT <dbl>, TRA <dbl>, demean_TRA <dbl>, FIRE <dbl>,
# # demean_FIRE <dbl>, GOV <dbl>, demean_GOV <dbl>, OTH <dbl>, demean_OTH <dbl>, SUM <dbl>,
# # demean_SUM <dbl>
rm(GGDC10S)
Certainly there are lots of other examples one could construct using the 8 operations and 13 functions listed above; the examples provided just outline the suggested programming basics.
The TRA Function
Behind the scenes of the TRA = ... argument, the fast functions first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to replace or sweep them out of the data points in one of the 8 ways described above. This function can however also be called directly by the name TRA (shorthand for ‘transforming’ data by replacing or sweeping out statistics). Fundamentally, TRA is a generalization of base::sweep for column-wise grouped operations2. Direct calls to TRA enable more control over inputs and outputs.
The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs:
# This divides by the product
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fprod(TRA = "/")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA
# 5 VA BWA 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92
# 6 VA BWA 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92
# 7 VA BWA 1.39e-105 1.58e-127 1.53e-101 8.62e-74 8.55e-102 5.26e-113 8.98e-92
# 8 VA BWA 1.51e-105 1.85e-127 1.78e-101 8.62e-74 5.70e-102 2.74e-113 7.18e-92
# 9 VA BWA 1.66e-105 1.48e-127 1.43e-101 8.62e-74 7.74e-102 3.29e-113 1.02e-91
# 10 VA BWA 1.72e-105 4.21e-127 4.07e-101 2.46e-73 2.21e-101 3.66e-113 1.13e-91
# # ... with 5,017 more rows, and 4 more variables: FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>
# Same thing
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% TRA(fprod(.),"/") # [same as TRA(.,fprod(.),"/")]
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA
# 5 VA BWA 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92
# 6 VA BWA 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92
# 7 VA BWA 1.39e-105 1.58e-127 1.53e-101 8.62e-74 8.55e-102 5.26e-113 8.98e-92
# 8 VA BWA 1.51e-105 1.85e-127 1.78e-101 8.62e-74 5.70e-102 2.74e-113 7.18e-92
# 9 VA BWA 1.66e-105 1.48e-127 1.43e-101 8.62e-74 7.74e-102 3.29e-113 1.02e-91
# 10 VA BWA 1.72e-105 4.21e-127 4.07e-101 2.46e-73 2.21e-101 3.66e-113 1.13e-91
# # ... with 5,017 more rows, and 4 more variables: FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>
TRA.grouped_df was designed such that it matches the columns of statistics (aggregated columns) to those of the original data, and only transforms matching columns while returning the whole data.frame. Thus it is easily possible to apply a transformation to only the first two sectors:
# This only demeans Agriculture (AGR) and Mining (MIN)
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% TRA(fmean(get_vars(.,c("AGR","MIN"))),"-")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -446. -4505. 0.737 0.104 0.660 6.24 1.66 1.12 4.82 2.34 37.5
# 6 VA BWA -446. -4506. 1.02 0.135 1.35 7.06 1.94 1.25 5.70 2.68 39.3
# 7 VA BWA -444. -4507. 0.804 0.203 1.35 8.27 2.15 1.36 6.37 2.99 43.1
# 8 VA BWA -443. -4506. 0.938 0.203 0.897 4.31 1.72 1.54 7.04 3.31 41.4
# 9 VA BWA -441. -4507. 0.750 0.203 1.22 5.17 2.44 1.03 5.03 2.36 41.1
# 10 VA BWA -440. -4503. 2.14 0.578 3.47 5.75 2.72 1.23 5.59 2.63 51.2
# # ... with 5,017 more rows
Another potential use of TRA is to do computations in two or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations between the aggregating and sweeping steps:
# Get grouped tibble
gGGDC <- GGDC10S %>% group_by(Variable,Country)
# Get aggregated data
gsumGGDC <- gGGDC %>% select_at(6:16) %>% fsum
gsumGGDC
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 8.80e4 3230. 1.20e5 6307. 4.60e4 1.23e5 4.02e4 3.89e4 1.27e5 6.15e4 6.54e5
# 2 EMP BOL 5.88e4 3418. 1.43e4 326. 7.49e3 1.72e4 7.04e3 2.72e3 NA 2.41e4 1.35e5
# 3 EMP BRA 1.07e6 12773. 4.33e5 22604. 2.19e5 5.28e5 1.27e5 2.74e5 3.29e5 3.54e5 3.36e6
# 4 EMP BWA 8.84e3 493. 8.49e2 145. 1.19e3 1.71e3 3.93e2 7.21e2 2.87e3 1.30e3 1.85e4
# 5 EMP CHL 4.42e4 6389. 3.94e4 1850. 1.86e4 4.38e4 1.63e4 1.72e4 NA 6.32e4 2.51e5
# 6 EMP CHN 1.73e7 422972. 4.03e6 96364. 1.25e6 1.73e6 8.36e5 2.96e5 1.36e6 1.86e6 2.91e7
# 7 EMP COL 1.89e5 8843. 7.17e4 2068. 3.20e4 1.26e5 2.86e4 3.96e4 NA 1.06e5 6.03e5
# 8 EMP CRI 1.43e4 106. 8.44e3 884. 3.57e3 9.71e3 2.63e3 3.40e3 7.94e3 4.04e3 5.50e4
# 9 EMP DEW 1.05e5 17083. 3.56e5 9499. 8.79e4 1.87e5 6.23e4 7.09e4 1.66e5 4.20e4 1.10e6
# 10 EMP DNK 1.51e4 514. 3.25e4 881. 1.10e4 2.91e4 1.03e4 1.16e4 3.51e4 7.13e3 1.53e5
# # ... with 75 more rows
# Get transformed (scaled) data
TRA(gGGDC, gsumGGDC, "/")
# # A tibble: 5,027 x 16
# # Groups: Variable, Country [85]
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT
# * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 7.50e-4 1.65e-5 1.66e-5 1.03e-5 1.57e-5 6.82e-5
# 6 BWA SSA Sub-s~ VA 1965 7.24e-4 1.18e-5 2.30e-5 1.33e-5 3.20e-5 7.72e-5
# 7 BWA SSA Sub-s~ VA 1966 8.14e-4 9.30e-6 1.82e-5 1.99e-5 3.20e-5 9.03e-5
# 8 BWA SSA Sub-s~ VA 1967 8.81e-4 1.08e-5 2.12e-5 1.99e-5 2.13e-5 4.71e-5
# 9 BWA SSA Sub-s~ VA 1968 9.71e-4 8.68e-6 1.70e-5 1.99e-5 2.89e-5 5.65e-5
# 10 BWA SSA Sub-s~ VA 1969 1.01e-3 2.47e-5 4.83e-5 5.68e-5 8.25e-5 6.28e-5
# # ... with 5,017 more rows, and 5 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>,
# # SUM <dbl>
I have already noted above that, whether one uses the TRA argument to the fast statistical functions or calls TRA directly, these data transformations are essentially a two-step process: statistics are first computed and then used to transform the original data. This process is already very efficient since all functions are written in C++, and programmatically separating the computation of statistics from the data transformation allows for unlimited combinations and drastically simplifies the code base of this package.
Nonetheless, there are of course more memory-efficient and faster ways to program such data transformations, which principally involve doing them column-by-column with a single C++ function. To ensure that this package lives up to the highest standards of performance for common uses, I have implemented such slightly more efficient algorithms for the very commonly applied tasks of centering and averaging data by groups (widely known as ‘between’-group and ‘within’-group transformations), and scaling and centering data by groups (also known as ‘standardizing’ data).
The functions fbetween and fwithin are faster implementations of fmean invoked with different TRA options:
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace")
group_by(Variable,Country) %>% select_at(6:16) %>% fbetween %>% head(2)
# # A tibble: 2 x 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace_fill")
group_by(Variable,Country) %>% select_at(6:16) %>% fbetween(fill = TRUE) %>% head(2)
# # A tibble: 2 x 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 462. 4509. 942. 216. 895. 1948. 635. 1359. 2373. 773. 14112.
# 2 VA BWA 462. 4509. 942. 216. 895. 1948. 635. 1359. 2373. 773. 14112.
GGDC10S %>% # Same as ... %>% fmean(TRA = "-")
group_by(Variable,Country) %>% select_at(6:16) %>% fwithin %>% head(2)
# # A tibble: 2 x 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
GGDC10S %>% # Same as ... %>% fmean(TRA = "-+")
group_by(Variable,Country) %>% select_at(6:16) %>% fwithin(add.global.mean = TRUE) %>% head(2)
# # A tibble: 2 x 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
Apart from higher speed, fwithin has one additional advantage concerning the joint use of weights and the add.global.mean option: ... %>% fmean(w = SUM, TRA = "-+") will not properly group-center the data on the overall weighted mean. Instead, it group-centers the data on a frequency-weighted average of the weighted group means, thus ignoring the different aggregated weights attached to those weighted group means themselves. The reason for this shortcoming is simply that TRA was not designed to take a separate weight vector as input. fwithin(w = SUM, add.global.mean = TRUE) does a better job and properly centers the data on the weighted overall mean after subtracting out the weighted group means:
GGDC10S %>% # This does not center data on a properly computed weighted overall mean
group_by(Variable,Country) %>% select_at(6:16) %>% fmean(SUM, TRA = "-+")
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 37.5 8.72e6 7.25e6 1.74e7 1.01e6 6.43e6 1.05e7 4.86e6 4.85e6 4.99e6
# 6 VA BWA 39.3 8.72e6 7.25e6 1.74e7 1.01e6 6.43e6 1.05e7 4.86e6 4.85e6 4.99e6
# 7 VA BWA 43.1 8.72e6 7.25e6 1.74e7 1.01e6 6.43e6 1.05e7 4.86e6 4.85e6 4.99e6
# 8 VA BWA 41.4 8.72e6 7.25e6 1.74e7 1.01e6 6.43e6 1.05e7 4.86e6 4.85e6 4.99e6
# 9 VA BWA 41.1 8.72e6 7.25e6 1.74e7 1.01e6 6.43e6 1.05e7 4.86e6 4.85e6 4.99e6
# 10 VA BWA 51.2 8.72e6 7.25e6 1.74e7 1.01e6 6.43e6 1.05e7 4.86e6 4.85e6 4.99e6
# # ... with 5,017 more rows, and 1 more variable: OTH <dbl>
GGDC10S %>% # This does a proper job by both subtracting weighted group-means and adding a weighted overall mean
group_by(Variable,Country) %>% select_at(6:16) %>% fwithin(SUM, add.global.mean = TRUE)
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 37.5 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 6 VA BWA 39.3 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 7 VA BWA 43.1 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 8 VA BWA 41.4 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 9 VA BWA 41.1 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 10 VA BWA 51.2 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# # ... with 5,017 more rows, and 1 more variable: OTH <dbl>
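To make the difference concrete, here is a minimal base R sketch of the two centering schemes described above (the vectors x, w and g below are made up for illustration and are not part of the GGDC data):
x <- c(1, 2, 10, 20)           # data
w <- c(1, 1, 3, 3)             # weights
g <- c("a", "a", "b", "b")     # groups
# weighted group means, expanded to the observation level
wgm <- ave(seq_along(x), g, FUN = function(i) weighted.mean(x[i], w[i]))
x - wgm + mean(wgm)            # centered on the frequency-weighted average of the group means (8.25), as with TRA = "-+"
x - wgm + weighted.mean(x, w)  # centered on the weighted overall mean (11.625), as with fwithin(w, add.global.mean = TRUE)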
The sequential scaling and centering ... %>% fsd(TRA = "/") %>% fmean(TRA = "-") shown in an earlier example is likewise not the most efficient approach. The function fscale performs the standardization much more quickly in a single step:
# This efficiently scales and centers (i.e. standardizes) the data
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fscale
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 6 VA BWA -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 7 VA BWA -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676
# 8 VA BWA -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676
# 9 VA BWA -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676
# 10 VA BWA -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675
# # ... with 5,017 more rows
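For comparison, the group-wise standardization above could be written manually in dplyr, roughly as sketched below (AGR:SUM covers columns 6 to 16); fscale computes essentially the same result in a single, much faster pass:
GGDC10S %>%
  group_by(Variable, Country) %>%
  mutate_at(vars(AGR:SUM), function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)) %>%
  head(2)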
It was suggested some time ago that leaving the best wine for the end is not the best strategy when giving a feast. As far as the marriage of collapse and dplyr is concerned, the three functions for time computations introduced in this section combine great flexibility with precision and computing power, and feature amongst the highlights of collapse.
The first function, flag, computes sequences of lags and leads on time-series and panel-data. fdiff computes sequences of lagged / leaded and iterated differences on time-series and panel-data, and fgrowth computes lagged / leaded and iterated growth rates, obtained either via the exact computation method or through log-differencing. In addition, none of these functions requires the data to be sorted: they carry out fast computations on completely unordered data as long as a time-variable is supplied that uniquely identifies the data.
Beginning with flag, the following code computes one fully identified panel-lag and one fully identified panel-lead of each variable in the data:
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(5:16) %>% flag(-1:1, Year)
# # A tibble: 5,027 x 36
# # Groups: Variable, Country [85]
# Variable Country Year F1.AGR AGR L1.AGR F1.MIN MIN L1.MIN F1.MAN MAN L1.MAN F1.PU PU
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 1960 NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA 1961 NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA 1962 NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA 1963 16.3 NA NA 3.49 NA NA 0.737 NA NA 0.104 NA
# 5 VA BWA 1964 15.7 16.3 NA 2.50 3.49 NA 1.02 0.737 NA 0.135 0.104
# 6 VA BWA 1965 17.7 15.7 16.3 1.97 2.50 3.49 0.804 1.02 0.737 0.203 0.135
# 7 VA BWA 1966 19.1 17.7 15.7 2.30 1.97 2.50 0.938 0.804 1.02 0.203 0.203
# 8 VA BWA 1967 21.1 19.1 17.7 1.84 2.30 1.97 0.750 0.938 0.804 0.203 0.203
# 9 VA BWA 1968 21.9 21.1 19.1 5.24 1.84 2.30 2.14 0.750 0.938 0.578 0.203
# 10 VA BWA 1969 23.1 21.9 21.1 10.2 5.24 1.84 4.15 2.14 0.750 1.12 0.578
# # ... with 5,017 more rows, and 22 more variables: L1.PU <dbl>, F1.CON <dbl>, CON <dbl>,
# # L1.CON <dbl>, F1.WRT <dbl>, WRT <dbl>, L1.WRT <dbl>, F1.TRA <dbl>, TRA <dbl>, L1.TRA <dbl>,
# # F1.FIRE <dbl>, FIRE <dbl>, L1.FIRE <dbl>, F1.GOV <dbl>, GOV <dbl>, L1.GOV <dbl>, F1.OTH <dbl>,
# # OTH <dbl>, L1.OTH <dbl>, F1.SUM <dbl>, SUM <dbl>, L1.SUM <dbl>
If the time-variable passed does not exactly identify the data (e.g. because of gaps or repeated values within groups), all three functions will issue appropriate error messages. It is also possible to omit the time-variable if one is certain that the data is sorted:
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% flag
# # A tibble: 5,027 x 13
# # Groups: Variable, Country [85]
# Variable Country L1.AGR L1.MIN L1.MAN L1.PU L1.CON L1.WRT L1.TRA L1.FIRE L1.GOV L1.OTH L1.SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 6 VA BWA 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82 2.34 37.5
# 7 VA BWA 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70 2.68 39.3
# 8 VA BWA 17.7 1.97 0.804 0.203 1.35 8.27 2.15 1.36 6.37 2.99 43.1
# 9 VA BWA 19.1 2.30 0.938 0.203 0.897 4.31 1.72 1.54 7.04 3.31 41.4
# 10 VA BWA 21.1 1.84 0.750 0.203 1.22 5.17 2.44 1.03 5.03 2.36 41.1
# # ... with 5,017 more rows
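As a quick illustration of the identification check mentioned above, one could deliberately duplicate a year within a group; flag should then signal an error (a sketch only, the exact error message is not reproduced here):
dup <- GGDC10S
dup$Year[2] <- dup$Year[1]   # create a repeated year within the first (VA, BWA) group
tryCatch(dup %>% group_by(Variable, Country) %>% select_at(5:16) %>% flag(1, Year),
         error = function(e) conditionMessage(e))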
fdiff can compute continuous sequences of lagged, leaded and iterated differences. The code below computes the 1- and 10-year first and second differences of each variable in the data:
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(5:16) %>% fdiff(c(1, 10), 1:2, Year)
# # A tibble: 5,027 x 47
# # Groups: Variable, Country [85]
# Variable Country Year D1.AGR D2.AGR L10D1.AGR L10D2.AGR D1.MIN D2.MIN L10D1.MIN L10D2.MIN D1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 1960 NA NA NA NA NA NA NA NA NA
# 2 VA BWA 1961 NA NA NA NA NA NA NA NA NA
# 3 VA BWA 1962 NA NA NA NA NA NA NA NA NA
# 4 VA BWA 1963 NA NA NA NA NA NA NA NA NA
# 5 VA BWA 1964 NA NA NA NA NA NA NA NA NA
# 6 VA BWA 1965 -0.575 NA NA NA -0.998 NA NA NA 0.282
# 7 VA BWA 1966 1.95 2.53 NA NA -0.525 0.473 NA NA -0.214
# 8 VA BWA 1967 1.47 -0.488 NA NA 0.328 0.854 NA NA 0.134
# 9 VA BWA 1968 1.95 0.488 NA NA -0.460 -0.788 NA NA -0.188
# 10 VA BWA 1969 0.763 -1.19 NA NA 3.41 3.87 NA NA 1.39
# # ... with 5,017 more rows, and 35 more variables: D2.MAN <dbl>, L10D1.MAN <dbl>, L10D2.MAN <dbl>,
# # D1.PU <dbl>, D2.PU <dbl>, L10D1.PU <dbl>, L10D2.PU <dbl>, D1.CON <dbl>, D2.CON <dbl>,
# # L10D1.CON <dbl>, L10D2.CON <dbl>, D1.WRT <dbl>, D2.WRT <dbl>, L10D1.WRT <dbl>, L10D2.WRT <dbl>,
# # D1.TRA <dbl>, D2.TRA <dbl>, L10D1.TRA <dbl>, L10D2.TRA <dbl>, D1.FIRE <dbl>, D2.FIRE <dbl>,
# # L10D1.FIRE <dbl>, L10D2.FIRE <dbl>, D1.GOV <dbl>, D2.GOV <dbl>, L10D1.GOV <dbl>,
# # L10D2.GOV <dbl>, D1.OTH <dbl>, D2.OTH <dbl>, L10D1.OTH <dbl>, L10D2.OTH <dbl>, D1.SUM <dbl>,
# # D2.SUM <dbl>, L10D1.SUM <dbl>, L10D2.SUM <dbl>
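For intuition, the first and second 1-year differences shown above correspond to x - lag(x) and to the difference of that difference. Assuming consecutive years within each group, a manual dplyr sketch for the AGR column alone would be:
GGDC10S %>%
  group_by(Variable, Country) %>%
  mutate(D1.AGR = AGR - lag(AGR),            # first difference
         D2.AGR = D1.AGR - lag(D1.AGR)) %>%  # second (iterated) difference
  select(Year, AGR, D1.AGR, D2.AGR) %>% head(3)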
Finally, fgrowth computes growth rates in the same way. By default exact growth rates are computed, but the user can also request growth rates obtained by log-differencing:
# Exact growth rates, computed as: (x - lag(x)) / lag(x) * 100
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(5:16) %>% fgrowth(c(1, 10), 1:2, Year)
# # A tibble: 5,027 x 47
# # Groups: Variable, Country [85]
# Variable Country Year G1.AGR G2.AGR L10G1.AGR L10G2.AGR G1.MIN G2.MIN L10G1.MIN L10G2.MIN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 1960 NA NA NA NA NA NA NA NA
# 2 VA BWA 1961 NA NA NA NA NA NA NA NA
# 3 VA BWA 1962 NA NA NA NA NA NA NA NA
# 4 VA BWA 1963 NA NA NA NA NA NA NA NA
# 5 VA BWA 1964 NA NA NA NA NA NA NA NA
# 6 VA BWA 1965 -3.52 NA NA NA -28.6 NA NA NA
# 7 VA BWA 1966 12.4 -452. NA NA -21.1 -26.3 NA NA
# 8 VA BWA 1967 8.29 -33.3 NA NA 16.7 -179. NA NA
# 9 VA BWA 1968 10.2 23.1 NA NA -20 -220. NA NA
# 10 VA BWA 1969 3.61 -64.6 NA NA 185. -1026. NA NA
# # ... with 5,017 more rows, and 36 more variables: G1.MAN <dbl>, G2.MAN <dbl>, L10G1.MAN <dbl>,
# # L10G2.MAN <dbl>, G1.PU <dbl>, G2.PU <dbl>, L10G1.PU <dbl>, L10G2.PU <dbl>, G1.CON <dbl>,
# # G2.CON <dbl>, L10G1.CON <dbl>, L10G2.CON <dbl>, G1.WRT <dbl>, G2.WRT <dbl>, L10G1.WRT <dbl>,
# # L10G2.WRT <dbl>, G1.TRA <dbl>, G2.TRA <dbl>, L10G1.TRA <dbl>, L10G2.TRA <dbl>, G1.FIRE <dbl>,
# # G2.FIRE <dbl>, L10G1.FIRE <dbl>, L10G2.FIRE <dbl>, G1.GOV <dbl>, G2.GOV <dbl>, L10G1.GOV <dbl>,
# # L10G2.GOV <dbl>, G1.OTH <dbl>, G2.OTH <dbl>, L10G1.OTH <dbl>, L10G2.OTH <dbl>, G1.SUM <dbl>,
# # G2.SUM <dbl>, L10G1.SUM <dbl>, L10G2.SUM <dbl>
# Log-difference growth rates, computed as: log(x / lag(x)) * 100
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(5:16) %>% fgrowth(c(1, 10), 1:2, Year, logdiff = TRUE)
# # A tibble: 5,027 x 47
# # Groups: Variable, Country [85]
# Variable Country Year Dlog1.AGR Dlog2.AGR L10Dlog1.AGR L10Dlog2.AGR Dlog1.MIN Dlog2.MIN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 1960 NA NA NA NA NA NA
# 2 VA BWA 1961 NaN NA NA NA NaN NA
# 3 VA BWA 1962 NaN NaN NA NA NaN NaN
# 4 VA BWA 1963 NaN NaN NA NA NaN NaN
# 5 VA BWA 1964 NaN NaN NA NA NaN NaN
# 6 VA BWA 1965 -3.59 NaN NA NA -33.6 NaN
# 7 VA BWA 1966 11.7 NaN NA NA -23.6 NaN
# 8 VA BWA 1967 7.96 -38.6 NA NA 15.4 NaN
# 9 VA BWA 1968 9.72 19.9 NA NA -22.3 NaN
# 10 VA BWA 1969 3.55 -101. NA NA 105. NaN
# # ... with 5,017 more rows, and 38 more variables: L10Dlog1.MIN <dbl>, L10Dlog2.MIN <dbl>,
# # Dlog1.MAN <dbl>, Dlog2.MAN <dbl>, L10Dlog1.MAN <dbl>, L10Dlog2.MAN <dbl>, Dlog1.PU <dbl>,
# # Dlog2.PU <dbl>, L10Dlog1.PU <dbl>, L10Dlog2.PU <dbl>, Dlog1.CON <dbl>, Dlog2.CON <dbl>,
# # L10Dlog1.CON <dbl>, L10Dlog2.CON <dbl>, Dlog1.WRT <dbl>, Dlog2.WRT <dbl>, L10Dlog1.WRT <dbl>,
# # L10Dlog2.WRT <dbl>, Dlog1.TRA <dbl>, Dlog2.TRA <dbl>, L10Dlog1.TRA <dbl>, L10Dlog2.TRA <dbl>,
# # Dlog1.FIRE <dbl>, Dlog2.FIRE <dbl>, L10Dlog1.FIRE <dbl>, L10Dlog2.FIRE <dbl>, Dlog1.GOV <dbl>,
# # Dlog2.GOV <dbl>, L10Dlog1.GOV <dbl>, L10Dlog2.GOV <dbl>, Dlog1.OTH <dbl>, Dlog2.OTH <dbl>,
# # L10Dlog1.OTH <dbl>, L10Dlog2.OTH <dbl>, Dlog1.SUM <dbl>, Dlog2.SUM <dbl>, L10Dlog1.SUM <dbl>,
# # L10Dlog2.SUM <dbl>
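As a quick arithmetic check of the two formulas quoted in the comments above, consider a series growing by exactly 10% per period (the vector x below is made up):
x <- c(100, 110, 121)
(x - dplyr::lag(x)) / dplyr::lag(x) * 100  # exact growth rates: NA 10 10
log(x / dplyr::lag(x)) * 100               # log-difference rates: NA ~9.53 ~9.53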
fdiff and fgrowth can also perform leaded (forward) differences and growth rates, although I have so far not come to employ these in my own work (e.g. ... %>% fgrowth(-c(1, 10), 1:2, Year) would compute 1- and 10-year leaded first and second growth rates). Again it is possible to perform sequential operations:
# This computes the 1- and 10-year growth rates, both for the current period and lagged by one period
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(5:16) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year)
# # A tibble: 5,027 x 47
# # Groups: Variable, Country [85]
# Variable Country Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 1960 NA NA NA NA NA NA NA
# 2 VA BWA 1961 NA NA NA NA NA NA NA
# 3 VA BWA 1962 NA NA NA NA NA NA NA
# 4 VA BWA 1963 NA NA NA NA NA NA NA
# 5 VA BWA 1964 NA NA NA NA NA NA NA
# 6 VA BWA 1965 -3.52 NA NA NA -28.6 NA NA
# 7 VA BWA 1966 12.4 -3.52 NA NA -21.1 -28.6 NA
# 8 VA BWA 1967 8.29 12.4 NA NA 16.7 -21.1 NA
# 9 VA BWA 1968 10.2 8.29 NA NA -20 16.7 NA
# 10 VA BWA 1969 3.61 10.2 NA NA 185. -20 NA
# # ... with 5,017 more rows, and 37 more variables: L1.L10G1.MIN <dbl>, G1.MAN <dbl>,
# # L1.G1.MAN <dbl>, L10G1.MAN <dbl>, L1.L10G1.MAN <dbl>, G1.PU <dbl>, L1.G1.PU <dbl>,
# # L10G1.PU <dbl>, L1.L10G1.PU <dbl>, G1.CON <dbl>, L1.G1.CON <dbl>, L10G1.CON <dbl>,
# # L1.L10G1.CON <dbl>, G1.WRT <dbl>, L1.G1.WRT <dbl>, L10G1.WRT <dbl>, L1.L10G1.WRT <dbl>,
# # G1.TRA <dbl>, L1.G1.TRA <dbl>, L10G1.TRA <dbl>, L1.L10G1.TRA <dbl>, G1.FIRE <dbl>,
# # L1.G1.FIRE <dbl>, L10G1.FIRE <dbl>, L1.L10G1.FIRE <dbl>, G1.GOV <dbl>, L1.G1.GOV <dbl>,
# # L10G1.GOV <dbl>, L1.L10G1.GOV <dbl>, G1.OTH <dbl>, L1.G1.OTH <dbl>, L10G1.OTH <dbl>,
# # L1.L10G1.OTH <dbl>, G1.SUM <dbl>, L1.G1.SUM <dbl>, L10G1.SUM <dbl>, L1.L10G1.SUM <dbl>
Using the same data as in section 1.4 (about 1 million observations in 17,000 groups), I run benchmarks of collapse functions against native dplyr solutions:
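The construction of this benchmark dataset is not repeated here; one way to build a comparable dataset is to stack suffixed copies of GGDC10S, roughly as sketched below (the code actually used in section 1.4 may differ):
data <- lapply(1:200, function(i)
          GGDC10S %>% mutate(Variable = paste0(Variable, i), Country = paste0(Country, i))) %>%
  bind_rows() %>%
  select(Variable, Country, AGR:SUM) %>%  # 13 columns
  group_by(Variable, Country)             # roughly 1 million rows in 17,000 groups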
dim(data)
# [1] 1005400 13
GRP(data)
# collapse grouping object of length 1005400 with 17000 ordered groups
#
# Call: GRP.grouped_df(X = data), ordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.00 53.00 62.00 59.14 63.00 65.00
#
# Groups with sizes:
# EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1
# 62 61 62 52 63 62
# ---
# VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99
# 63 52 65 63 52 52
# Grouped sum (mutate has no equivalent of TRA = "replace", which preserves missing values)
system.time(fsum(data, TRA = "replace_fill"))
# user system elapsed
# 0.08 0.00 0.08
system.time(mutate_all(data, sum, na.rm = TRUE))
# user system elapsed
# 0.21 0.05 0.25
# Dividing by the grouped sum
system.time(fsum(data, TRA = "/"))
# user system elapsed
# 0.13 0.02 0.14
system.time(mutate_all(data, function(x) x/sum(x, na.rm = TRUE)))
# user system elapsed
# 0.86 0.05 0.91
# Mean (between transformation)
system.time(fmean(data, TRA = "replace_fill"))
# user system elapsed
# 0.05 0.05 0.09
system.time(fbetween(data, fill = TRUE))
# user system elapsed
# 0.05 0.04 0.10
system.time(mutate_all(data, mean, na.rm = TRUE))
# user system elapsed
# 2.75 0.03 2.78
# De-Mean (within transformation)
system.time(fmean(data, TRA = "-"))
# user system elapsed
# 0.08 0.01 0.10
system.time(fwithin(data))
# user system elapsed
# 0.06 0.03 0.10
system.time(mutate_all(data, function(x) x - mean(x, na.rm = TRUE)))
# user system elapsed
# 2.31 0.08 2.39
# Centering on global mean
system.time(fwithin(data, add.global.mean = TRUE))
# user system elapsed
# 0.08 0.00 0.08
# Weighted Demeaning
system.time(fwithin(data, SUM))
# user system elapsed
# 0.08 0.00 0.08
system.time(fwithin(data, SUM, add.global.mean = TRUE))
# user system elapsed
# 0.06 0.01 0.08
# Scaling
system.time(fsd(data, TRA = "/"))
# user system elapsed
# 0.12 0.03 0.15
system.time(mutate_all(data, function(x) x/sd(x, na.rm = TRUE)))
# user system elapsed
# 3.72 0.02 3.75
# Standardizing
system.time(fscale(data))
# user system elapsed
# 0.10 0.02 0.12
# system.time(mutate_all(data, scale)) # takes about 32 seconds to compute
# Weighted Scaling and standardizing
system.time(fsd(data, SUM, TRA = "/"))
# user system elapsed
# 0.12 0.02 0.14
system.time(fscale(data, SUM))
# user system elapsed
# 0.07 0.03 0.11
# Lags and Leads
system.time(flag(data))
# user system elapsed
# 0.01 0.04 0.04
system.time(mutate_all(data, lag))
# user system elapsed
# 0.18 0.01 0.19
system.time(flag(data, -1))
# user system elapsed
# 0.02 0.03 0.04
system.time(mutate_all(data, lead))
# user system elapsed
# 0.17 0.01 0.19
system.time(flag(data, -1:1))
# user system elapsed
# 0.07 0.04 0.10
# Differences
system.time(fdiff(data))
# user system elapsed
# 0.04 0.02 0.06
system.time(fdiff(data,1,1:2))
# user system elapsed
# 0.11 0.04 0.16
system.time(fdiff(data, c(1,10)))
# user system elapsed
# 0.04 0.06 0.11
system.time(fdiff(data, c(1,10), 1:2))
# user system elapsed
# 0.38 0.08 0.46
# Growth Rates
system.time(fgrowth(data))
# user system elapsed
# 0.06 0.03 0.09
system.time(fgrowth(data,1,1:2))
# user system elapsed
# 0.17 0.05 0.22
system.time(fgrowth(data, c(1,10)))
# user system elapsed
# 0.12 0.04 0.17
system.time(fgrowth(data, c(1,10), 1:2))
# user system elapsed
# 0.36 0.20 0.57
Again, the benchmarks show substantial performance gains from using collapse functions.
Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65–83). Routledge.