collapse is a C/C++ based package for data manipulation in R. Its aims are to facilitate complex data transformation and exploration tasks and to help make R code fast, flexible, parsimonious and programmer friendly. This vignette demonstrates these two points and introduces all of the main features of the package. Apart from this vignette, collapse comes with built-in structured documentation available under help("collapse-documentation") after installing the package, and help("collapse-package") provides a compact set of examples for a quick start. The two other vignettes focus on the integration of collapse with dplyr workflows (highly recommended for dplyr / tidyverse users), and on the integration of collapse with the plm package (+ some advanced programming with panel data).
This vignette utilizes the 2 datasets that come with collapse, wlddev and GGDC10S, as well as a few datasets from base R: mtcars, iris, airquality, and the time series AirPassengers and EuStockMarkets. Below I introduce wlddev and GGDC10S and summarize them using qsu (quick-summary), as I will not spend much time explaining these datasets in the remainder of the vignette. You may choose to skip this section and start with Section 2.
This dataset contains 4 key World Bank Development Indicators covering 216 countries over 59 years. It is a balanced panel with \(216 \times 59 = 12744\) observations.
library(collapse)
head(wlddev)
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.292 NA 114440000
# 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.742 NA 233350000
# 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.185 NA 114880000
# 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.624 NA 236450000
# 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.060 NA 302480000
# 6 Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.495 NA 370250000
# The variables have "label" attributes. Use vlabels() to get and set labels
namlab(wlddev, class = TRUE)
# Variable Class Label
# 1 country character Country Name
# 2 iso3c factor Country Code
# 3 date Date Date Recorded (Fictitious)
# 4 year integer Year
# 5 decade numeric Decade
# 6 region factor Region
# 7 income factor Income Level
# 8 OECD logical Is OECD Member Country?
# 9 PCGDP numeric GDP per capita (constant 2010 US$)
# 10 LIFEEX numeric Life expectancy at birth, total (years)
# 11 GINI numeric GINI index (World Bank estimate)
# 12 ODA numeric Net ODA received (constant 2015 US$)
# This counts the number of non-missing values, more in section 2
fNobs(wlddev)
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 12744 12744 12744 12744 12744 12744 12744 12744 8995 11068 1356 8336
# This counts the number of distinct values, more in section 2
fNdistinct(wlddev)
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 216 216 59 59 7 7 4 2 8995 10048 363 7564
# The countries included:
cat(levels(wlddev$iso3c))
# ABW AFG AGO ALB AND ARE ARG ARM ASM ATG AUS AUT AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BLZ BMU BOL BRA BRB BRN BTN BWA CAF CAN CHE CHI CHL CHN CIV CMR COD COG COL COM CPV CRI CUB CUW CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ERI ESP EST ETH FIN FJI FRA FRO FSM GAB GBR GEO GHA GIB GIN GMB GNB GNQ GRC GRD GRL GTM GUM GUY HKG HND HRV HTI HUN IDN IMN IND IRL IRN IRQ ISL ISR ITA JAM JOR JPN KAZ KEN KGZ KHM KIR KNA KOR KWT LAO LBN LBR LBY LCA LIE LKA LSO LTU LUX LVA MAC MAF MAR MCO MDA MDG MDV MEX MHL MKD MLI MLT MMR MNE MNG MNP MOZ MRT MUS MWI MYS NAM NCL NER NGA NIC NLD NOR NPL NRU NZL OMN PAK PAN PER PHL PLW PNG POL PRI PRT PRY PSE PYF QAT ROU RUS RWA SAU SDN SEN SGP SLB SLE SLV SMR SOM SRB SSD STP SUR SVK SVN SWE SWZ SXM SYC SYR TCA TCD TGO THA TJK TKM TLS TON TTO TUN TUR TUV TZA UGA UKR URY USA UZB VCT VEN VGB VIR VNM VUT WSM XKX YEM ZAF ZMB ZWE
# use descr(wlddev) for a more detailed description of each variable
Of the categorical identifiers, the date variable was artificially generated so that the example dataset contains all common data types frequently encountered in R.
Below I show how this data can be properly summarized using the function qsu. qsu is shorthand for quick-summary and was inspired by the summarize and xtsummarize commands in Stata. Since wlddev is a panel dataset, we would normally like to obtain statistics not just on the overall variation in the data, but also on the variation between country averages vs. the variation within countries over time. We might also be interested in higher moments such as the skewness and the kurtosis. Such a summary is easily implemented using qsu:
qsu(wlddev, pid = ~ iso3c, cols = c(1,4,9:12), vlabels = TRUE, higher = TRUE)
# , , country: Country Name
#
# N/T Mean SD Min Max Skew Kurt
# Overall 12744 - - - - - -
# Between 216 - - - - - -
# Within 59 - - - - - -
#
# , , year: Year
#
# N/T Mean SD Min Max Skew Kurt
# Overall 12744 1989 17.03 1960 2018 -0 1.8
# Between 216 1989 0 1989 1989 - -
# Within 59 1989 17.03 1960 2018 -0 1.8
#
# , , PCGDP: GDP per capita (constant 2010 US$)
#
# N/T Mean SD Min Max Skew Kurt
# Overall 8995 11563.65 18348.41 131.65 191586.64 3.11 16.96
# Between 203 12488.86 19628.37 255.4 141165.08 3.21 17.25
# Within 44.31 11563.65 6334.95 -30529.09 75348.07 0.7 17.05
#
# , , LIFEEX: Life expectancy at birth, total (years)
#
# N/T Mean SD Min Max Skew Kurt
# Overall 11068 63.84 11.45 18.91 85.42 -0.67 2.65
# Between 207 64.53 10.02 39.35 85.42 -0.53 2.23
# Within 53.47 63.84 5.83 33.47 83.86 -0.25 3.75
#
# , , GINI: GINI index (World Bank estimate)
#
# N/T Mean SD Min Max Skew Kurt
# Overall 1356 39.4 9.68 16.2 65.8 0.46 2.29
# Between 161 39.58 8.37 23.37 61.71 0.52 2.67
# Within 8.42 39.4 3.04 23.96 54.8 0.14 5.78
#
# , , ODA: Net ODA received (constant 2015 US$)
#
# N/T Mean SD Min Max Skew Kurt
# Overall 8336 428,746468 819,868971 -1.08038000e+09 2.45521800e+10 7.19 122.9
# Between 178 418,026522 548,293709 423846.15 3.53258914e+09 2.47 10.65
# Within 46.83 428,746468 607,024040 -2.47969577e+09 2.35093916e+10 10.3 298.12
The output above is a 3D array of statistics, which can also be subsetted (using [) or permuted (using aperm()). For each variable, statistics are computed on the Overall (raw) data as well as on the Between-country and Within-country transformed data.
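Saved to an object, this array can be sliced and reshaped like any R array. A minimal sketch (assuming the dimension order as printed above, i.e. transformation x statistic x variable; output omitted):
sum3d <- qsu(wlddev, pid = ~ iso3c, cols = 9:12, higher = TRUE)
sum3d[, , "PCGDP"]        # The PCGDP slice as a single matrix
sum3d["Between", , ]      # Between-country statistics for all four variables
aperm(sum3d, c(3, 2, 1))  # Permuting so that variables come first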
The statistics show that year is individual-invariant (evident from the 0 Between-country standard-deviation), that we have GINI-data on only 161 countries, with on average only 8.42 observations per country, and that PCGDP, LIFEEX and GINI vary more between countries, but ODA received varies more within countries over time. It is a common pattern that the kurtosis increases in within-transformed data, while the skewness decreases in most cases.
Note: Other distributional statistics like the median and quantiles are currently not implemented, for reasons having to do with computation speed (qsu is >10x faster than base::summary and suitable for really large panels) and with the algorithm behind qsu, but they might come in a future update of qsu.
The Groningen Growth and Development Centre 10-Sector Database provides long-run data on sectoral productivity performance in Africa, Asia, and Latin America. Variables covered in the data set are annual series of value added (VA, in local currency), and persons employed (EMP) for 10 broad sectors.
head(GGDC10S)
# # A tibble: 6 x 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s~ VA 1965 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
namlab(GGDC10S, class = TRUE)
# Variable Class Label
# 1 Country character Country
# 2 Regioncode character Region code
# 3 Region character Region
# 4 Variable character Variable
# 5 Year numeric Year
# 6 AGR numeric Agriculture
# 7 MIN numeric Mining
# 8 MAN numeric Manufacturing
# 9 PU numeric Utilities
# 10 CON numeric Construction
# 11 WRT numeric Trade, restaurants and hotels
# 12 TRA numeric Transport, storage and communication
# 13 FIRE numeric Finance, insurance, real estate and business services
# 14 GOV numeric Government services
# 15 OTH numeric Community, social and personal services
# 16 SUM numeric Summation of sector GDP
fNobs(GGDC10S)
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 5027 5027 5027 5027 5027 4364 4355 4355 4354
# CON WRT TRA FIRE GOV OTH SUM
# 4355 4355 4355 4355 3482 4248 4364
fNdistinct(GGDC10S)
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 43 6 6 2 67 4353 4224 4353 4237
# CON WRT TRA FIRE GOV OTH SUM
# 4339 4344 4334 4349 3470 4238 4364
# The countries included:
cat(funique(GGDC10S$Country, ordered = TRUE))
# ARG BOL BRA BWA CHL CHN COL CRI DEW DNK EGY ESP ETH FRA GBR GHA HKG IDN IND ITA JPN KEN KOR MEX MOR MUS MWI MYS NGA NGA(alt) NLD PER PHL SEN SGP SWE THA TWN TZA USA VEN ZAF ZMB
# use descr(GGDC10S) for a more detailed description of each variable
The first problem in summarizing this data is that value added (VA) is in local currency; the second is that it contains 2 different variables (VA and EMP) stacked in the same column. One way of solving the first problem could be to convert the data to percentages by dividing by the overall VA and EMP contained in the last column. A different solution involving grouped scaling is introduced in section 4.4. The second problem is nicely handled by qsu, which can also compute panel-statistics by groups.
# Converting data to percentages of overall VA / EMP
pGGDC10S <- sweep(GGDC10S[6:15], 1, GGDC10S$SUM, "/") * 100
# Summarizing the sectoral data by variable, overall, between and within countries
su <- qsu(pGGDC10S, by = GGDC10S$Variable, pid = GGDC10S[c("Variable","Country")], higher = TRUE)
# This gives a 4D array of summary statistics
str(su)
# 'qsu' num [1:2, 1:7, 1:3, 1:10] 2225 2139 35.1 17.3 26.7 ...
# - attr(*, "dimnames")=List of 4
# ..$ : chr [1:2] "EMP" "VA"
# ..$ : chr [1:7] "N/T" "Mean" "SD" "Min" ...
# ..$ : chr [1:3] "Overall" "Between" "Within"
# ..$ : chr [1:10] "AGR" "MIN" "MAN" "PU" ...
# Permuting this array to a more readable format
aperm(su, c(4,2,3,1))
# , , Overall, EMP
#
# N/T Mean SD Min Max Skew Kurt
# AGR 2225 35.09 26.72 0.16 100 0.49 2.1
# MIN 2216 1.03 1.42 0 9.41 3.13 15.04
# MAN 2216 14.98 8.04 0.58 45.3 0.43 2.85
# PU 2215 0.58 0.36 0.02 2.48 1.26 5.58
# CON 2216 5.66 2.93 0.14 15.99 -0.06 2.27
# WRT 2216 14.92 6.56 0.81 32.8 -0.18 2.32
# TRA 2216 4.82 2.65 0.15 15.05 0.95 4.47
# FIRE 2216 4.65 4.35 0.08 21.77 1.23 4.08
# GOV 1780 13.13 8.08 0 34.89 0.63 2.53
# OTH 2109 8.4 6.64 0.42 34.89 1.4 4.32
#
# , , Between, EMP
#
# N/T Mean SD Min Max Skew Kurt
# AGR 42 35.09 24.12 1 88.33 0.52 2.24
# MIN 42 1.03 1.23 0.03 6.85 2.73 12.33
# MAN 42 14.98 7.04 1.72 32.34 -0.02 2.43
# PU 42 0.58 0.3 0.07 1.32 0.55 2.69
# CON 42 5.66 2.47 0.5 10.37 -0.44 2.33
# WRT 42 14.92 5.26 4 26.77 -0.55 2.73
# TRA 42 4.82 2.47 0.37 12.39 0.98 4.79
# FIRE 42 4.65 3.45 0.15 12.44 0.61 2.59
# GOV 34 13.13 7.28 2.01 29.16 0.39 2.11
# OTH 40 8.4 6.27 1.35 26.4 1.43 4.32
#
# , , Within, EMP
#
# N/T Mean SD Min Max Skew Kurt
# AGR 52.98 26.38 11.5 -5.32 107.49 1.6 11.97
# MIN 52.76 3.4 0.72 -1.41 7.51 -0.2 15.03
# MAN 52.76 17.48 3.89 -1.11 40.4 -0.08 7.4
# PU 52.74 1.39 0.19 0.63 2.55 0.57 7.85
# CON 52.76 5.76 1.56 0.9 12.97 0.31 4.12
# WRT 52.76 15.76 3.91 3.74 29.76 0.33 3.34
# TRA 52.76 6.35 0.96 2.35 11.11 0.27 5.72
# FIRE 52.76 5.82 2.66 -2.98 16 0.55 4.03
# GOV 52.35 13.26 3.51 -2.2 23.61 -0.56 4.73
# OTH 52.73 7.39 2.2 -2.33 17.44 0.29 6.46
#
# , , Overall, VA
#
# N/T Mean SD Min Max Skew Kurt
# AGR 2139 17.31 15.51 0.03 95.22 1.33 4.88
# MIN 2139 5.85 9.1 0 59.06 2.72 10.92
# MAN 2139 20.07 8 0.98 41.63 -0.03 2.68
# PU 2139 2.23 1.11 0 9.19 0.89 6.24
# CON 2139 5.87 2.51 0.6 25.86 1.5 8.96
# WRT 2139 16.63 5.14 4.52 39.76 0.35 3.27
# TRA 2139 7.93 3.11 0.8 25.96 1.01 5.71
# FIRE 2139 7.04 12.71 -151.07 39.17 -6.23 59.87
# GOV 1702 13.41 6.35 0.76 32.51 0.49 2.9
# OTH 2139 6.4 5.84 0.23 31.45 1.5 4.21
#
# , , Between, VA
#
# N/T Mean SD Min Max Skew Kurt
# AGR 43 17.31 13.19 0.61 63.84 1.13 4.71
# MIN 43 5.85 7.57 0.05 27.92 1.71 4.81
# MAN 43 20.07 6.64 4.19 32.11 -0.36 2.62
# PU 43 2.23 0.75 0.45 4.31 0.62 3.87
# CON 43 5.87 1.85 2.94 12.93 1.33 6.5
# WRT 43 16.63 4.38 8.42 26.39 0.29 2.46
# TRA 43 7.93 2.72 2.04 14.89 0.64 3.67
# FIRE 43 7.04 9.03 -35.61 23.87 -2.67 15.1
# GOV 35 13.41 5.87 1.98 27.77 0.52 3.04
# OTH 43 6.4 5.61 1.12 19.53 1.33 3.2
#
# , , Within, VA
#
# N/T Mean SD Min Max Skew Kurt
# AGR 49.74 26.38 8.15 5.24 94.35 1.23 9.53
# MIN 49.74 3.4 5.05 -20.05 35.71 0.34 13.1
# MAN 49.74 17.48 4.46 1.12 36.35 -0.19 3.93
# PU 49.74 1.39 0.82 -1.09 6.27 0.53 5.35
# CON 49.74 5.76 1.7 -0.35 18.69 0.75 6.38
# WRT 49.74 15.76 2.69 4.65 32.67 0.23 4.5
# TRA 49.74 6.35 1.5 0.92 18.6 0.7 10.11
# FIRE 49.74 5.82 8.94 -109.63 54.12 -2.77 54.6
# GOV 48.63 13.26 2.42 5.12 22.85 0.17 3.31
# OTH 49.74 7.39 1.62 -0.92 19.31 0.73 9.66
The statistics show that the dataset is very consistent: Employment data cover 42 countries and 53 time-periods in almost all sectors. Agriculture is the largest sector in terms of employment, amounting to a 35% share of employment across countries and time, with a standard deviation (SD) of around 27%. The between-country SD in agricultural employment share is 24% and the within SD is 12%, indicating that processes of structural change are very gradual and most of the variation in structure is between countries. The next largest sectors after agriculture are manufacturing, wholesale and retail trade and government, each claiming an approx. 15% share of the economy. In these sectors the between-country SD is also about twice as large as the within-country SD.
In terms of value added, the data covers 43 countries in 50 time-periods. Agriculture, manufacturing, wholesale and retail trade and government are also the largest sectors in terms of VA, but with a diminished agricultural share (around 17%) and a greater share for manufacturing (around 20%). The variation between countries is again greater than the variation within countries, but it seems that at least in terms of agricultural VA share there is also a considerable within-country SD of 8%. This is also true for the finance and real estate sector with a within SD of 9%, suggesting (using a bit of common sense) that a diminishing VA share in agriculture and increased VA share in finance and real estate was a pattern characterizing most of the countries in this sample.
I note that these two examples have not yet exhausted the capabilities of qsu, which can also compute weighted versions of all the above statistics and output a list of matrices instead of a higher-dimensional array. It is of course also possible to compute conventional and weighted statistics on cross-sectional data using qsu.
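For instance, a minimal sketch using qsu's w (weights) and array arguments, here with an arbitrary random weight vector purely for illustration (output omitted):
wt <- abs(rnorm(nrow(wlddev)))                          # Random weights, for illustration only
qsu(wlddev, pid = ~ iso3c, w = wt, cols = 9:12)         # Weighted overall, between and within statistics
qsu(wlddev, pid = ~ iso3c, cols = 9:12, array = FALSE)  # Returns a list of matrices instead of an array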
As a final step I introduce a plot function which can be used to plot the structural transformation of any supported country. Below I do so for Tanzania.
library(data.table)
library(ggplot2)
plotGGDC <- function(ctry) {
  dat <- qDT(GGDC10S)[Country == ctry]
  dat <- cbind(get_vars(dat, c("Variable","Year")),
               replace_outliers(sweep(get_vars(dat, 6:15), 1, dat$SUM, "/"), 0, NA, "min"))
  dat$Variable <- Recode(dat$Variable, "VA" = "Value Added Share", "EMP" = "Employment Share")
  dat <- melt(dat, 1:2, variable.name = "Sector")
  ggplot(aes(x = Year, y = value, fill = Sector), data = dat) +
    geom_area(position = "fill", alpha = 0.9) + labs(x = NULL, y = NULL) +
    theme_linedraw(base_size = 14) + facet_wrap( ~ Variable) +
    scale_fill_manual(values = sub("#00FF66FF", "#00CC66", rainbow(10))) +
    scale_x_continuous(breaks = scales::pretty_breaks(n = 7), expand = c(0, 0)) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10), expand = c(0, 0),
                       labels = scales::percent) +
    theme(axis.text.x = element_text(angle = 315, hjust = 0, margin = ggplot2::margin(t = 0)),
          strip.background = element_rect(colour = "grey20", fill = "grey20"),
          strip.text = element_text(face = "bold"))
}
# Plotting the structural transformation of Tanzania
plotGGDC("TZA")
A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct), which are able to dramatically speed up column-wise, grouped and weighted computations on vectors, matrices or data.frames. The basic syntax common to all of these functions is:
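A schematic of the shared arguments, reconstructed from the descriptions that follow (FUN stands for any of the functions above; see the individual help pages for the exact signatures and defaults):
FUN(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE, drop = TRUE)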
where x is a vector, matrix or data.frame, g takes groups supplied as a vector, factor, list of vectors or GRP object, and w takes a weight vector (available only to fmean, fmode, fvar and fsd). TRA can be used to transform x using the computed statistics and one of 8 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*", discussed in section 4.3). na.rm efficiently removes missing values and is TRUE by default. use.g.names = TRUE generates new row-names from the unique groups supplied to g, and drop = TRUE returns a vector when performing simple (non-grouped) computations on matrix or data.frame columns.
With that in mind, let’s start with some simple examples. To calculate the mean of each column in a data.frame or matrix, it is sufficient to type:
fmean(mtcars)
# mpg cyl disp hp drat wt qsec vs am
# 20.090625 6.187500 230.721875 146.687500 3.596562 3.217250 17.848750 0.437500 0.406250
# gear carb
# 3.687500 2.812500
fmean(mtcars, drop = FALSE) # This returns a 1-row data-frame
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 20.09062 6.1875 230.7219 146.6875 3.596562 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
m <- qM(mtcars) # This quickly converts objects to matrices
fmean(m)
# mpg cyl disp hp drat wt qsec vs am
# 20.090625 6.187500 230.721875 146.687500 3.596562 3.217250 17.848750 0.437500 0.406250
# gear carb
# 3.687500 2.812500
fmean(m, drop = FALSE) # This returns a 1-row matrix
#           mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
# [1,] 20.09062 6.1875 230.7219 146.6875 3.596562 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
It is also possible to calculate fast groupwise statistics by simply passing grouping vectors or lists of grouping vectors to the fast functions:
fmean(mtcars, mtcars$cyl)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 4 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
# 6 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
# 8 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
fmean(mtcars, mtcars[c("cyl","vs","am")])
# mpg cyl disp hp drat wt qsec vs am gear carb
# 4.0.1 26.00000 4 120.3000 91.00000 4.430000 2.140000 16.70000 0 1 5.000000 2.000000
# 4.1.0 22.90000 4 135.8667 84.66667 3.770000 2.935000 20.97000 1 0 3.666667 1.666667
# 4.1.1 28.37143 4 89.8000 80.57143 4.148571 2.028286 18.70000 1 1 4.142857 1.428571
# 6.0.1 20.56667 6 155.0000 131.66667 3.806667 2.755000 16.32667 0 1 4.333333 4.666667
# 6.1.0 19.12500 6 204.5500 115.25000 3.420000 3.388750 19.21500 1 0 3.500000 2.500000
# 8.0.0 15.05000 8 357.6167 194.16667 3.120833 4.104083 17.14250 0 0 3.000000 3.083333
# 8.0.1 15.40000 8 326.0000 299.50000 3.880000 3.370000 14.55000 0 1 5.000000 6.000000
In the example above we might be inclined to remove the grouping columns from the output, as the unique row-names already indicate the combination of grouping variables. This can be done in a safe and more efficient way using get_vars:
# Getting column indices [same as match(c("cyl","vs","am"), names(mtcars)) but gives error if non-matched]
ind <- get_vars(mtcars, c("cyl","vs","am"), return = "indices")
# Subsetting columns with get_vars is 2x faster than [.data.frame
fmean(get_vars(mtcars, -ind), get_vars(mtcars, ind))
# mpg disp hp drat wt qsec gear carb
# 4.0.1 26.00000 120.3000 91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
# 4.1.0 22.90000 135.8667 84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
# 4.1.1 28.37143 89.8000 80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
# 6.0.1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
# 6.1.0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
# 8.0.0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
# 8.0.1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000
get_vars also subsets data.table columns and other data.frame-like classes, and is about 2x the speed of [.data.frame. Replacements of the form get_vars(data, ind) <- newcols are about 4x as fast as data[ind] <- newcols. It is also possible to subset with functions, e.g. get_vars(mtcars, is.ordered), and with regular expressions, e.g. get_vars(mtcars, c("c","v","a"), regex = TRUE) or get_vars(mtcars, "c|v|a", regex = TRUE). Next to get_vars there are also the functions num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to subset and replace data by type.
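To make this concrete, a few quick illustrations on the datasets used in this vignette (output omitted):
head(get_vars(mtcars, "c|v|a", regex = TRUE), 2)  # Regular expression subsetting
head(num_vars(iris), 2)                           # Only numeric columns
head(cat_vars(iris), 2)                           # Only categorical columns (here: Species)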
This programming can become even more efficient when passing factors or grouping objects to the g argument. qF efficiently turns atomic vectors into factors, and the GRP function creates grouping objects (of class GRP) from vectors or lists of columns. By default both are ordered, but they need not be. For multiple variables, GRP is always superior to creating multiple factors and interacting them, and it is also faster than base::interaction for lists of factors.
# This creates an (ordered) factor, about 10x faster than as.factor(mtcars$cyl)
f <- qF(mtcars$cyl, na.exclude = FALSE)
str(f)
# Ord.factor w/ 3 levels "4"<"6"<"8": 2 2 1 2 3 2 3 1 1 2 ...
# This creates a 'GRP' object. Grouping is done via radix ordering in C (using data.table's forder function)
g <- GRP(mtcars, ~ cyl + vs + am) # Using the formula interface, could also use c("cyl","vs","am") or c(2,8:9)
g
# collapse grouping object of length 32 with 7 ordered groups
#
# Call: GRP.default(X = mtcars, by = ~cyl + vs + am), unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.000 2.500 3.000 4.571 5.500 12.000
#
# Groups with sizes:
# 4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1
# 1 3 7 3 4 12 2
plot(g)
With factors or GRP objects, computations are faster since the fast functions would otherwise internally group the vectors every time they are executed. Compared to factors, grouped computations using GRP objects are a bit more efficient, primarily because they require no further checks, while factors are checked for missing values unless a class 'na.included' is attached. By default qF acts just like as.factor and preserves missing values when generating factors. Therefore the most effective way of programming with factors is to create them with qF(x, na.exclude = FALSE). This assigns an underlying integer code to NAs and attaches the class 'na.included', so that no further checks are run on that factor in the collapse ecosystem.
Using the objects just created, it is easy to compute over the same groups with multiple functions:
dat <- get_vars(mtcars, -ind)
# Grouped mean
fmean(dat, f)
# mpg disp hp drat wt qsec gear carb
# 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 4.090909 1.545455
# 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 3.857143 3.428571
# 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 3.285714 3.500000
# Grouped standard-deviation
fsd(dat, f)
# mpg disp hp drat wt qsec gear carb
# 4 4.509828 26.87159 20.93453 0.3654711 0.5695637 1.682445 0.5393599 0.522233
# 6 1.453567 41.56246 24.26049 0.4760552 0.3563455 1.706866 0.6900656 1.812654
# 8 2.560048 67.77132 50.97689 0.3723618 0.7594047 1.196014 0.7262730 1.556624
fsd(dat, g)
# mpg disp hp drat wt qsec gear carb
# 4.0.1 NA NA NA NA NA NA NA NA
# 4.1.0 1.4525839 13.969371 19.65536 0.1300000 0.4075230 1.67143651 0.5773503 0.5773503
# 4.1.1 4.7577005 18.802128 24.14441 0.3783926 0.4400840 0.94546285 0.3779645 0.5345225
# 6.0.1 0.7505553 8.660254 37.52777 0.1616581 0.1281601 0.76872188 0.5773503 1.1547005
# 6.1.0 1.6317169 44.742634 9.17878 0.5919459 0.1162164 0.81590441 0.5773503 1.7320508
# 8.0.0 2.7743959 71.823494 33.35984 0.2302749 0.7683069 0.80164745 0.0000000 0.9003366
# 8.0.1 0.5656854 35.355339 50.20458 0.4808326 0.2828427 0.07071068 0.0000000 2.8284271
Now suppose we wanted to create a new dataset which contains the mean, sd, min and max of the variables mpg and disp grouped by cyl, vs and am:
dat <- get_vars(mtcars, c("mpg", "disp"))
# add_stub is a collapse function that adds a prefix (default) or suffix to column names
cbind(add_stub(fmean(dat, g), "mean_"),
add_stub(fsd(dat, g), "sd_"),
add_stub(fmin(dat, g), "min_"),
add_stub(fmax(dat, g), "max_"))
# mean_mpg mean_disp sd_mpg sd_disp min_mpg min_disp max_mpg max_disp
# 4.0.1 26.00000 120.3000 NA NA 26.0 120.3 26.0 120.3
# 4.1.0 22.90000 135.8667 1.4525839 13.969371 21.5 120.1 24.4 146.7
# 4.1.1 28.37143 89.8000 4.7577005 18.802128 21.4 71.1 33.9 121.0
# 6.0.1 20.56667 155.0000 0.7505553 8.660254 19.7 145.0 21.0 160.0
# 6.1.0 19.12500 204.5500 1.6317169 44.742634 17.8 167.6 21.4 258.0
# 8.0.0 15.05000 357.6167 2.7743959 71.823494 10.4 275.8 19.2 472.0
# 8.0.1 15.40000 326.0000 0.5656854 35.355339 15.0 301.0 15.8 351.0
We could also calculate groupwise weighted means and standard deviations using a weight vector, and we could decide to include the original grouping columns and omit the generated row-names, as shown below.
There is also a collapse function add_vars, which serves as a much faster and more versatile alternative to cbind.data.frame. The intention behind add_vars is to efficiently add multiple columns to an existing data.frame. Thus in a call add_vars(data, newcols1, newcols2), newcols1 and newcols2 are added (by default) at the end of data, while preserving all attributes of data.
# This generates a random vector of weights
weights <- abs(rnorm(nrow(mtcars)))
# Grouped and weighted mean and sd and grouped min and max, combined using add_vars
add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"))
# cyl vs am w_mean_mpg w_mean_disp w_sd_mpg w_sd_disp min_mpg min_disp max_mpg max_disp
# 1 4 0 1 26.00000 120.30000 0.000000 0.00000 26.0 120.3 26.0 120.3
# 2 4 1 0 22.77276 138.51716 1.707875 18.72771 21.5 120.1 24.4 146.7
# 3 4 1 1 29.52737 81.64415 4.674793 16.42655 21.4 71.1 33.9 121.0
# 4 6 0 1 20.52959 154.57224 1.194314 13.78055 19.7 145.0 21.0 160.0
# 5 6 1 0 18.47185 208.18111 1.438912 42.94401 17.8 167.6 21.4 258.0
# 6 8 0 0 15.46451 335.07016 2.182173 65.12019 10.4 275.8 19.2 472.0
# 7 8 0 1 15.27441 318.15046 0.736511 46.03194 15.0 301.0 15.8 351.0
We can also use add_vars to bind columns in a different order than they are passed. Specifying add_vars(data, newcols1, newcols2, pos = "front") would be equivalent to add_vars(newcols1, newcols2, data) while keeping the attributes of data. Moreover, it is also possible to pass a vector of positions that the new columns should have in the combined data:
# Binding and reordering columns in a single step: Add columns in specific positions
add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"),
pos = c(4,8,5,9,6,10,7,11))
# cyl vs am w_mean_mpg w_sd_mpg min_mpg max_mpg w_mean_disp w_sd_disp min_disp max_disp
# 1 4 0 1 26.00000 0.000000 26.0 26.0 120.30000 0.00000 120.3 120.3
# 2 4 1 0 22.77276 1.707875 21.5 24.4 138.51716 18.72771 120.1 146.7
# 3 4 1 1 29.52737 4.674793 21.4 33.9 81.64415 16.42655 71.1 121.0
# 4 6 0 1 20.52959 1.194314 19.7 21.0 154.57224 13.78055 145.0 160.0
# 5 6 1 0 18.47185 1.438912 17.8 21.4 208.18111 42.94401 167.6 258.0
# 6 8 0 0 15.46451 2.182173 10.4 19.2 335.07016 65.12019 275.8 472.0
# 7 8 0 1 15.27441 0.736511 15.0 15.8 318.15046 46.03194 301.0 351.0
As a final layer of added complexity, we could utilize the TRA argument to generate groupwise weighted demeaned and scaled data, with additional columns giving the group minimum and maximum values:
head(add_vars(get_vars(mtcars, ind),
add_stub(fmean(dat, g, weights, "-"), "w_demean_"), # This calculates weighted group means and uses them to demean the data
add_stub(fsd(dat, g, weights, "/"), "w_scale_"), # This calculates weighted group sd's and uses them to scale the data
add_stub(fmin(dat, g, "replace"), "min_"), # This replaces all observations by their group-minimum
add_stub(fmax(dat, g, "replace"), "max_"))) # This replaces all observations by their group-maximum
# cyl vs am w_demean_mpg w_demean_disp w_scale_mpg w_scale_disp min_mpg min_disp
# Mazda RX4 6 0 1 0.4704056 5.427756 17.583310 11.610567 19.7 145.0
# Mazda RX4 Wag 6 0 1 0.4704056 5.427756 17.583310 11.610567 19.7 145.0
# Datsun 710 4 1 1 -6.7273707 26.355848 4.877221 6.574723 21.4 71.1
# Hornet 4 Drive 6 1 0 2.9281456 49.818890 14.872349 6.007823 17.8 167.6
# Hornet Sportabout 8 0 0 3.2354853 24.929837 8.569441 5.528239 10.4 275.8
# Valiant 6 1 0 -0.3718544 16.818890 12.578950 5.239380 17.8 167.6
# max_mpg max_disp
# Mazda RX4 21.0 160
# Mazda RX4 Wag 21.0 160
# Datsun 710 33.9 121
# Hornet 4 Drive 21.4 258
# Hornet Sportabout 19.2 472
# Valiant 21.4 258
It is also possible to use add_vars<- to add these columns to mtcars itself. By default they would be added at the end, but we can also specify positions:
# This defines the positions where we want to add these columns
pos <- c(2,8,3,9,4,10,5,11)
add_vars(mtcars, pos) <- c(add_stub(fmean(dat, g, weights, "-"), "w_demean_"),
add_stub(fsd(dat, g, weights, "/"), "w_scale_"),
add_stub(fmin(dat, g, "replace"), "min_"),
add_stub(fmax(dat, g, "replace"), "max_"))
head(mtcars)
# mpg w_demean_mpg w_scale_mpg min_mpg max_mpg cyl disp w_demean_disp w_scale_disp
# Mazda RX4 21.0 0.4704056 17.583310 19.7 21.0 6 160 5.427756 11.610567
# Mazda RX4 Wag 21.0 0.4704056 17.583310 19.7 21.0 6 160 5.427756 11.610567
# Datsun 710 22.8 -6.7273707 4.877221 21.4 33.9 4 108 26.355848 6.574723
# Hornet 4 Drive 21.4 2.9281456 14.872349 17.8 21.4 6 258 49.818890 6.007823
# Hornet Sportabout 18.7 3.2354853 8.569441 10.4 19.2 8 360 24.929837 5.528239
# Valiant 18.1 -0.3718544 12.578950 17.8 21.4 6 225 16.818890 5.239380
# min_disp max_disp hp drat wt qsec vs am gear carb
# Mazda RX4 145.0 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 145.0 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 71.1 121 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 167.6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 275.8 472 175 3.15 3.440 17.02 0 0 3 2
# Valiant 167.6 258 105 2.76 3.460 20.22 1 0 3 1
rm(mtcars)
The examples above could be made more involved using the full set of Fast Statistical Functions, and also employing all of the vector-valued functions and operators (fscale/STD, fbetween/B, fwithin/W, fHDbetween/HDB, fHDwithin/HDW, flag/L/F, fdiff/D, fgrowth/G) discussed later. They merely provide suggestions for the use of these features and are focused on programming with data.frames (as the functions get_vars, add_vars etc. are made for data.frames). The Fast Statistical Functions however work equally well on vectors and matrices. Not discussed so far is the set of functions qDF, qDT and qM, which deliver very fast conversions between matrices, data.frames and data.tables.
Using collapse's fast functions and the programming principles laid out here can speed up grouped computations by orders of magnitude - even compared to packages like dplyr or data.table (see e.g. the benchmarks provided further down). Simple column-wise computations on matrices are also slightly faster than with base functions like colMeans and colSums, and of course a lot faster than applying these base functions to data.frames (which involves a conversion to matrix). Fast row-wise operations are not really the focus of collapse for the moment, partly because they are less common. Using conversions with qM together with base functions like rowSums however does a very decent job of speeding them up (e.g. compare the speed of rowSums(qM(mtcars)) against rowSums(mtcars)).
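For example, a quick way to run that comparison (timings will of course vary by machine):
library(microbenchmark)
microbenchmark(rowSums(mtcars), rowSums(qM(mtcars)))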
The kind of advanced groupwise programming introduced in the previous section is the fastest and most customizable way of dealing with many data transformation problems, and it is also made highly compatible with workflows in packages like dplyr and plm (see the two vignettes on these subjects). Some tasks, such as multivariate aggregations on a single data.frame, are however so common that they demand a more compact solution which efficiently integrates multiple computational steps: collap is a fast multi-purpose aggregation command designed to solve complex aggregation problems efficiently and with a minimum of coding. collap performs optimally together with the Fast Statistical Functions, but will also work with other functions.
To perform the above aggregation with collap, one would simply need to type:
collap(mtcars, mpg + disp ~ cyl + vs + am, list(fmean, fsd, fmin, fmax), keep.col.order = FALSE)
# cyl vs am fmean.mpg fmean.disp fsd.mpg fsd.disp fmin.mpg fmin.disp fmax.mpg fmax.disp
# 1 4 0 1 26.00000 120.3000 NA NA 26.0 120.3 26.0 120.3
# 2 4 1 0 22.90000 135.8667 1.4525839 13.969371 21.5 120.1 24.4 146.7
# 3 4 1 1 28.37143 89.8000 4.7577005 18.802128 21.4 71.1 33.9 121.0
# 4 6 0 1 20.56667 155.0000 0.7505553 8.660254 19.7 145.0 21.0 160.0
# 5 6 1 0 19.12500 204.5500 1.6317169 44.742634 17.8 167.6 21.4 258.0
# 6 8 0 0 15.05000 357.6167 2.7743959 71.823494 10.4 275.8 19.2 472.0
# 7 8 0 1 15.40000 326.0000 0.5656854 35.355339 15.0 301.0 15.8 351.0
The original idea behind collap is however better demonstrated with a different dataset. Consider the World Development Dataset wlddev included in the package and introduced in section 1:
head(wlddev)
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.292 NA 114440000
# 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.742 NA 233350000
# 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.185 NA 114880000
# 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.624 NA 236450000
# 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.060 NA 302480000
# 6 Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.495 NA 370250000
Suppose we would like to aggregate this data by country and decade, but keep all the categorical information. With collap this is extremely simple:
head(collap(wlddev, ~ iso3c + decade))
# country iso3c date year decade region income OECD PCGDP
# 1 Aruba ABW 1961-01-01 1962.5 1960 Latin America & Caribbean High income FALSE NA
# 2 Aruba ABW 1967-01-01 1970.0 1970 Latin America & Caribbean High income FALSE NA
# 3 Aruba ABW 1976-01-01 1980.0 1980 Latin America & Caribbean High income FALSE NA
# 4 Aruba ABW 1987-01-01 1990.0 1990 Latin America & Caribbean High income FALSE 23677.09
# 5 Aruba ABW 1996-01-01 2000.0 2000 Latin America & Caribbean High income FALSE 26766.93
# 6 Aruba ABW 2007-01-01 2010.0 2010 Latin America & Caribbean High income FALSE 25238.80
# LIFEEX GINI ODA
# 1 66.58583 NA NA
# 2 69.14178 NA NA
# 3 72.17600 NA 33630000
# 4 73.45356 NA 41563333
# 5 73.85773 NA 19857000
# 6 75.01078 NA NA
Note that the columns of the data are in the original order and also retain all their attributes. To understand this result, let us briefly examine the syntax of collap:
collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, custom = NULL,
keep.by = TRUE, keep.col.order = TRUE, sort.row = TRUE,
parallel = FALSE, mc.cores = 1L,
return = c("wide","list","long","long_dupl"), give.names = "auto") # , ...
It is clear that X is the data and by supplies the grouping information, which can be a one- or two-sided formula or alternatively grouping vectors, factors, lists and GRP objects (like the Fast Statistical Functions). Then FUN provides the function(s) applied only to numeric variables in X and defaults to the mean, while catFUN provides the function(s) applied only to categorical variables in X and defaults to a fast implementation of the statistical mode. keep.col.order = TRUE specifies that the data is to be returned with the original column order. Thus in the above example it was sufficient to supply X and by, and collap did the rest for us.
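For instance, we could aggregate the numeric columns with fmedian and the categorical columns with ffirst instead, a small variation on the call above (output omitted):
head(collap(wlddev, ~ iso3c + decade, FUN = fmedian, catFUN = ffirst))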
Suppose we only want to aggregate the 4 series in this dataset. This can be done utilizing the cols argument:
head(collap(wlddev, ~ iso3c + decade, cols = 9:12))
# iso3c decade PCGDP LIFEEX GINI ODA
# 1 ABW 1960 NA 66.58583 NA NA
# 2 ABW 1970 NA 69.14178 NA NA
# 3 ABW 1980 NA 72.17600 NA 33630000
# 4 ABW 1990 23677.09 73.45356 NA 41563333
# 5 ABW 2000 26766.93 73.85773 NA 19857000
# 6 ABW 2010 25238.80 75.01078 NA NA
As before, we could use multiple functions by putting them in a named or unnamed list:
head(collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12))
# iso3c decade fmean.PCGDP fmedian.PCGDP fsd.PCGDP fmean.LIFEEX fmedian.LIFEEX fsd.LIFEEX fmean.GINI
# 1 ABW 1960 NA NA NA 66.58583 66.6155 0.6595475 NA
# 2 ABW 1970 NA NA NA 69.14178 69.1400 0.9521791 NA
# 3 ABW 1980 NA NA NA 72.17600 72.2930 0.8054561 NA
# 4 ABW 1990 23677.09 25357.79 4100.7901 73.45356 73.4680 0.1152921 NA
# 5 ABW 2000 26766.93 26966.05 834.3735 73.85773 73.7870 0.2217034 NA
# 6 ABW 2010 25238.80 24629.08 1580.8698 75.01078 75.0160 0.3942914 NA
# fmedian.GINI fsd.GINI fmean.ODA fmedian.ODA fsd.ODA
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA 33630000 33630000 NA
# 4 NA NA 41563333 36710000 16691094
# 5 NA NA 19857000 16530000 28602034
# 6 NA NA NA NA NA
With multiple functions, we could also request collap to return the data in a long format:
head(collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12, return = "long"))
# Function iso3c decade PCGDP LIFEEX GINI ODA
# 1 fmean ABW 1960 NA 66.58583 NA NA
# 2 fmean ABW 1970 NA 69.14178 NA NA
# 3 fmean ABW 1980 NA 72.17600 NA 33630000
# 4 fmean ABW 1990 23677.09 73.45356 NA 41563333
# 5 fmean ABW 2000 26766.93 73.85773 NA 19857000
# 6 fmean ABW 2010 25238.80 75.01078 NA NA
The final feature of collap I want to highlight at this point is the custom argument, which allows the user to circumvent the broad distinction between numeric and categorical data (and the associated FUN and catFUN arguments) and to specify exactly which columns to aggregate using which functions:
head(collap(wlddev, ~ iso3c + decade,
custom = list(fmean = 9:12, fsd = 9:12,
ffirst = c("country","region","income"),
flast = c("year","date"),
fmode = "OECD")))
# ffirst.country iso3c flast.date flast.year decade ffirst.region ffirst.income
# 1 Aruba ABW 1966-01-01 1965 1960 Latin America & Caribbean High income
# 2 Aruba ABW 1975-01-01 1974 1970 Latin America & Caribbean High income
# 3 Aruba ABW 1986-01-01 1985 1980 Latin America & Caribbean High income
# 4 Aruba ABW 1995-01-01 1994 1990 Latin America & Caribbean High income
# 5 Aruba ABW 2006-01-01 2005 2000 Latin America & Caribbean High income
# 6 Aruba ABW 2015-01-01 2014 2010 Latin America & Caribbean High income
# fmode.OECD fmean.PCGDP fsd.PCGDP fmean.LIFEEX fsd.LIFEEX fmean.GINI fsd.GINI fmean.ODA fsd.ODA
# 1 FALSE NA NA 66.58583 0.6595475 NA NA NA NA
# 2 FALSE NA NA 69.14178 0.9521791 NA NA NA NA
# 3 FALSE NA NA 72.17600 0.8054561 NA NA 33630000 NA
# 4 FALSE 23677.09 4100.7901 73.45356 0.1152921 NA NA 41563333 16691094
# 5 FALSE 26766.93 834.3735 73.85773 0.2217034 NA NA 19857000 28602034
# 6 FALSE 25238.80 1580.8698 75.01078 0.3942914 NA NA NA NA
Setting the argument give.names = FALSE generates the output without changing the column names.
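For example, a minimal variation of the previous call (output omitted):
head(collap(wlddev, ~ iso3c + decade,
            custom = list(fmean = 9:12, ffirst = c("country","region","income")),
            give.names = FALSE))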
When it comes to larger aggregation problems, the performance of collapse is in line with data.table, while offering the additional advantage of high-performance weighted and categorical aggregations:
# Creating a data.table with 10 columns and 1 mio. obs, including missing values
testdat <- na_insert(qDT(replicate(10, rnorm(1e6), simplify = FALSE)), prop = 0.1) # 10% missing
testdat[["g1"]] <- sample.int(1000, 1e6, replace = TRUE) # 1000 groups
testdat[["g2"]] <- sample.int(100, 1e6, replace = TRUE) # 100 groups
# The average group size is 10, there are about 100000 groups
GRP(testdat, ~ g1 + g2)
# collapse grouping object of length 1000000 with 99998 ordered groups
#
# Call: GRP.default(X = testdat, by = ~g1 + g2), unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 8 10 10 12 26
#
# Groups with sizes:
# 1.1 1.2 1.3 1.4 1.5 1.6
# 7 13 10 5 16 18
# ---
# 1000.95 1000.96 1000.97 1000.98 1000.99 1000.100
# 10 8 11 14 18 7
# dplyr vs. data.table vs. collap (calling Fast Functions):
library(dplyr)
# Sum
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(sum, na.rm = TRUE))
# user system elapsed
# 0.52 0.01 0.53
system.time(testdat[, lapply(.SD, sum, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 0.17 0.00 0.09
system.time(collap(testdat, ~ g1 + g2, fsum))
# user system elapsed
# 0.1 0.0 0.1
# Product
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(prod, na.rm = TRUE))
# user system elapsed
# 2.67 0.02 2.69
system.time(testdat[, lapply(.SD, prod, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 0.24 0.01 0.21
system.time(collap(testdat, ~ g1 + g2, fprod))
# user system elapsed
# 0.13 0.00 0.13
# Mean
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(mean.default, na.rm = TRUE))
# user system elapsed
# 5.29 0.00 5.29
system.time(testdat[, lapply(.SD, mean, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 0.16 0.05 0.18
system.time(collap(testdat, ~ g1 + g2))
# user system elapsed
# 0.16 0.00 0.16
# Weighted Mean
w <- abs(100*rnorm(1e6)) + 1
testdat[["w"]] <- w
# Seems not possible with dplyr ...
system.time(testdat[, lapply(.SD, weighted.mean, w = w, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 10.92 0.00 10.92
system.time(collap(testdat, ~ g1 + g2, w = w))
# user system elapsed
# 0.16 0.00 0.16
# Maximum
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(max, na.rm = TRUE))
# user system elapsed
# 0.48 0.03 0.52
system.time(testdat[, lapply(.SD, max, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 0.30 0.00 0.25
system.time(collap(testdat, ~ g1 + g2, fmax))
# user system elapsed
# 0.12 0.00 0.12
# Median
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(median.default, na.rm = TRUE))
# user system elapsed
# 46.92 0.00 47.15
system.time(testdat[, lapply(.SD, median, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 0.50 0.00 0.46
system.time(collap(testdat, ~ g1 + g2, fmedian))
# user system elapsed
# 0.70 0.01 0.72
# Variance
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(var, na.rm = TRUE))
# user system elapsed
# 16.31 0.02 16.32
system.time(testdat[, lapply(.SD, var, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 0.64 0.03 0.63
system.time(collap(testdat, ~ g1 + g2, fvar))
# user system elapsed
# 0.21 0.00 0.20
# Note: fvar implements a numerically stable online variance using Welford's algorithm.
# Weighted Variance
# Don't know how to do this fast in dplyr or data.table.
system.time(collap(testdat, ~ g1 + g2, fvar, w = w))
# user system elapsed
# 0.22 0.00 0.22
# Last value
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(last))
# user system elapsed
# 4.17 0.01 4.19
system.time(testdat[, lapply(.SD, last), keyby = c("g1","g2")])
# user system elapsed
# 0.09 0.02 0.06
system.time(collap(testdat, ~ g1 + g2, flast, na.rm = FALSE))
# user system elapsed
# 0.08 0.00 0.08
# Note: collapse functions ffirst and flast by default also remove missing values i.e. take the first and last non-missing data point
# Mode
# Defining a mode function in base R and applying it by groups is very slow, no matter whether you use dplyr or data.table.
# There are solutions suggested on stackoverflow on using chained operations in data.table to compute the mode,
# but those I find rather arcane and they are also not very fast.
system.time(collap(testdat, ~ g1 + g2, fmode))
# user system elapsed
# 1.17 0.03 1.21
# Note: This mode function uses index hashing in C++, it's a blast !
# Weighted Mode
system.time(collap(testdat, ~ g1 + g2, fmode, w = w))
# user system elapsed
# 2.37 0.13 2.50
# Number of Distinct Values
# No straightforward data.table solution..
system.time(testdat %>% group_by(g1,g2) %>% summarize_all(n_distinct, na.rm = TRUE))
# user system elapsed
# 8.04 0.00 8.11
system.time(collap(testdat, ~ g1 + g2, fNdistinct))
# user system elapsed
# 1.08 0.09 1.20
I believe that on really huge datasets aggregated on a multi-core machine, data.table's memory efficiency and thread-parallelization will let it run faster with some GForce-optimized functions, but that does not apply to most users (I have tested up to 10 million obs. on my laptop, where collapse is still very much in line). In comparison to collapse and data.table, the performance of dplyr on this data is rather poor, especially for base functions that, unlike sum, are not highly optimized. I do however very much appreciate the tidyverse ecosystem for highly organized data exploration and transformation. Therefore I have created methods for all of the Fast Statistical Functions as well as collap, enabling them to be used effectively in the dplyr ecosystem, where they produce amazing speed gains. This is the subject of the 'collapse and dplyr' vignette.
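As a brief teaser, a sketch assuming the grouped_df method of fmean behaves as described in that vignette (output omitted):
mtcars %>% group_by(cyl, vs, am) %>% fmean   # Fast grouped mean on a grouped tibble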
Apart from its non-reliance on non-standard evaluation, a central advantage of collapse for programming is the speed it maintains on smaller problems, where its more efficient R code compared to dplyr and data.table really plays out:
# 12000 obs in 1500 groups: A more typical case
GRP(wlddev, ~ iso3c + decade)
# collapse grouping object of length 12744 with 1512 ordered groups
#
# Call: GRP.default(X = wlddev, by = ~iso3c + decade), unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.000 6.000 9.000 8.429 11.000 11.000
#
# Groups with sizes:
# ABW.1960 ABW.1970 ABW.1980 ABW.1990 ABW.2000 ABW.2010
# 6 9 11 9 11 9
# ---
# ZWE.1970 ZWE.1980 ZWE.1990 ZWE.2000 ZWE.2010 ZWE.2020
# 9 11 9 11 9 4
library(microbenchmark)
dtwlddev <- qDT(wlddev)
microbenchmark(dplyr = dtwlddev %>% group_by(iso3c,decade) %>% select_at(9:12) %>% summarise_all(sum, na.rm = TRUE),
data.table = dtwlddev[, lapply(.SD, sum, na.rm = TRUE), by = c("iso3c","decade"), .SDcols = 9:12],
collap = collap(dtwlddev, ~ iso3c + decade, fsum, cols = 9:12),
fast_fun = fsum(get_vars(dtwlddev, 9:12), GRP(dtwlddev, ~ iso3c + decade), use.g.names = FALSE)) # We can gain a bit coding it manually
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 12.560093 13.373157 15.545487 14.007945 14.527154 52.813437 100 c
# data.table 3.087589 3.268543 3.785641 3.497691 3.942377 5.959194 100 b
# collap 1.279393 1.390732 1.502004 1.471502 1.533308 3.221463 100 a
# fast_fun 1.162922 1.260651 1.338360 1.332273 1.400103 1.850144 100 a
# Now going really small:
dtmtcars <- qDT(mtcars)
microbenchmark(dplyr = dtmtcars %>% group_by(cyl,vs,am) %>% summarise_all(sum, na.rm = TRUE), # Large R overhead
data.table = dtmtcars[, lapply(.SD, sum, na.rm = TRUE), by = c("cyl","vs","am")], # Large R overhead
collap = collap(dtmtcars, ~ cyl + vs + am, fsum), # Now this is still quite efficient
fast_fun = fsum(dtmtcars, GRP(dtmtcars, ~ cyl + vs + am), use.g.names = FALSE)) # And this is nearly the speed of a full C++ implementation
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 1591.766 1798.602 1931.9456 1925.7830 2024.8495 3150.956 100 c
# data.table 2880.976 2998.340 3238.6844 3190.2265 3397.0615 4434.366 100 d
# collap 166.897 204.158 240.7019 246.3290 268.6415 338.256 100 b
# fast_fun 85.680 105.315 129.8584 128.2965 149.7165 263.733 100 a
In general, the smaller the problem, the greater the advantage collapse has over other packages, because its R overhead (i.e. the R code executed before the actual C function doing the hard work is called) is carefully minimized. Most users working on typical datasets (< 1 million obs.) will find that their code runs significantly faster when implemented in collapse compared to other solutions.
collapse also provides an ensemble of functions to perform common data transformations extremely efficiently and in a user-friendly way. I start off this section by briefly introducing two apply functions I thought were missing in the base R ensemble, and then quickly move to the more involved functions that carry out extremely fast grouped transformations.
dapply is an efficient apply command for matrices and data.frames. It can be used to apply functions to rows or (by default) columns of matrices or data.frames, and by default returns objects of the same type and with the same attributes.
dapply(mtcars, median)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 19.200 6.000 196.300 123.000 3.695 3.325 17.710 0.000 0.000 4.000 2.000
dapply(mtcars, median, MARGIN = 1)
# Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
# 4.000 4.000 4.000 3.215 3.440
# Valiant Duster 360 Merc 240D Merc 230 Merc 280
# 3.460 4.000 4.000 4.000 4.000
# Merc 280C Merc 450SE Merc 450SL Merc 450SLC Cadillac Fleetwood
# 4.000 4.070 3.730 3.780 5.250
# Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
# 5.424 5.345 4.000 4.000 4.000
# Toyota Corona Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird
# 3.700 3.520 3.435 4.000 3.845
# Fiat X1-9 Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino
# 4.000 4.430 4.000 5.000 6.000
# Maserati Bora Volvo 142E
# 8.000 4.000
dapply(mtcars, quantile)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 0% 10.400 4 71.100 52.0 2.760 1.51300 14.5000 0 0 3 1
# 25% 15.425 4 120.825 96.5 3.080 2.58125 16.8925 0 0 3 2
# 50% 19.200 6 196.300 123.0 3.695 3.32500 17.7100 0 0 4 2
# 75% 22.800 8 326.000 180.0 3.920 3.61000 18.9000 1 1 4 4
# 100% 33.900 8 472.000 335.0 4.930 5.42400 22.9000 1 1 5 8
head(dapply(mtcars, quantile, MARGIN = 1))
# 0% 25% 50% 75% 100%
# Mazda RX4 0 3.2600 4.000 18.730 160
# Mazda RX4 Wag 0 3.3875 4.000 19.010 160
# Datsun 710 1 1.6600 4.000 20.705 108
# Hornet 4 Drive 0 2.0000 3.215 20.420 258
# Hornet Sportabout 0 2.5000 3.440 17.860 360
# Valiant 0 1.8800 3.460 19.160 225
head(dapply(mtcars, log)) # This is considerably more efficient than log(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear
# Mazda RX4 3.044522 1.791759 5.075174 4.700480 1.360977 0.9631743 2.800933 -Inf 0 1.386294
# Mazda RX4 Wag 3.044522 1.791759 5.075174 4.700480 1.360977 1.0560527 2.834389 -Inf 0 1.386294
# Datsun 710 3.126761 1.386294 4.682131 4.532599 1.348073 0.8415672 2.923699 0 0 1.386294
# Hornet 4 Drive 3.063391 1.791759 5.552960 4.700480 1.124930 1.1678274 2.967333 0 -Inf 1.098612
# Hornet Sportabout 2.928524 2.079442 5.886104 5.164786 1.147402 1.2354715 2.834389 -Inf -Inf 1.098612
# Valiant 2.895912 1.791759 5.416100 4.653960 1.015231 1.2412686 3.006672 0 -Inf 1.098612
# carb
# Mazda RX4 1.3862944
# Mazda RX4 Wag 1.3862944
# Datsun 710 0.0000000
# Hornet 4 Drive 0.0000000
# Hornet Sportabout 0.6931472
# Valiant 0.0000000
As the examples above show, dapply preserves the data structure and attributes of the input.
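For instance, applying a function to a data.table or a time-series matrix returns an object of the same class (a small check; data.table was attached above):
class(dapply(qDT(mtcars), log))     # Still a data.table
class(dapply(EuStockMarkets, log))  # Still a (multivariate) time series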
dapply also delivers seamless conversions, i.e. you can apply functions to data.frame rows or columns and return a matrix, or vice versa:
identical(log(m), dapply(mtcars, log, return = "matrix"))
# [1] TRUE
identical(dapply(mtcars, log), dapply(m, log, return = "data.frame"))
# [1] TRUE
I do not provide benchmarks here, but dapply is also very efficient. On data.frames, the performance is comparable to lapply, and dapply is about 2x faster than apply for row- or column-wise operations on matrices. The most important feature for me however is that it does not change the structure of the data at all: all attributes are preserved, so you can use dapply on a data.table, grouped tibble, or on a time-series matrix and get a transformed object of the same class back (unless the result is a scalar, in which case dapply by default simplifies and returns a vector).
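Readers who want to verify the claim about matrices can run a quick comparison themselves, e.g. using the matrix m created earlier with qM(mtcars) (timings vary by machine):
microbenchmark(dapply(m, sum, MARGIN = 1), apply(m, 1, sum))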
BY is a generalization of dapply for grouped computations using functions that are not part of the Fast Statistical Functions introduced above. It fundamentally is a reimplementation of the lapply(split(x, g), FUN, ...) computing paradigm in base R, but substantially faster and more versatile than functions like tapply, by or aggregate. It is however not faster than dplyr, which remains the best solution for larger grouped computations on data.frames requiring split-apply-combine computing.
BY is an S3 generic with methods for vector, matrix, data.frame and grouped_df. It also supports the same grouping (g) inputs as the Fast Statistical Functions (grouping vectors, factors, lists or GRP objects). Below I demonstrate the use of BY on vectors, matrices and data.frames.
v <- iris$Sepal.Length # A numeric vector
f <- iris$Species # A factor
## default vector method
BY(v, f, sum) # Sum by species, about 2x faster than tapply(v, f, sum)
# setosa versicolor virginica
# 250.3 296.8 329.4
BY(v, f, quantile) # Species quantiles: by default stacked
# setosa.0% setosa.25% setosa.50% setosa.75% setosa.100% versicolor.0%
# 4.300 4.800 5.000 5.200 5.800 4.900
# versicolor.25% versicolor.50% versicolor.75% versicolor.100% virginica.0% virginica.25%
# 5.600 5.900 6.300 7.000 4.900 6.225
# virginica.50% virginica.75% virginica.100%
# 6.500 6.900 7.900
BY(v, f, quantile, expand.wide = TRUE) # Wide format
# 0% 25% 50% 75% 100%
# setosa 4.3 4.800 5.0 5.2 5.8
# versicolor 4.9 5.600 5.9 6.3 7.0
# virginica 4.9 6.225 6.5 6.9 7.9
## matrix method
miris <- qM(num_vars(iris))
BY(miris, f, sum) # Also returns as matrix
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa 250.3 171.4 73.1 12.3
# versicolor 296.8 138.5 213.0 66.3
# virginica 329.4 148.7 277.6 101.3
head(BY(miris, f, quantile))
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa.0% 4.3 2.300 1.000 0.1
# setosa.25% 4.8 3.200 1.400 0.2
# setosa.50% 5.0 3.400 1.500 0.2
# setosa.75% 5.2 3.675 1.575 0.3
# setosa.100% 5.8 4.400 1.900 0.6
# versicolor.0% 4.9 2.000 3.000 1.0
BY(miris, f, quantile, expand.wide = TRUE)[,1:5]
# Sepal.Length.0% Sepal.Length.25% Sepal.Length.50% Sepal.Length.75% Sepal.Length.100%
# setosa 4.3 4.800 5.0 5.2 5.8
# versicolor 4.9 5.600 5.9 6.3 7.0
# virginica 4.9 6.225 6.5 6.9 7.9
BY(miris, f, quantile, expand.wide = TRUE, return = "list")[1:2] # list of matrices
# $Sepal.Length
# 0% 25% 50% 75% 100%
# setosa 4.3 4.800 5.0 5.2 5.8
# versicolor 4.9 5.600 5.9 6.3 7.0
# virginica 4.9 6.225 6.5 6.9 7.9
#
# $Sepal.Width
# 0% 25% 50% 75% 100%
# setosa 2.3 3.200 3.4 3.675 4.4
# versicolor 2.0 2.525 2.8 3.000 3.4
# virginica 2.2 2.800 3.0 3.175 3.8
## data.frame method
BY(num_vars(iris), f, sum) # Also returns a data.frame etc...
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa 250.3 171.4 73.1 12.3
# versicolor 296.8 138.5 213.0 66.3
# virginica 329.4 148.7 277.6 101.3
## Conversions
identical(BY(num_vars(iris), f, sum), BY(miris, f, sum, return = "data.frame"))
# [1] TRUE
identical(BY(miris, f, sum), BY(num_vars(iris), f, sum, return = "matrix"))
# [1] TRUE
TRA is an S3 generic that efficiently transforms data by either (column-wise) replacing data values with supplied statistics or sweeping the statistics out of the data. The 8 operations supported by TRA are:
1 - “replace_fill” : replace and overwrite missing values (same as dplyr::mutate)
2 - “replace” : replace but preserve missing values
3 - “-” : subtract (center)
4 - “-+” : subtract group-statistics but add average of group statistics
5 - “/” : divide (scale)
6 - “%” : compute percentages (divide and multiply by 100)
7 - “+” : add
8 - "*" : multiply
TRA is also incorporated as an argument to all Fast Statistical Functions. Therefore it is only really necessary and advisable to use the TRA() function if both the aggregate statistics and the transformed data are required, or to sweep out statistics otherwise obtained (e.g. regression or correlation coefficients). Below I compute the column means of the iris matrix obtained above and use them to demean that matrix.
# Note: All examples below generalize to vectors or data.frames
stats <- fmean(miris) # Saving stats
head(TRA(miris, stats, "-"), 3) # Centering. Same as sweep(miris, 2, stats, "-")
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] -0.7433333 0.44266667 -2.358 -0.9993333
# [2,] -0.9433333 -0.05733333 -2.358 -0.9993333
# [3,] -1.1433333 0.14266667 -2.458 -0.9993333
The code below shows 3 identical ways to center data in the collapse package. For the very common centering and averaging tasks, collapse supplies 2 special functions fwithin
and fbetween
(discussed in section 4.5) which are slightly faster and more memory efficient than fmean(..., TRA = "-")
and fmean(..., TRA = "replace")
.
# 3 ways of centering data
all_identical(TRA(miris, fmean(miris), "-"),
fmean(miris, TRA = "-"), # better for any operation if the stats are not needed
fwithin(miris)) # fastest, fwithin is discussed in section 4.5
# [1] TRUE
# Simple replacing [same as fmean(miris, TRA = "replace") or fbetween(miris)]
head(TRA(miris, fmean(miris), "replace"), 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.843333 3.057333 3.758 1.199333
# [2,] 5.843333 3.057333 3.758 1.199333
# [3,] 5.843333 3.057333 3.758 1.199333
# Simple scaling [same as fsd(miris, TRA = "/")]
head(TRA(miris, fsd(miris), "/"), 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 6.158928 8.029986 0.7930671 0.2623854
# [2,] 5.917402 6.882845 0.7930671 0.2623854
# [3,] 5.675875 7.341701 0.7364195 0.2623854
All of the above is functionality also offered by base::sweep
, although TRA
is about 4x faster. The big advantage of TRA
is that it also supports grouped operations:
# Grouped centering [same as fmean(miris, f, TRA = "-") or fwithin(m, f)]
head(TRA(miris, fmean(miris, f), "-", f), 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 0.094 0.072 -0.062 -0.046
# [2,] -0.106 -0.428 -0.062 -0.046
# [3,] -0.306 -0.228 -0.162 -0.046
# Grouped replacing [same as fmean(m, f, TRA = "replace") or fbetween(m, f)]
head(TRA(miris, fmean(miris, f), "replace", f), 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.006 3.428 1.462 0.246
# [2,] 5.006 3.428 1.462 0.246
# [3,] 5.006 3.428 1.462 0.246
# Groupwise percentages [same as fsum(m, f, TRA = "%")]
head(TRA(miris, fsum(miris, f), "%", f), 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 2.037555 2.042007 1.915185 1.626016
# [2,] 1.957651 1.750292 1.915185 1.626016
# [3,] 1.877747 1.866978 1.778386 1.626016
A somewhat special operation performed by TRA
is the grouped centering on the overall statistic (which for the mean is also performed more efficiently by fwithin
):
# Grouped centering on the overall mean [same as fmean(m, f, TRA = "-+") or fwithin(m, f, add.global.mean = TRUE)]
head(TRA(miris, fmean(miris, f), "-+", f), 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.937333 3.129333 3.696 1.153333
# [2,] 5.737333 2.629333 3.696 1.153333
# [3,] 5.537333 2.829333 3.596 1.153333
head(TRA(TRA(miris, fmean(miris, f), "-", f), fmean(miris), "+"), 3) # Same thing done manually!
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.937333 3.129333 3.696 1.153333
# [2,] 5.737333 2.629333 3.696 1.153333
# [3,] 5.537333 2.829333 3.596 1.153333
# This group-centers data on the overall median!
head(fmedian(miris, f, "-+"), 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.9 3.166667 3.7 1.166667
# [2,] 5.7 2.666667 3.7 1.166667
# [3,] 5.5 2.866667 3.6 1.166667
This is the within transformation also computed by qsu
discussed in section 1. Its utility in the case of grouped centering is demonstrated visually in section 4.5.
The function fscale
can be used to efficiently standardize (i.e. scale and center) data using a numerically stable online algorithm. Its structure is the same as that of the Fast Statistical Functions. The standardization-operator STD
also exists as a wrapper around fscale
. The difference is that by default STD
adds a prefix to standardized variables and also provides an enhanced method for data.frames (more about operators in the next section).
# fscale doesn't rename columns
head(fscale(mtcars),2)
# mpg cyl disp hp drat wt qsec vs
# Mazda RX4 0.1508848 -0.1049878 -0.5706198 -0.5350928 0.5675137 -0.6103996 -0.7771651 -0.8680278
# Mazda RX4 Wag 0.1508848 -0.1049878 -0.5706198 -0.5350928 0.5675137 -0.3497853 -0.4637808 -0.8680278
# am gear carb
# Mazda RX4 1.189901 0.4235542 0.7352031
# Mazda RX4 Wag 1.189901 0.4235542 0.7352031
# By default adds a prefix
head(STD(mtcars),2)
# STD.mpg STD.cyl STD.disp STD.hp STD.drat STD.wt STD.qsec STD.vs
# Mazda RX4 0.1508848 -0.1049878 -0.5706198 -0.5350928 0.5675137 -0.6103996 -0.7771651 -0.8680278
# Mazda RX4 Wag 0.1508848 -0.1049878 -0.5706198 -0.5350928 0.5675137 -0.3497853 -0.4637808 -0.8680278
# STD.am STD.gear STD.carb
# Mazda RX4 1.189901 0.4235542 0.7352031
# Mazda RX4 Wag 1.189901 0.4235542 0.7352031
# See that it works
qsu(STD(mtcars))
# N Mean SD Min Max
# STD.mpg 32 -0 1 -1.61 2.29
# STD.cyl 32 0 1 -1.22 1.01
# STD.disp 32 -0 1 -1.29 1.95
# STD.hp 32 0 1 -1.38 2.75
# STD.drat 32 -0 1 -1.56 2.49
# STD.wt 32 -0 1 -1.74 2.26
# STD.qsec 32 0 1 -1.87 2.83
# STD.vs 32 0 1 -0.87 1.12
# STD.am 32 -0 1 -0.81 1.19
# STD.gear 32 0 1 -0.93 1.78
# STD.carb 32 -0 1 -1.12 3.21
Scaling with fscale / STD
can also be done groupwise and / or weighted. For example, the Groningen Growth and Development Center 10-Sector Database provides annual series of value added in local currency and persons employed for 10 broad sectors in several African, Asian, and Latin American countries.
head(GGDC10S)
# # A tibble: 6 x 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s~ VA 1965 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
If we wanted to correlate this data across countries and sectors, it needs to be standardized:
# Standardizing Sectors by Variable and Country
STD_GGDC10S <- STD(GGDC10S, ~ Variable + Country, cols = 6:16)
head(STD_GGDC10S)
# # A tibble: 6 x 13
# Variable Country STD.AGR STD.MIN STD.MAN STD.PU STD.CON STD.WRT STD.TRA STD.FIRE STD.GOV STD.OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596
# 6 VA BWA -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596
# # ... with 1 more variable: STD.SUM <dbl>
# Correlating Standardized Value-Added across countries
pwcor(num_vars(filter(STD_GGDC10S, Variable == "VA")))
# STD.AGR STD.MIN STD.MAN STD.PU STD.CON STD.WRT STD.TRA STD.FIRE STD.GOV STD.OTH STD.SUM
# STD.AGR 1 .88 .93 .88 .89 .90 .90 .86 .93 .88 .90
# STD.MIN .88 1 .86 .84 .85 .85 .84 .83 .88 .84 .86
# STD.MAN .93 .86 1 .95 .96 .97 .98 .95 .98 .97 .98
# STD.PU .88 .84 .95 1 .95 .96 .96 .95 .96 .96 .97
# STD.CON .89 .85 .96 .95 1 .98 .98 .97 .98 .97 .98
# STD.WRT .90 .85 .97 .96 .98 1 .99 .98 .99 .99 1.00
# STD.TRA .90 .84 .98 .96 .98 .99 1 .98 .99 .99 .99
# STD.FIRE .86 .83 .95 .95 .97 .98 .98 1 .98 .98 .98
# STD.GOV .93 .88 .98 .96 .98 .99 .99 .98 1 .99 1.00
# STD.OTH .88 .84 .97 .96 .97 .99 .99 .98 .99 1 .99
# STD.SUM .90 .86 .98 .97 .98 1.00 .99 .98 1.00 .99 1
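The weighted case is not demonstrated above. Below is a minimal sketch of grouped and weighted scaling, using mtcars and (purely for illustration) the wt column as weights:
# Grouped and weighted standardization of a vector
head(fscale(mtcars$mpg, mtcars$cyl, mtcars$wt))
# Same via the operator, passing groups and weights as formulas (as in the benchmarks further below)
head(STD(mtcars, mpg ~ cyl, ~ wt), 2)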
As a slightly faster alternative to fmean(x, g, w, TRA = "-"/"-+")
or fmean(x, g, w, TRA = "replace"/"replace_fill")
, fwithin
and fbetween
can be used to perform common (grouped, weighted) centering and averaging tasks (also known as between- and within- transformations in the language of panel-data econometrics, thus the names). The operators W
and B
also exist.
## Simple centering and averaging
head(fbetween(mtcars$mpg))
# [1] 20.09062 20.09062 20.09062 20.09062 20.09062 20.09062
head(fwithin(mtcars$mpg))
# [1] 0.909375 0.909375 2.709375 1.309375 -1.390625 -1.990625
all.equal(fbetween(mtcars) + fwithin(mtcars), mtcars)
# [1] TRUE
## Groupwise centering and averaging
head(fbetween(mtcars$mpg, mtcars$cyl))
# [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286
head(fwithin(mtcars$mpg, mtcars$cyl))
# [1] 1.257143 1.257143 -3.863636 1.657143 3.600000 -1.642857
all.equal(fbetween(mtcars, mtcars$cyl) + fwithin(mtcars, mtcars$cyl), mtcars)
# [1] TRUE
To demonstrate more clearly the utility of the operators, which exist for all fast transformation and time-series functions, the code below implements the task of demeaning 4 series by country and saving the country-id using the within-operator W
as opposed to fwithin
which requires all input to be passed externally like the Fast Statistical Functions.
head(W(wlddev, ~ iso3c, cols = 9:12)) # Center the 4 series in this dataset by country
# iso3c W.PCGDP W.LIFEEX W.GINI W.ODA
# 1 AFG NA -15.59016 NA -1236633448
# 2 AFG NA -15.14016 NA -1117723448
# 3 AFG NA -14.69716 NA -1236193448
# 4 AFG NA -14.25816 NA -1114623448
# 5 AFG NA -13.82216 NA -1048593448
# 6 AFG NA -13.38716 NA -980823448
head(add_vars(get_vars(wlddev, "iso3c"), # Same thing done manually using fwithin...
add_stub(fwithin(get_vars(wlddev, 9:12), wlddev$iso3c), "W.")))
# iso3c W.PCGDP W.LIFEEX W.GINI W.ODA
# 1 AFG NA -15.59016 NA -1236633448
# 2 AFG NA -15.14016 NA -1117723448
# 3 AFG NA -14.69716 NA -1236193448
# 4 AFG NA -14.25816 NA -1114623448
# 5 AFG NA -13.82216 NA -1048593448
# 6 AFG NA -13.38716 NA -980823448
It is also possible to drop the ids in W
using the argument keep.by = FALSE
. fbetween / B
and fwithin / W
each have one additional computational option:
# This replaces missing values with the group-mean: Same as fmean(x, g, TRA = "replace_fill")
head(B(wlddev, ~ iso3c, cols = 9:12, fill = TRUE))
# iso3c B.PCGDP B.LIFEEX B.GINI B.ODA
# 1 AFG 482.1631 47.88216 NA 1351073448
# 2 AFG 482.1631 47.88216 NA 1351073448
# 3 AFG 482.1631 47.88216 NA 1351073448
# 4 AFG 482.1631 47.88216 NA 1351073448
# 5 AFG 482.1631 47.88216 NA 1351073448
# 6 AFG 482.1631 47.88216 NA 1351073448
# This adds back the global mean after subtracting out group means: Same as fmean(x, g, TRA = "-+")
head(W(wlddev, ~ iso3c, cols = 9:12, add.global.mean = TRUE))
# iso3c W.PCGDP W.LIFEEX W.GINI W.ODA
# 1 AFG NA 48.25093 NA -807886980
# 2 AFG NA 48.70093 NA -688976980
# 3 AFG NA 49.14393 NA -807446980
# 4 AFG NA 49.58293 NA -685876980
# 5 AFG NA 50.01893 NA -619846980
# 6 AFG NA 50.45393 NA -552076980
# Note: This is not just slightly faster than fmean(x, g, TRA = "-+"), but if weights are used, fmean(x, g, w, "-+")
# gives a wrong result: It subtracts weighted group means but then centers on the frequency-weighted average of those group means,
# whereas fwithin(x, g, w, add.global.mean = TRUE) will also center on the properly weighted overall mean.
# Visual demonstration of centering on the global mean vs. simple centering
oldpar <- par(mfrow = c(1,3))
plot(iris[1:2], col = iris$Species, main = "Raw Data") # Raw data
plot(W(iris, ~ Species)[2:3], col = iris$Species, main = "Simple Centering") # Simple centering
plot(W(iris, ~ Species, add.global.mean = TRUE)[2:3], col = iris$Species, # Centering on overall mean: Preserves level of data
main = "add.global.mean")
Another great utility of operators is that they can be employed in regression formulas in a manner that is both very efficient and pleasing to the eye. Below I demonstrate the use of W
and B
to efficiently run fixed-effects regressions with lm
.
# When using operators in formulas, we need to remove missing values beforehand to obtain the same results as a Fixed-Effects package
data <- na.omit(get_vars(wlddev, c("iso3c","year","PCGDP","LIFEEX")))
# classical lm() -> iso3c is a factor, creates a matrix of 200+ country dummies.
coef(lm(PCGDP ~ LIFEEX + iso3c, data))[1:2]
# (Intercept) LIFEEX
# -1684.4057 363.2881
# Centering each variable individually
coef(lm(W(PCGDP,iso3c) ~ W(LIFEEX,iso3c), data))
# (Intercept) W(LIFEEX, iso3c)
# 9.790087e-13 3.632881e+02
# Centering the data
coef(lm(W.PCGDP ~ W.LIFEEX, W(data, PCGDP + LIFEEX ~ iso3c)))
# (Intercept) W.LIFEEX
# 9.790087e-13 3.632881e+02
# Adding the overall mean back to the data only changes the intercept
coef(lm(W.PCGDP ~ W.LIFEEX, W(data, PCGDP + LIFEEX ~ iso3c, add.global.mean = TRUE)))
# (Intercept) W.LIFEEX
# -13176.9192 363.2881
# Procedure suggested by Mundlak (1978) - controlling for group averages instead of demeaning
coef(lm(PCGDP ~ LIFEEX + B(LIFEEX,iso3c), data))
# (Intercept) LIFEEX B(LIFEEX, iso3c)
# -49424.6522 363.2881 560.0116
In general I recommend calling the full functions (i.e. fwithin
or fscale
etc.) for programming since they are a bit more efficient on the R-side of things and require all input in terms of data. For all other purposes I find the operators are more convenient. It is important to note that the operators can do everything the functions can do (i.e. you can also pass grouping vectors or GRP objects to them). They are just simple wrappers that in the data.frame method add 4 additional features:
1 - Formula input to the by argument, i.e. W(mtcars, ~ cyl) or W(mtcars, mpg ~ cyl)
2 - Preservation of the grouping columns (cyl in the above example) when passed in a formula (default keep.by = TRUE)
3 - Column subsetting via the cols argument (i.e. W(mtcars, ~ cyl, cols = 4:7) is the same as W(mtcars, hp + drat + wt + qsec ~ cyl))
4 - Renaming of transformed columns by adding a prefix (default stub = "W.")
That’s it about operators! If you like this kind of parsimony use them, otherwise leave it.
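To make these conveniences concrete, here is a brief sketch using mtcars (not part of the original examples):
# Formula input to 'by' with a 'cols' subset is equivalent to a two-sided formula
identical(W(mtcars, ~ cyl, cols = 4:7), W(mtcars, hp + drat + wt + qsec ~ cyl))
# The grouping column is kept by default; keep.by = FALSE drops it
head(W(mtcars, ~ cyl, cols = 4:7, keep.by = FALSE), 2)
# Operators also accept plain grouping vectors, just like the full functions
head(W(mtcars$mpg, mtcars$cyl))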
Sometimes simple centering is not enough, for example if a linear model with multiple levels of fixed-effects needs to be estimated, potentially involving interactions with continuous covariates. For these purposes fHDwithin / HDW
and fHDbetween / HDB
were created as efficient multi-purpose functions for linear prediction and partialling out. They operate by splitting complex regression problems into 2 parts: factors and factor-interactions are projected out using lfe::demeanlist
, an efficient C
routine for centering vectors on multiple factors, whereas continuous variables are dealt with using a standard qr
decomposition in base R. The examples below show the use of the HDW
operator in manually solving a regression problem with country and time fixed effects.
data$year <- qF(data$year) # the country code (iso3c) is already a factor
# classical lm() -> creates a matrix of 196 country dummies and 56 year dummies
coef(lm(PCGDP ~ LIFEEX + iso3c + year, data))[1:2]
# (Intercept) LIFEEX
# 45233.6452 -317.9238
# Centering each variable individually
coef(lm(HDW(PCGDP, list(iso3c, year)) ~ HDW(LIFEEX, list(iso3c, year)), data))
# (Intercept) HDW(LIFEEX, list(iso3c, year))
# -3.087522e-14 -3.179238e+02
# Centering the entire data
coef(lm(HDW.PCGDP ~ HDW.LIFEEX, HDW(data, PCGDP + LIFEEX ~ iso3c + year)))
# (Intercept) HDW.LIFEEX
# -3.087522e-14 -3.179238e+02
# Procedure suggested by Mundlak (1978) - controlling for averages instead of demeaning
coef(lm(PCGDP ~ LIFEEX + HDB(LIFEEX, list(iso3c, year)), data))
# (Intercept) LIFEEX HDB(LIFEEX, list(iso3c, year))
# -45858.1471 -317.9238 1186.1225
We may wish to test whether including time fixed-effects in the above regression actually impacts the fit. This can be done with the fast F-test:
# The syntax is fFtest(y, exc, X, full.df = TRUE). 'exc' are exclusion restrictions.
# full.df = TRUE means count degrees of freedom in the same way as if dummies were created
fFtest(data$PCGDP, data$year, get_vars(data, c("LIFEEX","iso3c")))
# R-Sq. DF1 DF2 F-Stat. P-Value
# Full Model 0.896 253 8144 277.440 0.000
# Restricted Model 0.877 197 8200 296.420 0.000
# Exclusion Rest. 0.019 56 8144 26.817 0.000
The test shows that the time fixed-effects (accounted for like year dummies) are jointly significant.
One can also use fHDbetween / HDB
and fHDwithin / HDW
to project out interactions and continuous covariates. The interaction feature of HDW
and HDB
is still a bit experimental as lfe::demeanlist
is not very fast at it.
wlddev$year <- as.numeric(wlddev$year)
# classical lm() -> full country-year interaction, -> 200+ country dummies, 200+ trends, year and ODA
coef(lm(PCGDP ~ LIFEEX + iso3c*year + ODA, wlddev))[1:2]
# (Intercept) LIFEEX
# -7.258331e+05 7.174007e+00
# Same using HDW -> However lfe::demeanlist is not nearly as fast on interactions...
coef(lm(HDW.PCGDP ~ HDW.LIFEEX, HDW(wlddev, PCGDP + LIFEEX ~ iso3c*year + ODA)))
# (Intercept) HDW.LIFEEX
# -1.511946e-05 7.171590e+00
# example of a simple continuous problem
head(HDW(iris[1:2], iris[3:4]))
# HDW.Sepal.Length HDW.Sepal.Width
# 1 0.21483967 0.2001352
# 2 0.01483967 -0.2998648
# 3 -0.13098262 -0.1255786
# 4 -0.33933805 -0.1741510
# 5 0.11483967 0.3001352
# 6 0.41621663 0.6044681
# May include factors..
head(HDW(iris[1:2], iris[3:5]))
# HDW.Sepal.Length HDW.Sepal.Width
# 1 0.14989286 0.1102684
# 2 -0.05010714 -0.3897316
# 3 -0.15951256 -0.1742640
# 4 -0.44070173 -0.3051992
# 5 0.04989286 0.2102684
# 6 0.17930818 0.3391766
I am hoping that the lfe package author Simen Gaure will at some point improve the part of the algorithm projecting out interactions. Otherwise I will code something myself to improve this feature. There have also been several packages published recently to estimate heterogeneous-slopes models. I might take some time to look at those implementations and update HDW
and HDB
at some point.
Below I provide benchmarks for some very common data transformation tasks, again comparing collapse to dplyr and data.table:
# The average group size is 10, there are about 100000 groups
GRP(testdat, ~ g1 + g2)
# collapse grouping object of length 1000000 with 99998 ordered groups
#
# Call: GRP.default(X = testdat, by = ~g1 + g2), unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 8 10 10 12 26
#
# Groups with sizes:
# 1.1 1.2 1.3 1.4 1.5 1.6
# 7 13 10 5 16 18
# ---
# 1000.95 1000.96 1000.97 1000.98 1000.99 1000.100
# 10 8 11 14 18 7
# get indices of grouping columns
ind <- get_vars(testdat, c("g1","g2"), "indices")
# Centering
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) x - mean.default(x, na.rm = TRUE)))
# user system elapsed
# 8.31 0.02 8.35
system.time(testdat[, lapply(.SD, function(x) x - mean(x, na.rm = TRUE)), keyby = c("g1","g2")])
# user system elapsed
# 10.71 0.01 10.94
system.time(W(testdat, ~ g1 + g2))
# user system elapsed
# 0.21 0.00 0.22
# Weighted Centering
# Can't easily be done in dplyr..
system.time(testdat[, lapply(.SD, function(x) x - weighted.mean(x, w, na.rm = TRUE)), keyby = c("g1","g2")])
# user system elapsed
# 13.78 0.00 13.91
system.time(W(testdat, ~ g1 + g2, ~ w))
# user system elapsed
# 0.21 0.00 0.22
# Centering on the overall mean
# Can't easily be done in dplyr or data.table.
system.time(W(testdat, ~ g1 + g2, add.global.mean = TRUE)) # Ordinary
# user system elapsed
# 0.21 0.00 0.22
system.time(W(testdat, ~ g1 + g2, ~ w, add.global.mean = TRUE)) # Weighted
# user system elapsed
# 0.21 0.00 0.22
# Centering on both grouping variables simultaneously
# Can't be done in dplyr or data.table at all!
system.time(HDW(testdat, ~ qF(g1) + qF(g2), variable.wise = TRUE)) # Ordinary
# user system elapsed
# 0.82 0.02 0.82
system.time(HDW(testdat, ~ qF(g1) + qF(g2), w = w, variable.wise = TRUE)) # Weighted
# user system elapsed
# 0.95 0.03 0.99
# Proportions
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) x/sum(x, na.rm = TRUE)))
# user system elapsed
# 4.70 0.00 4.71
system.time(testdat[, lapply(.SD, function(x) x/sum(x, na.rm = TRUE)), keyby = c("g1","g2")])
# user system elapsed
# 2.10 0.00 2.09
system.time(fsum(get_vars(testdat, -ind), get_vars(testdat, ind), "/"))
# user system elapsed
# 0.17 0.00 0.17
# Scaling
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) x/sd(x, na.rm = TRUE)))
# user system elapsed
# 19.48 0.06 19.58
system.time(testdat[, lapply(.SD, function(x) x/sd(x, na.rm = TRUE)), keyby = c("g1","g2")])
# user system elapsed
# 15.81 0.05 15.83
system.time(fsd(get_vars(testdat, -ind), get_vars(testdat, ind), TRA = "/"))
# user system elapsed
# 0.31 0.00 0.32
system.time(fsd(get_vars(testdat, -ind), get_vars(testdat, ind), w, "/")) # Weighted scaling: would need a weighted sd function to do this in dplyr or data.table
# user system elapsed
# 0.29 0.00 0.28
# Scaling and centering (i.e. standardizing)
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(function(x) (x - mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)))
# user system elapsed
# 23.89 0.03 24.00
system.time(testdat[, lapply(.SD, function(x) (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)), keyby = c("g1","g2")])
# user system elapsed
# 27.25 0.06 27.57
system.time(STD(testdat, ~ g1 + g2))
# user system elapsed
# 0.33 0.00 0.33
system.time(STD(testdat, ~ g1 + g2, ~ w)) # Weighted standardizing: Also difficult to do in dplyr or data.table
# user system elapsed
# 0.33 0.00 0.33
# Replacing data with any statistic, here the sum:
system.time(testdat %>% group_by(g1,g2) %>% mutate_all(sum, na.rm = TRUE))
# user system elapsed
# 0.85 0.03 0.88
system.time(testdat[, setdiff(names(testdat), c("g1","g2")) := lapply(.SD, sum, na.rm = TRUE), keyby = c("g1","g2")])
# user system elapsed
# 1.36 0.05 1.30
system.time(fsum(get_vars(testdat, -ind), get_vars(testdat, ind), "replace_fill")) # dplyr and data.table also fill missing values.
# user system elapsed
# 0.07 0.00 0.07
system.time(fsum(get_vars(testdat, -ind), get_vars(testdat, ind), "replace")) # This preserves missing values, and is not easily implemented in dplyr or data.table
# user system elapsed
# 0.07 0.00 0.07
The message is clear: collapse outperforms dplyr and data.table both in scope and speed when it comes to grouped and / or weighted transformations of data. This capacity of collapse should make it attractive to econometricians and people programming with complex panel-data. In the ‘collapse and plm’ vignette I provide a programming example by implementing a more general case of the Hausman and Taylor (1981) estimator with two levels of fixed effects, as well as further benchmarks.
collapse also presents some essential contributions in the time-series domain, particularly regarding panel-data and efficient, secure computations on unordered time-dependent vectors and panel-series.
Starting with data exploration and improved access to panel data, psmat
is an S3 generic to efficiently obtain matrices or 3D-arrays from panel data.
mts <- psmat(wlddev, PCGDP ~ iso3c, ~ year)
str(mts)
# 'psmat' num [1:216, 1:59] NA NA NA NA NA ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# - attr(*, "transpose")= logi FALSE
plot(mts, main = vlabels(wlddev)[9], xlab = "Year")
Passing a data.frame
of panel-series to psmat
generates a 3D array:
# Get panel-series array
psar <- psmat(wlddev, ~ iso3c, ~ year, cols = 9:12)
str(psar)
# 'psmat' num [1:216, 1:59, 1:4] NA NA NA NA NA ...
# - attr(*, "dimnames")=List of 3
# ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..$ : chr [1:4] "PCGDP" "LIFEEX" "GINI" "ODA"
# - attr(*, "transpose")= logi FALSE
plot(psar, legend = TRUE)
# Plot array of Panel-Series aggregated by region:
plot(psmat(collap(wlddev, ~region+year, cols = 9:12),
~region, ~year), legend = TRUE,
labs = vlabels(wlddev)[9:12])
psmat
can also output a list of panel-series matrices, which can be used amongst other things to reshape the data with unlist2d
(discussed in more detail in the List-Processing section).
# This gives list of ps-matrices
psml <- psmat(wlddev, ~ iso3c, ~ year, 9:12, array = FALSE)
str(psml, give.attr = FALSE)
# List of 4
# $ PCGDP : 'psmat' num [1:216, 1:59] NA NA NA NA NA ...
# $ LIFEEX: 'psmat' num [1:216, 1:59] 65.7 32.3 33.3 62.3 NA ...
# $ GINI : 'psmat' num [1:216, 1:59] NA NA NA NA NA NA NA NA NA NA ...
# $ ODA : 'psmat' num [1:216, 1:59] NA 114440000 -380000 NA NA ...
# Using unlist2d, can generate a data.frame
head(unlist2d(psml, idcols = "Variable", row.names = "Country"))[1:10]
# Variable Country 1960 1961 1962 1963 1964 1965 1966 1967
# 1 PCGDP ABW NA NA NA NA NA NA NA NA
# 2 PCGDP AFG NA NA NA NA NA NA NA NA
# 3 PCGDP AGO NA NA NA NA NA NA NA NA
# 4 PCGDP ALB NA NA NA NA NA NA NA NA
# 5 PCGDP AND NA NA NA NA NA NA NA NA
# 6 PCGDP ARE NA NA NA NA NA NA NA NA
The correlation structure of panel-data can also be explored with psacf
, pspacf
and psccf
. These functions are exact analogues to stats::acf
, stats::pacf
and stats::ccf
. They use fscale
to group-scale panel-data by the panel-id provided, and then compute the covariance of a sequence of panel-lags (generated with flag
discussed below) with the group-scaled level-series, dividing by the variance of the group-scaled level series. The Partial-ACF is generated from the ACF using a Yule-Walker decomposition (as in stats::pacf
).
# Panel-Cross-Correlation function of GDP per Capita and Life-Expectancy
psccf(wlddev$PCGDP, wlddev$LIFEEX, wlddev$iso3c, wlddev$year)
# Multivariate Panel-auto and cross-correlation function of 3 variables:
psacf(wlddev, PCGDP + LIFEEX + ODA ~ iso3c, ~year)
flag
and the corresponding lag- and lead- operators L
and F
are S3 generics to efficiently compute lags and leads on time-series and panel data. The code below shows how to compute simple lags and leads on the classic Box & Jenkins airline data that comes with R.
# 1 lag
L(AirPassengers)
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1949 NA 112 118 132 129 121 135 148 148 136 119 104
# 1950 118 115 126 141 135 125 149 170 170 158 133 114
# 1951 140 145 150 178 163 172 178 199 199 184 162 146
# 1952 166 171 180 193 181 183 218 230 242 209 191 172
# 1953 194 196 196 236 235 229 243 264 272 237 211 180
# 1954 201 204 188 235 227 234 264 302 293 259 229 203
# 1955 229 242 233 267 269 270 315 364 347 312 274 237
# 1956 278 284 277 317 313 318 374 413 405 355 306 271
# 1957 306 315 301 356 348 355 422 465 467 404 347 305
# 1958 336 340 318 362 348 363 435 491 505 404 359 310
# 1959 337 360 342 406 396 420 472 548 559 463 407 362
# 1960 405 417 391 419 461 472 535 622 606 508 461 390
# 3 identical ways of computing 1 lag
all_identical(flag(AirPassengers), L(AirPassengers), F(AirPassengers,-1))
# [1] TRUE
# 3 identical ways of computing 1 lead
all_identical(flag(AirPassengers, -1), L(AirPassengers, -1), F(AirPassengers))
# [1] TRUE
# 1 lead and 3 lags - output as matrix
head(L(AirPassengers, -1:3))
# F1 -- L1 L2 L3
# [1,] 118 112 NA NA NA
# [2,] 132 118 112 NA NA
# [3,] 129 132 118 112 NA
# [4,] 121 129 132 118 112
# [5,] 135 121 129 132 118
# [6,] 148 135 121 129 132
# ... this is still a time-series object:
attributes(L(AirPassengers, -1:3))
# $tsp
# [1] 1949.000 1960.917 12.000
#
# $class
# [1] "ts" "matrix"
#
# $dim
# [1] 144 5
#
# $dimnames
# $dimnames[[1]]
# NULL
#
# $dimnames[[2]]
# [1] "F1" "--" "L1" "L2" "L3"
flag / L / F
also work well on (time-series) matrices. Below I run a regression with daily closing prices of major European stock indices: Germany DAX (Ibis), Switzerland SMI, France CAC, and UK FTSE. The data are sampled in business time, i.e. weekends and holidays are omitted.
str(EuStockMarkets)
# Time-Series [1:1860, 1:4] from 1991 to 1999: 1629 1614 1607 1621 1618 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:4] "DAX" "SMI" "CAC" "FTSE"
# Data is recorded on 260 days per year, 1991-1999
tsp(EuStockMarkets)
# [1] 1991.496 1998.646 260.000
freq <- frequency(EuStockMarkets)
# There is some obvious seasonality
plot(stl(EuStockMarkets[,"DAX"], freq))
# 1 annual lead and 1 annual lag
head(L(EuStockMarkets, -1:1*freq))
# F260.DAX DAX L260.DAX F260.SMI SMI L260.SMI F260.CAC CAC L260.CAC F260.FTSE FTSE
# [1,] 1755.98 1628.75 NA 1846.6 1678.1 NA 1907.3 1772.8 NA 2515.8 2443.6
# [2,] 1754.95 1613.63 NA 1854.8 1688.5 NA 1900.6 1750.5 NA 2521.2 2460.2
# [3,] 1759.90 1606.51 NA 1845.3 1678.6 NA 1880.9 1718.0 NA 2493.9 2448.2
# [4,] 1759.84 1621.04 NA 1854.5 1684.1 NA 1873.5 1708.1 NA 2476.1 2470.4
# [5,] 1776.50 1618.16 NA 1870.5 1686.6 NA 1883.6 1723.1 NA 2497.1 2484.7
# [6,] 1769.98 1610.61 NA 1862.6 1671.6 NA 1868.5 1714.3 NA 2469.0 2466.8
# L260.FTSE
# [1,] NA
# [2,] NA
# [3,] NA
# [4,] NA
# [5,] NA
# [6,] NA
# DAX regressed on its own 2 annual lags and the lags of the other indicators
summary(lm(DAX ~., data = L(EuStockMarkets, 0:2*freq)))
#
# Call:
# lm(formula = DAX ~ ., data = L(EuStockMarkets, 0:2 * freq))
#
# Residuals:
# Min 1Q Median 3Q Max
# -240.46 -51.28 -12.01 45.19 358.02
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -564.02041 93.94903 -6.003 2.49e-09 ***
# L260.DAX -0.12577 0.03002 -4.189 2.99e-05 ***
# L520.DAX -0.12528 0.04103 -3.053 0.00231 **
# SMI 0.32601 0.01726 18.890 < 2e-16 ***
# L260.SMI 0.27499 0.02517 10.926 < 2e-16 ***
# L520.SMI 0.04602 0.02602 1.769 0.07721 .
# CAC 0.59637 0.02349 25.389 < 2e-16 ***
# L260.CAC -0.14283 0.02763 -5.169 2.72e-07 ***
# L520.CAC 0.05196 0.03657 1.421 0.15557
# FTSE 0.01002 0.02403 0.417 0.67675
# L260.FTSE 0.04509 0.02807 1.606 0.10843
# L520.FTSE 0.10601 0.02717 3.902 0.00010 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 83.06 on 1328 degrees of freedom
# (520 observations deleted due to missingness)
# Multiple R-squared: 0.9943, Adjusted R-squared: 0.9942
# F-statistic: 2.092e+04 on 11 and 1328 DF, p-value: < 2.2e-16
The main innovation of flag / L / F
is the ability to efficiently compute sequences of lags and leads on panel-data, and that this panel-data need not be ordered:
# This lags all 4 series
head(L(wlddev, 1, ~iso3c, ~year, cols = 9:12))
# iso3c year L1.PCGDP L1.LIFEEX L1.GINI L1.ODA
# 1 AFG 1960 NA NA NA NA
# 2 AFG 1961 NA 32.292 NA 114440000
# 3 AFG 1962 NA 32.742 NA 233350000
# 4 AFG 1963 NA 33.185 NA 114880000
# 5 AFG 1964 NA 33.624 NA 236450000
# 6 AFG 1965 NA 34.060 NA 302480000
# Without t: Works here because data is ordered, but gives a message
head(L(wlddev, 1, ~iso3c, cols = 9:12))
# Panel-lag computed without timevar: Assuming ordered data
# iso3c L1.PCGDP L1.LIFEEX L1.GINI L1.ODA
# 1 AFG NA NA NA NA
# 2 AFG NA 32.292 NA 114440000
# 3 AFG NA 32.742 NA 233350000
# 4 AFG NA 33.185 NA 114880000
# 5 AFG NA 33.624 NA 236450000
# 6 AFG NA 34.060 NA 302480000
# 1 lead and 2 lags of GDP per Capita & Life Expectancy
head(L(wlddev, -1:2, PCGDP + LIFEEX ~ iso3c, ~year))
# iso3c year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX L2.LIFEEX
# 1 AFG 1960 NA NA NA NA 32.742 32.292 NA NA
# 2 AFG 1961 NA NA NA NA 33.185 32.742 32.292 NA
# 3 AFG 1962 NA NA NA NA 33.624 33.185 32.742 32.292
# 4 AFG 1963 NA NA NA NA 34.060 33.624 33.185 32.742
# 5 AFG 1964 NA NA NA NA 34.495 34.060 33.624 33.185
# 6 AFG 1965 NA NA NA NA 34.928 34.495 34.060 33.624
Behind the scenes this works by coercing the supplied panel-id (iso3c) and time-variable (year) to factor (or to a GRP object if multiple panel-ids or time-variables are supplied) and creating an ordering vector of the data. Panel-lags are then computed through the ordering vector while keeping track of individual groups and inserting NA
(or any other value passed to the fill
argument) in the right places. Thus the data need not be sorted to compute a fully-identified panel-lag, which is a key advantage over, say, the shift
function in data.table
. All of this is written very efficiently in C++, and comes with an additional benefit: If anything is wrong with the panel, i.e. there are repeated time-values within a group or jumps in the time-variable within a group, flag / L / F
will let you know. To give an example:
g <- c(1,1,1,2,2,2)
tryCatch(flag(1:6, 1, g, t = c(1,2,3,1,2,2)),
error = function(e) e)
# <Rcpp::exception in flag.default(1:6, 1, g, t = c(1, 2, 3, 1, 2, 2)): Repeated values of timevar within one or more groups>
tryCatch(flag(1:6, 1, g, t = c(1,2,3,1,2,4)),
error = function(e) e)
# <Rcpp::exception in flag.default(1:6, 1, g, t = c(1, 2, 3, 1, 2, 4)): Gaps in timevar within one or more groups>
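For completeness, a minimal sketch with hypothetical toy data showing that the lag is computed correctly even when rows are not sorted:
x <- c(3, 1, 2, 6, 4, 5)  # values
g <- c(1, 1, 1, 2, 2, 2)  # panel-id
t <- c(3, 1, 2, 3, 1, 2)  # unordered time within groups
flag(x, 1, g, t)          # lag returned in the original row order: 2 NA 1 5 NA 4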
Note that all of this does not require the panel to be balanced. flag / L / F
works fine on balanced and unbalanced panel data. One intended area of use, especially for the operators L
and F
, is to dramatically facilitate the implementation of dynamic models in various contexts. Below I show different ways L
can be used to estimate a dynamic panel-model using lm
:
# Different ways of regressing GDP on its lags and on life expectancy and its lags
# 1 - Precomputing lags
summary(lm(PCGDP ~ ., L(wlddev, 0:2, PCGDP + LIFEEX ~ iso3c, ~ year, keep.ids = FALSE)))
#
# Call:
# lm(formula = PCGDP ~ ., data = L(wlddev, 0:2, PCGDP + LIFEEX ~
# iso3c, ~year, keep.ids = FALSE))
#
# Residuals:
# Min 1Q Median 3Q Max
# -16621.0 -100.0 -17.2 86.2 11935.3
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -321.51378 63.37246 -5.073 4e-07 ***
# L1.PCGDP 1.31801 0.01061 124.173 <2e-16 ***
# L2.PCGDP -0.31550 0.01070 -29.483 <2e-16 ***
# LIFEEX -1.93638 38.24878 -0.051 0.960
# L1.LIFEEX 10.01163 71.20359 0.141 0.888
# L2.LIFEEX -1.66669 37.70885 -0.044 0.965
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 791.3 on 7988 degrees of freedom
# (4750 observations deleted due to missingness)
# Multiple R-squared: 0.9974, Adjusted R-squared: 0.9974
# F-statistic: 6.166e+05 on 5 and 7988 DF, p-value: < 2.2e-16
# 2 - Ad-hoc computation in lm formula
summary(lm(PCGDP ~ L(PCGDP,1:2,iso3c,year) + L(LIFEEX,0:2,iso3c,year), wlddev))
#
# Call:
# lm(formula = PCGDP ~ L(PCGDP, 1:2, iso3c, year) + L(LIFEEX, 0:2,
# iso3c, year), data = wlddev)
#
# Residuals:
# Min 1Q Median 3Q Max
# -16621.0 -100.0 -17.2 86.2 11935.3
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -321.51378 63.37246 -5.073 4e-07 ***
# L(PCGDP, 1:2, iso3c, year)L1 1.31801 0.01061 124.173 <2e-16 ***
# L(PCGDP, 1:2, iso3c, year)L2 -0.31550 0.01070 -29.483 <2e-16 ***
# L(LIFEEX, 0:2, iso3c, year)-- -1.93638 38.24878 -0.051 0.960
# L(LIFEEX, 0:2, iso3c, year)L1 10.01163 71.20359 0.141 0.888
# L(LIFEEX, 0:2, iso3c, year)L2 -1.66669 37.70885 -0.044 0.965
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 791.3 on 7988 degrees of freedom
# (4750 observations deleted due to missingness)
# Multiple R-squared: 0.9974, Adjusted R-squared: 0.9974
# F-statistic: 6.166e+05 on 5 and 7988 DF, p-value: < 2.2e-16
# 3 - Precomputing panel-identifiers
g = qF(wlddev$iso3c, na.exclude = FALSE)
t = qF(wlddev$year, na.exclude = FALSE)
summary(lm(PCGDP ~ L(PCGDP,1:2,g,t) + L(LIFEEX,0:2,g,t), wlddev))
#
# Call:
# lm(formula = PCGDP ~ L(PCGDP, 1:2, g, t) + L(LIFEEX, 0:2, g,
# t), data = wlddev)
#
# Residuals:
# Min 1Q Median 3Q Max
# -16621.0 -100.0 -17.2 86.2 11935.3
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -321.51378 63.37246 -5.073 4e-07 ***
# L(PCGDP, 1:2, g, t)L1 1.31801 0.01061 124.173 <2e-16 ***
# L(PCGDP, 1:2, g, t)L2 -0.31550 0.01070 -29.483 <2e-16 ***
# L(LIFEEX, 0:2, g, t)-- -1.93638 38.24878 -0.051 0.960
# L(LIFEEX, 0:2, g, t)L1 10.01163 71.20359 0.141 0.888
# L(LIFEEX, 0:2, g, t)L2 -1.66669 37.70885 -0.044 0.965
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 791.3 on 7988 degrees of freedom
# (4750 observations deleted due to missingness)
# Multiple R-squared: 0.9974, Adjusted R-squared: 0.9974
# F-statistic: 6.166e+05 on 5 and 7988 DF, p-value: < 2.2e-16
Similarly to flag / L / F
, fdiff / D
computes sequences of suitably lagged / leaded and iterated differences on ordered and unordered time-series and panel-data, and fgrowth / G
computes growth rates or log-differences. Using again the AirPassengers data, the seasonal decomposition shows significant seasonality:
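The decomposition plot itself is not reproduced here; a minimal sketch of how it could be generated with base R:
# Seasonal-trend decomposition of the airline passenger series
plot(stl(AirPassengers, "periodic"))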
We can actually test the statistical significance of this seasonality around a cubic trend using again the fast F-test (same as running a regression with and without seasonal dummies and a cubic polynomial trend, but faster):
fFtest(AirPassengers, qF(cycle(AirPassengers)), poly(seq_along(AirPassengers), 3))
# R-Sq. DF1 DF2 F-Stat. P-Value
# Full Model 0.965 14 129 250.585 0.000
# Restricted Model 0.862 3 140 291.593 0.000
# Exclusion Rest. 0.102 11 129 33.890 0.000
The test shows significant seasonality. We can plot the series and the ordinary and seasonal (12-month) growth rate using:
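The original figure is not reproduced here; a minimal sketch using G and the default time-series plot method:
# Level, monthly and annual (12-month) growth rates
plot(G(AirPassengers, c(0, 1, 12)))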
It is evident that taking the annualized growth rate removes most of the periodic behavior. We can also compute second differences or growth rates of growth rates. Below I plot the ordinary and annual first and second differences of the data:
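Again only a sketch of such a plot (my plotting choice, not the original figure):
# Ordinary and annual (12-month) first and second differences
plot(D(AirPassengers, c(1, 12), 1:2))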
In general, both
fdiff / D
and fgrowth / G
can compute sequences of lagged / leaded and iterated differences and growth rates, as the code below shows:
# sequence of leaded/lagged and iterated differences
head(D(AirPassengers, -2:2, 1:3))
# F2D1 F2D2 F2D3 FD1 FD2 FD3 -- D1 D2 D3 L2D1 L2D2 L2D3
# [1,] -20 -31 -69 -6 8 25 112 NA NA NA NA NA NA
# [2,] -11 -5 -12 -14 -17 -12 118 6 NA NA NA NA NA
# [3,] 11 38 77 3 -5 -27 132 14 8 NA 20 NA NA
# [4,] -6 7 49 8 22 23 129 -3 -17 -25 11 NA NA
# [5,] -27 -39 -19 -14 -1 12 121 -8 -5 12 -11 -31 NA
# [6,] -13 -42 -70 -13 -13 -1 135 14 22 27 6 -5 NA
All of this also works for panel-data. The code below gives an example:
y = 1:10
g = rep(1:2, each = 5)
t = rep(1:5, 2)
D(y, -2:2, 1:2, g, t)
# F2D1 F2D2 FD1 FD2 -- D1 D2 L2D1 L2D2
# [1,] -2 0 -1 0 1 NA NA NA NA
# [2,] -2 NA -1 0 2 1 NA NA NA
# [3,] -2 NA -1 0 3 1 0 2 NA
# [4,] NA NA -1 NA 4 1 0 2 NA
# [5,] NA NA NA NA 5 1 0 2 0
# [6,] -2 0 -1 0 6 NA NA NA NA
# [7,] -2 NA -1 0 7 1 NA NA NA
# [8,] -2 NA -1 0 8 1 0 2 NA
# [9,] NA NA -1 NA 9 1 0 2 NA
# [10,] NA NA NA NA 10 1 0 2 0
# attr(,"class")
# [1] "matrix"
The attached class-attribute allows calls of flag / L / F
, fdiff / D
and fgrowth / G
to be nested. In the example below, L.matrix
is called on the right half of the above sequence:
L(D(y, 0:2, 1:2, g, t), 0:1, g, t)
# -- L1.-- D1 L1.D1 D2 L1.D2 L2D1 L1.L2D1 L2D2 L1.L2D2
# [1,] 1 NA NA NA NA NA NA NA NA NA
# [2,] 2 1 1 NA NA NA NA NA NA NA
# [3,] 3 2 1 1 0 NA 2 NA NA NA
# [4,] 4 3 1 1 0 0 2 2 NA NA
# [5,] 5 4 1 1 0 0 2 2 0 NA
# [6,] 6 NA NA NA NA NA NA NA NA NA
# [7,] 7 6 1 NA NA NA NA NA NA NA
# [8,] 8 7 1 1 0 NA 2 NA NA NA
# [9,] 9 8 1 1 0 0 2 2 NA NA
# [10,] 10 9 1 1 0 0 2 2 0 NA
# attr(,"class")
# [1] "matrix"
If n * diff
(or n
in flag / L / F
) exceeds the length of the data or the average group size in panel-computations, all of these functions will throw appropriate errors:
tryCatch(D(y, 3, 2, g, t), error = function(e) e)
# <Rcpp::exception in fdiff.default(x, n, diff, g, t, fill, stubs, ...): abs(n * diff) exceeds average group size: 5>
Of course fdiff / D
and fgrowth / G
also come with a data.frame method, making the computation of growth-variables on datasets very easy:
head(G(wlddev, 0:1, 1, PCGDP + LIFEEX ~ iso3c, ~year))
# iso3c year PCGDP G1.PCGDP LIFEEX G1.LIFEEX
# 1 AFG 1960 NA NA 32.292 NA
# 2 AFG 1961 NA NA 32.742 1.393534
# 3 AFG 1962 NA NA 33.185 1.353002
# 4 AFG 1963 NA NA 33.624 1.322887
# 5 AFG 1964 NA NA 34.060 1.296693
# 6 AFG 1965 NA NA 34.495 1.277158
head(G(GGDC10S, 1, 1, ~ Variable + Country, ~ Year, cols = 6:10))
# # A tibble: 6 x 8
# Variable Country Year G1.AGR G1.MIN G1.MAN G1.PU G1.CON
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 1960 NA NA NA NA NA
# 2 VA BWA 1961 NA NA NA NA NA
# 3 VA BWA 1962 NA NA NA NA NA
# 4 VA BWA 1963 NA NA NA NA NA
# 5 VA BWA 1964 NA NA NA NA NA
# 6 VA BWA 1965 -3.52 -28.6 38.2 29.4 104.
One could also add variables by reference using data.table:
head(qDT(wlddev)[, paste0("G.", names(wlddev)[9:12]) := fgrowth(.SD,1,1,iso3c,year), .SDcols = 9:12])
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 1: Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.292 NA 114440000
# 2: Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.742 NA 233350000
# 3: Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.185 NA 114880000
# 4: Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.624 NA 236450000
# 5: Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.060 NA 302480000
# 6: Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.495 NA 370250000
# G.PCGDP G.LIFEEX G.GINI G.ODA
# 1: NA NA NA NA
# 2: NA 1.393534 NA 103.90598
# 3: NA 1.353002 NA -50.76923
# 4: NA 1.322887 NA 105.82347
# 5: NA 1.296693 NA 27.92557
# 6: NA 1.277158 NA 22.40479
When working with data.table it is important to realize that while collapse functions will work with data.table grouping using by
or keyby
, this is very slow because it will run a method dispatch for every group. It is much better and more secure to utilize the functions' fast internal grouping facilities, as I have done in the above example.
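A hypothetical contrast of the two approaches (no timings shown; the column names are my own):
dt <- qDT(wlddev)
# Slow: fgrowth is dispatched once for every country group
dt[, G.PCGDP_slow := fgrowth(PCGDP), by = iso3c]
# Fast: a single call, with grouping and ordering handled internally by collapse
dt[, G.PCGDP_fast := fgrowth(PCGDP, 1, 1, iso3c, year)]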
The code below estimates a dynamic panel model regressing the 10-year growth rate of GDP per capita on its 10-year lagged level and the 10-year growth rate of life expectancy:
summary(lm(G(PCGDP,10,1,iso3c,year) ~
L(PCGDP,10,iso3c,year) +
G(LIFEEX,10,1,iso3c,year), data = wlddev))
#
# Call:
# lm(formula = G(PCGDP, 10, 1, iso3c, year) ~ L(PCGDP, 10, iso3c,
# year) + G(LIFEEX, 10, 1, iso3c, year), data = wlddev)
#
# Residuals:
# Min 1Q Median 3Q Max
# -104.75 -22.95 -4.39 13.21 1724.57
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.724e+01 1.206e+00 22.586 < 2e-16 ***
# L(PCGDP, 10, iso3c, year) -3.166e-04 5.512e-05 -5.743 9.72e-09 ***
# G(LIFEEX, 10, 1, iso3c, year) 5.166e-01 1.250e-01 4.134 3.61e-05 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 61.08 on 6498 degrees of freedom
# (6243 observations deleted due to missingness)
# Multiple R-squared: 0.009558, Adjusted R-squared: 0.009253
# F-statistic: 31.35 on 2 and 6498 DF, p-value: 2.812e-14
To go even a step further, the code below regresses the 10-year growth rate of GDP on the 10-year lagged levels and 10-year growth rates of GDP and life expectancy, with country and time-fixed effects projected out using HDW
. The standard errors are unreliable without bootstrapping, but this example nicely demonstrates the potential for complex estimations brought by collapse.
moddat <- HDW(L(G(wlddev, c(0, 10), 1, ~iso3c, ~year, 9:10), c(0, 10), ~iso3c, ~year), ~iso3c + qF(year))[-c(1,5)]
summary(lm(HDW.L10G1.PCGDP ~. , moddat))
#
# Call:
# lm(formula = HDW.L10G1.PCGDP ~ ., data = moddat)
#
# Residuals:
# Min 1Q Median 3Q Max
# -448.32 -11.07 -0.22 10.32 578.09
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.487e-15 4.402e-01 0.000 1
# HDW.L10.PCGDP -2.135e-03 1.263e-04 -16.902 < 2e-16 ***
# HDW.L10.L10G1.PCGDP -6.983e-01 9.520e-03 -73.355 < 2e-16 ***
# HDW.L10.LIFEEX 1.495e+00 2.744e-01 5.449 5.32e-08 ***
# HDW.L10G1.LIFEEX 7.979e-01 1.070e-01 7.459 1.04e-13 ***
# HDW.L10.L10G1.LIFEEX 8.709e-01 1.034e-01 8.419 < 2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 30.04 on 4651 degrees of freedom
# Multiple R-squared: 0.5652, Adjusted R-squared: 0.5647
# F-statistic: 1209 on 5 and 4651 DF, p-value: < 2.2e-16
How long did it take to run this computation? About 4 milliseconds on my laptop (2x 2.2 GHz, 8 GB RAM), so there is plenty of room to do this with much larger data.
microbenchmark(HDW(L(G(wlddev, c(0, 10), 1, ~iso3c, ~year, 9:10), c(0, 10), ~iso3c, ~year), ~iso3c + qF(year)))
# Unit: milliseconds
# expr
# HDW(L(G(wlddev, c(0, 10), 1, ~iso3c, ~year, 9:10), c(0, 10), ~iso3c, ~year), ~iso3c + qF(year))
# min lq mean median uq max neval
# 4.821263 5.024528 5.124693 5.105523 5.178707 7.136843 100
One of the inconveniences of the above computations is that they require declaring the panel-identifiers iso3c
and year
again and again for each function. A great remedy here is provided by the plm classes pseries and pdata.frame, which collapse was built to support. To advocate for the use of these classes for panel-data, here I show how one could run the same regression with plm:
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c", "year"))
moddat <- HDW(L(G(pwlddev, c(0, 10), 1, 9:10), c(0, 10)))[-c(1,5)]
summary(lm(HDW.L10G1.PCGDP ~. , moddat))
#
# Call:
# lm(formula = HDW.L10G1.PCGDP ~ ., data = moddat)
#
# Residuals:
# Min 1Q Median 3Q Max
# -241.41 -13.16 -1.09 11.25 793.22
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.4136682 0.4932531 0.839 0.402
# HDW.L10.PCGDP -0.0020196 0.0001312 -15.392 < 2e-16 ***
# HDW.L10.L10G1.PCGDP -0.6908712 0.0106041 -65.151 < 2e-16 ***
# HDW.L10.LIFEEX 1.2204125 0.2512625 4.857 1.23e-06 ***
# HDW.L10G1.LIFEEX 0.7711891 0.1109986 6.948 4.23e-12 ***
# HDW.L10.L10G1.LIFEEX 0.9303752 0.1066535 8.723 < 2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 33.62 on 4651 degrees of freedom
# (8087 observations deleted due to missingness)
# Multiple R-squared: 0.5078, Adjusted R-squared: 0.5073
# F-statistic: 959.8 on 5 and 4651 DF, p-value: < 2.2e-16
To learn more about the integration of collapse and plm, see the corresponding vignette.
Below I provide some benchmarks for lags, differences and growth rates on panel-data. I will run microbenchmarks on the wlddev
dataset. Benchmarks on larger panels are already provided in the other vignettes. Again I compare collapse to dplyr and data.table:
# We have a balanced panel of 216 countries, each observed for 59 years
descr(wlddev, cols = c("iso3c", "year"))
# Dataset: wlddev, 2 Variables, N = 12744
# -----------------------------------------------------------------------------------------------------
# iso3c (factor): Country Code
# Stats:
# N Ndist
# 12744 216
# Table:
# ABW AFG AGO ALB AND ARE
# Freq 59 59 59 59 59 59
# Perc 0.46 0.46 0.46 0.46 0.46 0.46
# ---
# VUT WSM XKX YEM ZAF ZMB ZWE
# Freq 59 59 59 59 59 59 59
# Perc 0.46 0.46 0.46 0.46 0.46 0.46 0.46
#
# Summary of Table:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 59 59 59 59 59 59
# -----------------------------------------------------------------------------------------------------
# year (numeric):
# Stats:
# N Ndist Mean SD Min Max Skew Kurt
# 12744 59 1989 17.03 1960 2018 -0 1.8
# Quant:
# 1% 5% 25% 50% 75% 95% 99%
# 1960 1962 1974 1989 2004 2016 2018
# -----------------------------------------------------------------------------------------------------
# 1 Panel-Lag
suppressMessages(
microbenchmark(dplyr_not_ordered = wlddev %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(lag),
dplyr_ordered = wlddev %>% arrange(iso3c,year) %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(lag),
data.table_not_ordered = dtwlddev[, shift(.SD), keyby = iso3c, .SDcols = 9:12],
data.table_ordered = dtwlddev[order(year), shift(.SD), keyby = iso3c, .SDcols = 9:12],
collapse_not_ordered = L(wlddev, 1, ~iso3c, cols = 9:12),
collapse_ordered = L(wlddev, 1, ~iso3c, ~year, cols = 9:12),
subtract_from_CNO = message("Panel-lag computed without timevar: Assuming ordered data")))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr_not_ordered 23116.533 23584.2005 30255.9278 24521.0985 26106.1705 307177.852 100 c
# dplyr_ordered 28002.501 29062.3400 31068.8398 29727.4725 31726.6630 73125.748 100 c
# data.table_not_ordered 4695.420 4914.7515 5993.6894 5087.4495 5271.9725 48162.196 100 b
# data.table_ordered 5737.410 6051.5675 6450.9372 6223.8195 6526.1525 8039.156 100 b
# collapse_not_ordered 320.852 425.9440 466.7711 475.2540 505.5990 592.617 100 a
# collapse_ordered 602.434 676.9585 709.4094 701.0560 752.3745 855.458 100 a
# subtract_from_CNO 166.004 230.4875 270.6094 289.6145 304.7875 348.073 100 a
# Sequence of 1 lead and 3 lags: Not possible in dplyr
microbenchmark(data.table_not_ordered = dtwlddev[, shift(.SD, -1:3), keyby = iso3c, .SDcols = 9:12],
data.table_ordered = dtwlddev[order(year), shift(.SD, -1:3), keyby = iso3c, .SDcols = 9:12],
collapse_ordered = L(wlddev, -1:3, ~iso3c, ~year, cols = 9:12))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# data.table_not_ordered 5970.351 6231.629 7077.991 6483.313 6663.374 64301.642 100 b
# data.table_ordered 7121.224 7315.788 8157.786 7557.432 7740.840 67464.647 100 b
# collapse_ordered 888.034 950.508 1006.896 1000.712 1073.673 1284.302 100 a
# 1 Panel-difference
microbenchmark(dplyr_not_ordered = wlddev %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(function(x) x - lag(x)),
dplyr_ordered = wlddev %>% arrange(iso3c,year) %>% group_by(iso3c) %>% select_at(9:12) %>% mutate_all(function(x) x - lag(x)),
data.table_not_ordered = dtwlddev[, lapply(.SD, function(x) x - shift(x)), keyby = iso3c, .SDcols = 9:12],
data.table_ordered = dtwlddev[order(year), lapply(.SD, function(x) x - shift(x)), keyby = iso3c, .SDcols = 9:12],
collapse_ordered = D(wlddev, 1, 1, ~iso3c, ~year, cols = 9:12))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr_not_ordered 24116.128 24905.317 28815.4084 25588.3000 27158.200 70433.981 100 c
# dplyr_ordered 29077.066 30000.129 33739.8743 30787.5340 32690.558 77775.651 100 d
# data.table_not_ordered 14059.486 14690.257 16571.3306 15549.7310 15839.122 58180.460 100 b
# data.table_ordered 15069.791 15742.287 16955.6035 16578.3325 16998.475 55906.827 100 b
# collapse_ordered 624.301 712.212 754.2083 733.4085 794.098 1032.618 100 a
# Iterated Panel-Difference: Not straightforward in dplyr or data.table
microbenchmark(collapse_ordered = D(wlddev, 1, 2, ~iso3c, ~year, cols = 9:12))
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse_ordered 763.977 780.934 817.3077 812.1715 840.7315 1004.951 100
# Sequence of Lagged/Leaded Differences: Not straightforward in dplyr or data.table
microbenchmark(collapse_ordered = D(wlddev, -1:3, 1, ~iso3c, ~year, cols = 9:12))
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse_ordered 983.531 991.786 1070.527 1028.379 1118.744 1296.35 100
# Sequence of Lagged/Leaded and Iterated Differences: Not straightforward in dplyr or data.table
microbenchmark(collapse_ordered = D(wlddev, -1:3, 1:2, ~iso3c, ~year, cols = 9:12))
# Unit: milliseconds
# expr min lq mean median uq max neval
# collapse_ordered 2.015702 2.094688 2.284892 2.268056 2.360875 3.634243 100
# The same applies to growth rates or log-differences.
microbenchmark(collapse_ordered_growth = G(wlddev, 1, 1, ~iso3c, ~year, cols = 9:12),
collapse_ordered_logdiff = G(wlddev, 1, 1, ~iso3c, ~year, cols = 9:12, logdiff = TRUE))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# collapse_ordered_growth 761.299 778.4795 837.4469 828.459 880.001 1173.186 100 a
# collapse_ordered_logdiff 2897.041 2917.7920 3103.7793 3009.050 3302.681 3686.454 100 b
The results are similar to the grouped transformations: collapse dramatically facilitates and speeds up these complex operations in R. Again plm classes are very useful to avoid having to specify panel-identifiers all the time. See the ‘collapse and plm’ vignette for more details.
collapse also provides an ensemble of list-processing functions that grew out of a necessity of working with complex nested lists of data objects. The example provided in this section is also somewhat complex, but it demonstrates the utility of these functions while also providing a nice data-transformation task. When summarizing the GGDC10S
data in section 1, it became clear that certain sectors have a high share of economic activity in almost all countries in the sample. The application I devised for this section is to see if there are common patterns in the interaction of these important sectors across countries. The approach for this will be an attempt at running a (Structural) Panel-Vector-Autoregression (SVAR) in value added with the 6 most important sectors (excluding government): agriculture, manufacturing, wholesale and retail trade, construction, transport and storage, and finance and real estate.
For this I will use the vars package. Since vars natively does not support panel-VAR, we need to create the central varest object manually and then run the SVAR
function to impose identification restrictions. We start off exploring and harmonizing the data:
library(vars)
# The 6 most important non-government sectors (see section 1)
sec <- c("AGR","MAN","WRT","CON","TRA","FIRE")
# This creates a data.table containing the value added of the 6 most important non-government sectors
data <- qDT(GGDC10S)[Variable == "VA"] %>% get_vars(c("Country","Year", sec)) %>% na.omit
# Let's look at the log VA in agriculture across countries:
AGRmat <- log(psmat(data, AGR ~ Country, ~ Year, transpose = TRUE)) # Converting to panel-series matrix
plot(AGRmat)
The plot shows quite some heterogeneity, both in the levels (VA is in local currency) and in trend growth rates. In the panel-VAR estimation we are only really interested in the sectoral relationships within countries, so the data need to be harmonized further. One way would be taking growth rates or log-differences of the data, but VARs are usually estimated in levels unless the data are cointegrated (and value added series do not, in general, exhibit unit-root behavior). Thus to harmonize the data further I opt for subtracting a country-sector-specific cubic trend from the data in logs:
# Subtracting a country specific cubic growth trend
AGRmat <- dapply(AGRmat, fHDwithin, poly(seq_row(AGRmat), 3), fill = TRUE)
plot(AGRmat)
This seems to have done a decent job in curbing some of that heterogeneity. Some series however have a high variance around that cubic trend. Therefore as a final step I standardize the data to bring the variances in line:
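The code for this standardization step is not shown above; a minimal sketch (fscale operates column-wise, i.e. on each country series of the matrix):
# Standardizing each country series
AGRmat <- fscale(AGRmat)
plot(AGRmat)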
Now this looks pretty good, and is about the most we can do in terms of harmonization without differencing the data. Below I apply these transformations to all sectors:
# Taking logs
get_vars(data, 3:8) <- dapply(get_vars(data, 3:8), log)
# Iteratively projecting out country FE and cubic trends from complete cases (still very slow)
get_vars(data, 3:8) <- HDW(data, ~ qF(Country)*poly(Year, 3), fill = TRUE)
# Scaling
get_vars(data, 3:8) <- STD(data, ~ Country, cols = 3:8, keep.by = FALSE)
# Check the plot
plot(psmat(data, ~Country, ~Year))
Since the data is annual, let us estimate the Panel-VAR with one lag:
# This adds one lag of all series to the data
add_vars(data) <- L(data, 1, ~ Country, ~ Year, keep.ids = FALSE)
# This removes missing values from all but the first row and drops identifier columns (vars is made for time-series without gaps)
data <- rbind(data[1, -(1:2)], na.omit(data[-1, -(1:2)]))
head(data)
# STD.HDW.AGR STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE L1.STD.HDW.AGR
# 1: 0.65713943 2.2350583 1.946383 -0.03574399 1.0877811 1.0476507 NA
# 2: -0.14377115 1.8693570 1.905081 1.23225734 1.0542315 0.9105622 0.65713943
# 3: -0.09209879 -0.8212004 1.997253 -0.01783824 0.6718465 0.6134260 -0.14377115
# 4: -0.25213869 -1.7830320 -1.970855 -2.68332505 -1.8475551 0.4382902 -0.09209879
# 5: -0.31623401 -4.2931567 -1.822211 -2.75551916 -0.7066491 -2.1982640 -0.25213869
# 6: -0.72691916 -1.3219387 -2.079333 -0.12148295 -1.1398220 -2.2230474 -0.31623401
# L1.STD.HDW.MAN L1.STD.HDW.WRT L1.STD.HDW.CON L1.STD.HDW.TRA L1.STD.HDW.FIRE
# 1: NA NA NA NA NA
# 2: 2.2350583 1.946383 -0.03574399 1.0877811 1.0476507
# 3: 1.8693570 1.905081 1.23225734 1.0542315 0.9105622
# 4: -0.8212004 1.997253 -0.01783824 0.6718465 0.6134260
# 5: -1.7830320 -1.970855 -2.68332505 -1.8475551 0.4382902
# 6: -4.2931567 -1.822211 -2.75551916 -0.7066491 -2.1982640
Having prepared the data, the code below estimates the panel-VAR using lm
and creates the varest object:
# saving the names of the 6 sectors
nam <- names(data)[1:6]
pVAR <- list(varresult = setNames(lapply(seq_len(6), function(i) # list of 6 lm's each regressing
lm(as.formula(paste0(nam[i], "~ -1 + . ")), # the sector on all lags of
get_vars(data, c(i, 7:length(data)))[-1])), nam), # itself and other sectors, removing the missing first row
datamat = data[-1], # The full data containing levels and lags of the sectors, removing the missing first row
y = do.call(cbind, get_vars(data, 1:6)), # Only the levels data as matrix
type = "none", # No constant or trend term: we harmonized the data already
p = 1, # The lag-order
K = 6, # The number of variables
obs = nrow(data)-1, # The number of non-missing obs
totobs = nrow(data), # The total number of obs
restrictions = NULL,
call = quote(VAR(y = data)))
class(pVAR) <- "varest"
The significant serial-correlation test below suggests that the panel-VAR with one lag is ill-identified, but the sample size is also quite large so the test is prone to reject, and the test is likely also still picking up remaining cross-sectional heterogeneity. For the purposes of this vignette this shall not bother us.
serial.test(pVAR)
#
# Portmanteau Test (asymptotic)
#
# data: Residuals of VAR object pVAR
# Chi-squared = 1678.9, df = 540, p-value < 2.2e-16
By default the VAR is identified using a Choleski ordering of the direct impact matrix, in which the first variable (here Agriculture) is assumed not to be directly impacted by any other sector in the current period, and this descends to the last variable (Finance and Real Estate), which is assumed to be impacted by all other sectors in the current period. For structural identification it is usually necessary to impose restrictions on the direct impact matrix in line with economic theory. I do not have any theories on the average worldwide interaction of broad economic sectors, but to aid identification I will compute the correlation matrix in growth rates and restrict the lowest coefficients to be 0, which should be better than just imposing a random Choleski ordering. This will also enable me to give a demonstration of the grouped tibble methods for collapse functions, discussed in more detail in the ‘collapse and dplyr’ vignette:
# This computes the pairwise correlations between standardized sectoral growth rates across countries
corr <- filter(GGDC10S, Variable == "VA") %>% # Subset rows: Only VA
group_by(Country) %>% # Group by country
get_vars(sec) %>% # Select the 6 sectors
fgrowth %>% # Compute sectoral growth rates (a time variable can be passed, but it is not necessary here as the data is ordered)
fscale %>% # Scale and center (i.e. standardize)
pwcor # Compute Pairwise correlations
corr
# G1.AGR G1.MAN G1.WRT G1.CON G1.TRA G1.FIRE
# G1.AGR 1 .55 .59 .39 .52 .41
# G1.MAN .55 1 .67 .54 .65 .48
# G1.WRT .59 .67 1 .56 .66 .52
# G1.CON .39 .54 .56 1 .53 .46
# G1.TRA .52 .65 .66 .53 1 .51
# G1.FIRE .41 .48 .52 .46 .51 1
# We need to impose K*(K-1)/2 = 15 (with K = 6 variables) restrictions for identification
corr[corr <= sort(corr)[15]] <- 0
corr
# G1.AGR G1.MAN G1.WRT G1.CON G1.TRA G1.FIRE
# G1.AGR 1 .55 .59 .00 .00 .00
# G1.MAN .55 1 .67 .54 .65 .00
# G1.WRT .59 .67 1 .56 .66 .00
# G1.CON .00 .54 .56 1 .00 .00
# G1.TRA .00 .65 .66 .00 1 .00
# G1.FIRE .00 .00 .00 .00 .00 1
# The rest is unknown (i.e. will be estimated)
corr[corr > 0 & corr < 1] <- NA
# This estimates the Panel-SVAR using Maximum Likelihood:
pSVAR <- SVAR(pVAR, Amat = unclass(corr), estmethod = "direct")
pSVAR
#
# SVAR Estimation Results:
# ========================
#
#
# Estimated A matrix:
# STD.HDW.AGR STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE
# STD.HDW.AGR 1.00000 -0.58705 -0.2490 0.0000 0.00000 0
# STD.HDW.MAN 0.45708 1.00000 0.2374 0.1524 -1.23083 0
# STD.HDW.WRT 0.09161 -1.31439 1.0000 2.2581 -0.08235 0
# STD.HDW.CON 0.00000 0.01723 -1.3247 1.0000 0.00000 0
# STD.HDW.TRA 0.00000 0.90374 -0.3327 0.0000 1.00000 0
# STD.HDW.FIRE 0.00000 0.00000 0.0000 0.0000 0.00000 1
Now this object is quite involved, which brings us to the actual subject of this section:
# pSVAR$var$varresult is a list containing the 6 linear models fitted above; it is not displayed in full here.
str(pSVAR, give.attr = FALSE, max.level = 3)
# List of 13
# $ A : num [1:6, 1:6] 1 0.4571 0.0916 0 0 ...
# $ Ase : num [1:6, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
# $ B : num [1:6, 1:6] 1 0 0 0 0 0 0 1 0 0 ...
# $ Bse : num [1:6, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
# $ LRIM : NULL
# $ Sigma.U: num [1:6, 1:6] 97.705 12.717 13.025 0.984 26.992 ...
# $ LR :List of 5
# ..$ statistic: Named num 6218
# ..$ parameter: Named num 7
# ..$ p.value : Named num 0
# ..$ method : chr "LR overidentification"
# ..$ data.name: symbol data
# $ opt :List of 5
# ..$ par : num [1:14] 0.4571 0.0916 -0.587 -1.3144 0.0172 ...
# ..$ value : num 11538
# ..$ counts : Named int [1:2] 501 NA
# ..$ convergence: int 1
# ..$ message : NULL
# $ start : num [1:14] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
# $ type : chr "A-model"
# $ var :List of 10
# ..$ varresult :List of 6
# .. ..$ STD.HDW.AGR :List of 12
# .. ..$ STD.HDW.MAN :List of 12
# .. ..$ STD.HDW.WRT :List of 12
# .. ..$ STD.HDW.CON :List of 12
# .. ..$ STD.HDW.TRA :List of 12
# .. ..$ STD.HDW.FIRE:List of 12
# ..$ datamat :Classes 'data.table' and 'data.frame': 2060 obs. of 12 variables:
# .. ..$ STD.HDW.AGR : num [1:2060] -0.1438 -0.0921 -0.2521 -0.3162 -0.7269 ...
# .. ..$ STD.HDW.MAN : num [1:2060] 1.869 -0.821 -1.783 -4.293 -1.322 ...
# .. ..$ STD.HDW.WRT : num [1:2060] 1.91 2 -1.97 -1.82 -2.08 ...
# .. ..$ STD.HDW.CON : num [1:2060] 1.2323 -0.0178 -2.6833 -2.7555 -0.1215 ...
# .. ..$ STD.HDW.TRA : num [1:2060] 1.054 0.672 -1.848 -0.707 -1.14 ...
# .. ..$ STD.HDW.FIRE : num [1:2060] 0.911 0.613 0.438 -2.198 -2.223 ...
# .. ..$ L1.STD.HDW.AGR : num [1:2060] 0.6571 -0.1438 -0.0921 -0.2521 -0.3162 ...
# .. ..$ L1.STD.HDW.MAN : num [1:2060] 2.235 1.869 -0.821 -1.783 -4.293 ...
# .. ..$ L1.STD.HDW.WRT : num [1:2060] 1.95 1.91 2 -1.97 -1.82 ...
# .. ..$ L1.STD.HDW.CON : num [1:2060] -0.0357 1.2323 -0.0178 -2.6833 -2.7555 ...
# .. ..$ L1.STD.HDW.TRA : num [1:2060] 1.088 1.054 0.672 -1.848 -0.707 ...
# .. ..$ L1.STD.HDW.FIRE: num [1:2060] 1.048 0.911 0.613 0.438 -2.198 ...
# ..$ y : num [1:2061, 1:6] 0.6571 -0.1438 -0.0921 -0.2521 -0.3162 ...
# ..$ type : chr "none"
# ..$ p : num 1
# ..$ K : num 6
# ..$ obs : num 2060
# ..$ totobs : int 2061
# ..$ restrictions: NULL
# ..$ call : language VAR(y = data)
# $ iter : Named int 501
# $ call : language SVAR(x = pVAR, estmethod = "direct", Amat = unclass(corr))
When dealing with such a list-like object, we might be interested in its complexity by measuring the level of nesting. This can be done with ldepth
:
# The list-tree of this object has 5 levels of nesting
ldepth(pSVAR)
# [1] 5
# This data has a depth of 1, thus this dataset does not contain list-columns
ldepth(data)
# [1] 1
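For intuition, a quick toy sketch of how ldepth counts nesting (the expected values are noted as comments):
ldepth(list(1:3)) # expected: 1, a flat list of atomic elements
ldepth(list(list(1:3))) # expected: 2
ldepth(list(list(list(1:3)))) # expected: 3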
Further we might be interested in knowing whether this list-object contains non-atomic elements like call, terms or formulas. The function is.regular
in the collapse package checks if an object is atomic or list-like, and the recursive version is.unlistable
checks whether all objects in a nested structure are atomic or list-like:
# Is this object composed only of atomic elements e.g. can it be unlisted?
is.unlistable(pSVAR)
# [1] FALSE
Evidently this object is not unlistable; from viewing its structure we know that it contains several call and terms objects. We might also want to know if this object saves some kind of residuals or fitted values. This can be done using has_elem
, which also supports regular expression search of element names:
# Does this object contain an element with "fitted" in its name?
has_elem(pSVAR, "fitted", regex = TRUE)
# [1] TRUE
# Does this object contain an element with "residuals" in its name?
has_elem(pSVAR, "residuals", regex = TRUE)
# [1] TRUE
We might also want to know whether the object contains some kind of data-matrix. This can be checked by calling:
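For instance, a minimal check using the element name visible in the structure above (exact name matching is used when regex is not set):
# Does the object contain an element named "datamat" somewhere in its list-tree?
has_elem(pSVAR, "datamat")
# This should return TRUE, since pSVAR$var$datamat exists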
These functions can sometimes be helpful in exploring objects, although for all practical purposes the object viewer in RStudio is also very informative. A much greater advantage of having functions to search and check lists is the ability to write more complex programs with them (which I will not demonstrate here).
Having gathered some information about the pSVAR
object in the previous section, this section introduces several extractor functions to pull out elements from such lists: get_elem
can be used to pull out elements from lists in a simplified format [11].
# This is the path to the residuals from a single equation
str(pSVAR$var$varresult$STD.HDW.AGR$residuals)
# Named num [1:2060] -0.722 -0.203 -0.223 0.083 -0.151 ...
# - attr(*, "names")= chr [1:2060] "1" "2" "3" "4" ...
# get_elem gets the residuals from all 6 equations and puts them in a top-level list
resid <- get_elem(pSVAR, "residuals")
str(resid, give.attr = FALSE)
# List of 6
# $ STD.HDW.AGR : Named num [1:2060] -0.722 -0.203 -0.223 0.083 -0.151 ...
# $ STD.HDW.MAN : Named num [1:2060] 0.362 -1.982 -1.144 -3.092 1.481 ...
# $ STD.HDW.WRT : Named num [1:2060] 0.388 0.648 -3.065 -0.419 -0.428 ...
# $ STD.HDW.CON : Named num [1:2060] 1.054 -1.071 -2.631 -0.626 2.26 ...
# $ STD.HDW.TRA : Named num [1:2060] 0.167 -0.238 -2.248 0.847 -0.139 ...
# $ STD.HDW.FIRE: Named num [1:2060] -0.0949 -0.3082 0.108 -2.1209 -0.0563 ...
# Quick conversion to matrix and plotting
plot.ts(qM(resid), main = "Panel-VAR Residuals")
Similarly, we could pull out and plot the fitted values:
# Regular expression search and retrieval of fitted values
plot.ts(qM(get_elem(pSVAR, "^fi", regex = TRUE)), main = "Panel-VAR Fitted Values")
Below I compute the main quantities of interest in SVAR analysis: the impulse response functions (IRF’s) and forecast error variance decompositions (FEVD’s):
# This computes orthogonalized impulse response functions
pIRF <- irf(pSVAR)
# This computes the forecast error variance decompositions
pFEVD <- fevd(pSVAR)
The pIRF
object contains the IRF’s with lower and upper confidence bounds and some atomic elements providing information about the object:
# See the structure of a vars IRF object:
str(pIRF, give.attr = FALSE)
# List of 11
# $ irf :List of 6
# ..$ STD.HDW.AGR : num [1:11, 1:6] 0.87 0.531 0.33 0.21 0.138 ...
# ..$ STD.HDW.MAN : num [1:11, 1:6] 0.274 0.1892 0.1385 0.1059 0.0833 ...
# ..$ STD.HDW.WRT : num [1:11, 1:6] 0.0526 0.0514 0.0463 0.0399 0.0335 ...
# ..$ STD.HDW.CON : num [1:11, 1:6] -0.1605 -0.1051 -0.0688 -0.0451 -0.0297 ...
# ..$ STD.HDW.TRA : num [1:11, 1:6] 0.342 0.258 0.199 0.155 0.123 ...
# ..$ STD.HDW.FIRE: num [1:11, 1:6] 0 0.0217 0.0263 0.0239 0.0191 ...
# $ Lower :List of 6
# ..$ STD.HDW.AGR : num [1:11, 1:6] 0.429 0.298 0.208 0.136 0.088 ...
# ..$ STD.HDW.MAN : num [1:11, 1:6] -0.4794 -0.289 -0.1769 -0.1105 -0.0665 ...
# ..$ STD.HDW.WRT : num [1:11, 1:6] -0.489 -0.317 -0.23 -0.159 -0.123 ...
# ..$ STD.HDW.CON : num [1:11, 1:6] -0.417 -0.272 -0.193 -0.141 -0.101 ...
# ..$ STD.HDW.TRA : num [1:11, 1:6] -0.3445 -0.1904 -0.12 -0.0926 -0.0715 ...
# ..$ STD.HDW.FIRE: num [1:11, 1:6] 0 -0.015 -0.0245 -0.0294 -0.0304 ...
# $ Upper :List of 6
# ..$ STD.HDW.AGR : num [1:11, 1:6] 1.084 0.69 0.467 0.322 0.234 ...
# ..$ STD.HDW.MAN : num [1:11, 1:6] 0.568 0.377 0.278 0.206 0.16 ...
# ..$ STD.HDW.WRT : num [1:11, 1:6] 0.3814 0.2363 0.1618 0.1193 0.0944 ...
# ..$ STD.HDW.CON : num [1:11, 1:6] 0.273 0.229 0.17 0.153 0.123 ...
# ..$ STD.HDW.TRA : num [1:11, 1:6] 0.349 0.26 0.203 0.159 0.127 ...
# ..$ STD.HDW.FIRE: num [1:11, 1:6] 0 0.0564 0.0734 0.0719 0.063 ...
# $ response : chr [1:6] "STD.HDW.AGR" "STD.HDW.MAN" "STD.HDW.WRT" "STD.HDW.CON" ...
# $ impulse : chr [1:6] "STD.HDW.AGR" "STD.HDW.MAN" "STD.HDW.WRT" "STD.HDW.CON" ...
# $ ortho : logi TRUE
# $ cumulative: logi FALSE
# $ runs : num 100
# $ ci : num 0.05
# $ boot : logi TRUE
# $ model : chr "svarest"
We could separately access the top-level atomic or list elements using atomic_elem
or list_elem
:
# Pull out the top-level atomic elements in the list
str(atomic_elem(pIRF))
# List of 8
# $ response : chr [1:6] "STD.HDW.AGR" "STD.HDW.MAN" "STD.HDW.WRT" "STD.HDW.CON" ...
# $ impulse : chr [1:6] "STD.HDW.AGR" "STD.HDW.MAN" "STD.HDW.WRT" "STD.HDW.CON" ...
# $ ortho : logi TRUE
# $ cumulative: logi FALSE
# $ runs : num 100
# $ ci : num 0.05
# $ boot : logi TRUE
# $ model : chr "svarest"
There are also recursive versions of atomic_elem
and list_elem
named reg_elem
and irreg_elem
which can be used to split nested lists into the atomic and non-atomic parts. These are not covered in this vignette.
vars supplies plot methods for IRF and FEVD objects using base graphics, for example:
plot(pIRF)
would give us 6 charts of all sectoral responses to each sectoral shock. In this section, however, I want to generate nicer plots using ggplot2
and also compute some statistics on the IRF data. Starting with the latter, the code below sums the 10-period impulse response coefficients of each sector in response to each sectoral impulse and stores them in a data.frame:
# Computing the cumulative impact after 10 periods
list_elem(pIRF) %>% # Pull out the sublist elements containing the IRF coefficients + CI's
rapply2d(function(x) round(fsum(x), 2)) %>% # Recursively apply the column-sums to coefficient matrices (could also use colSums)
unlist2d(c("Type", "Impulse")) # Recursively row-bind the result to a data.frame and add identifier columns
# Type Impulse STD.HDW.AGR STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE
# 1 irf STD.HDW.AGR 2.37 0.31 0.52 0.00 0.75 0.11
# 2 irf STD.HDW.MAN 1.04 1.40 1.01 1.23 -0.20 0.75
# 3 irf STD.HDW.WRT 0.33 0.02 0.68 1.25 0.60 0.18
# 4 irf STD.HDW.CON -0.46 -0.10 -1.61 0.67 -0.46 -0.04
# 5 irf STD.HDW.TRA 1.44 2.02 1.33 1.34 2.38 0.82
# 6 irf STD.HDW.FIRE 0.13 -0.12 0.04 -0.14 -0.08 2.85
# 7 Lower STD.HDW.AGR 1.32 -0.40 -0.30 -0.67 -0.39 -0.25
# 8 Lower STD.HDW.MAN -1.24 0.21 -0.67 -1.05 -1.16 0.04
# 9 Lower STD.HDW.WRT -1.69 -1.80 -0.53 -2.20 -1.97 -0.77
# 10 Lower STD.HDW.CON -1.37 -1.83 -1.90 -0.13 -1.27 -0.67
# 11 Lower STD.HDW.TRA -1.03 -1.04 -1.48 -1.64 0.36 -0.62
# 12 Lower STD.HDW.FIRE -0.23 -0.49 -0.35 -0.57 -0.45 2.43
# 13 Upper STD.HDW.AGR 3.44 2.52 2.59 2.53 1.94 1.18
# 14 Upper STD.HDW.MAN 2.10 3.23 2.09 2.66 2.45 1.39
# 15 Upper STD.HDW.WRT 1.28 1.12 2.56 2.44 1.96 0.57
# 16 Upper STD.HDW.CON 1.35 1.61 1.58 3.83 2.06 1.02
# 17 Upper STD.HDW.TRA 1.49 2.08 1.91 1.91 3.30 0.87
# 18 Upper STD.HDW.FIRE 0.46 0.14 0.40 0.27 0.27 3.15
The function rapply2d
used here is very similar to base::rapply
, with the difference that the result is not simplified / unlisted by default and that rapply2d
will treat data.frame’s like atomic objects and apply functions to them. unlist2d
is an efficient generalization of base::unlist
to 2-dimensions, or one could also think of it as a recursive generalization of do.call(rbind, ...)
. It efficiently unlists nested lists of data objects and creates a data.frame with identifier columns for each level of nesting on the left, and the content of the list in columns on the right.
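For some quick intuition, here is a toy sketch of these two functions on a made-up nested list built from mtcars (the names A, B, a and b, and the choice of columns, are purely illustrative):
# A toy nested list of data.frames
l <- list(A = list(a = mtcars[1:3, 1:4], b = mtcars[4:6, 1:4]),
          B = list(a = mtcars[7:9, 1:4]))
# rapply2d applies fmean to each data.frame, preserving the nesting structure
means <- rapply2d(l, fmean)
# unlist2d recursively binds the result, adding one identifier column per level of nesting
unlist2d(means, idcols = c("L1", "L2"))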
The above cumulative coefficients suggest that Agriculture responds mostly to its own shock, and a bit to shocks in Transport and Storage, Wholesale and Retail Trade and Manufacturing. The Finance and Real Estate sector seems even more independent and really only responds to its own dynamics. Manufacturing and Transport and Storage seem to be pretty interlinked with the other broad sectors. Wholesale and Retail Trade and Construction exhibit some strange dynamics (i.e. WRT responds more to the CON shock than to its own shock, and CON responds strongly negatively to the WRT shock).
Let us use ggplot2
to create nice compact plots of the IRF’s and FEVD’s. For this task unlist2d
will again be extremely helpful in creating the data.frame representation required. Starting with the IRF’s, we will discard the upper and lower bounds and just use the impulses converted to a data.frame:
# This binds the matrices to a data.table, after adding integer row-names to them
data <- pIRF$irf %>% # Get only the coefficient matrices, discard the confidence bounds
lapply(setRownames) %>% # Add integer rownames: setRownames(object, nm = seq_row(object))
unlist2d(idcols = "Impulse", # Recursive unlisting to data.table creating a factor id-column
row.names = "Time", # and saving the generated rownames in a variable called 'Time'
id.factor = TRUE, # -> Create Id column ('Impulse') as factor
DT = TRUE) # -> Output as data.table (default is data.frame)
head(data)
# Impulse Time STD.HDW.AGR STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE
# 1: STD.HDW.AGR 1 0.86996584 -0.187344923 -0.08054962 -0.10347400 0.14250867 0.000000000
# 2: STD.HDW.AGR 2 0.53115414 -0.004310463 0.02878308 -0.05730488 0.11385039 -0.027348364
# 3: STD.HDW.AGR 3 0.33034056 0.068887682 0.07763893 -0.02047398 0.09576924 -0.018846740
# 4: STD.HDW.AGR 4 0.21048997 0.088854762 0.09274910 0.00486506 0.08235669 -0.002035014
# 5: STD.HDW.AGR 5 0.13808095 0.085218106 0.09045699 0.02015986 0.07106835 0.012522131
# 6: STD.HDW.AGR 6 0.09352334 0.072843227 0.08031275 0.02789669 0.06095866 0.021967877
# Coercing Time to numeric (from character)
data$Time <- as.numeric(data$Time)
# Using data.table's melt
data <- melt(data, 1:2)
head(data)
# Impulse Time variable value
# 1: STD.HDW.AGR 1 STD.HDW.AGR 0.86996584
# 2: STD.HDW.AGR 2 STD.HDW.AGR 0.53115414
# 3: STD.HDW.AGR 3 STD.HDW.AGR 0.33034056
# 4: STD.HDW.AGR 4 STD.HDW.AGR 0.21048997
# 5: STD.HDW.AGR 5 STD.HDW.AGR 0.13808095
# 6: STD.HDW.AGR 6 STD.HDW.AGR 0.09352334
# Here comes the plot:
ggplot(data, aes(x = Time, y = value, color = Impulse)) +
geom_line(size = I(1)) + geom_hline(yintercept = 0) +
labs(y = NULL, title = "Orthogonal Impulse Response Functions") +
scale_color_manual(values = rainbow(6)) +
facet_wrap(~ variable) +
theme_light(base_size = 14) +
scale_x_continuous(breaks = scales::pretty_breaks(n=7), expand = c(0, 0))+
scale_y_continuous(breaks = scales::pretty_breaks(n=7), expand = c(0, 0))+
theme(axis.text = element_text(colour = "black"),
plot.title = element_text(hjust = 0.5),
strip.background = element_rect(fill = "white", colour = NA),
strip.text = element_text(face = "bold", colour = "grey30"),
axis.ticks = element_line(colour = "black"),
panel.border = element_rect(colour = "black"))
To round things off, below I do the same thing for the FEVD’s:
# Rewriting more compactly...
data <- unlist2d(lapply(pFEVD, setRownames), idcols = "variable", row.names = "Time",
id.factor = TRUE, DT = TRUE)
data$Time <- as.numeric(data$Time)
head(data)
# variable Time STD.HDW.AGR STD.HDW.MAN STD.HDW.WRT STD.HDW.CON STD.HDW.TRA STD.HDW.FIRE
# 1: STD.HDW.AGR 1 0.7746187 0.07681153 0.002830159 0.02636630 0.1193733 0.0000000000
# 2: STD.HDW.AGR 2 0.7553643 0.08057485 0.003928797 0.02675628 0.1330331 0.0003426730
# 3: STD.HDW.AGR 3 0.7403424 0.08383274 0.004869063 0.02678709 0.1434171 0.0007516313
# 4: STD.HDW.AGR 4 0.7294514 0.08639314 0.005594877 0.02665921 0.1508400 0.0010613293
# 5: STD.HDW.AGR 5 0.7219114 0.08828974 0.006117446 0.02649257 0.1559379 0.0012508814
# 6: STD.HDW.AGR 6 0.7168379 0.08964083 0.006476796 0.02634178 0.1593511 0.0013516447
data <- melt(data, 1:2, variable.name = "Sector")
# Here comes the plot:
ggplot(data, aes(x = Time, y = value, fill = Sector)) +
geom_area(position = "fill", alpha = 0.8) +
labs(y = NULL, title = "Forecast Error Variance Decompositions") +
scale_fill_manual(values = rainbow(6)) +
facet_wrap(~ variable) +
theme_linedraw(base_size = 14) +
scale_x_continuous(breaks = scales::pretty_breaks(n=7), expand = c(0, 0))+
scale_y_continuous(breaks = scales::pretty_breaks(n=7), expand = c(0, 0))+
theme(plot.title = element_text(hjust = 0.5),
strip.background = element_rect(fill = "white", colour = NA),
strip.text = element_text(face = "bold", colour = "grey30"))
Both the IRF’s and the FEVD’s show some strange behavior for Manufacturing, Wholesale and Retail Trade, and Construction. The FEVD’s also show little dynamics, suggesting that longer lag-lengths might be appropriate. The most important point of critique for this analysis is the structural identification strategy, which is highly dubious (correlation does not imply causation, and I am also restricting sectoral relationships with a lower correlation to be 0 in the current period). A better method could be to aggregate the World Input-Output Database and use those shares for identification (which would be another very nice collapse exercise, but not for this vignette).
To learn more about collapse, I recommend just examining the documentation help("collapse-documentation")
which is hierarchically organized, extensive and contains lots of examples.
Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65-83). Routledge.
Mundlak, Y. (1978). “On the Pooling of Time Series and Cross Section Data.” Econometrica, 46(1), 69–85.
In the Within data, the overall mean was added back after subtracting out country means to preserve the level of the data; see also section 4.5.↩
qsu
uses a numerically stable online algorithm generalized from Welford’s Algorithm to compute variances.↩
Because missing values are stored as the smallest integer in C++, and the values of the factor are used directly to index result vectors in grouped computations. Subsetting a vector with the smallest integer would break the C++ code of the Fast Statistical Functions and terminate the R session, which must be avoided.↩
You may wonder why with weights the standard-deviations in the group ‘4.0.1’ are 0 while they were NA without weights. This stems from the fact that group ‘4.0.1’ only has one observation, and in the Bessel-corrected estimate of the variance there is an n - 1 in the denominator, which becomes 0 if n = 1, and division by 0 becomes NA in this case (fvar was designed that way to match the behavior of stats::var). In the weighted version the denominator is sum(w) - 1, and if sum(w) is not 1, then the denominator is not 0. The standard-deviation however is still 0 because the sum of squares in the numerator is 0. In other words this means that in a weighted aggregation singleton-groups are not treated like singleton groups unless the corresponding weight is 1.↩
One can also add a weight-argument w = weights
here, but fmin
and fmax
don’t support weights and all S3 methods in this package give errors when encountering unknown arguments. To do a weighted aggregation one would have to either only use fmean
and fsd
, or employ a named list of functions wrapping fmin
and fmax
in a way that additional arguments are silently swallowed.↩
I.e. the most frequent value. If all values inside a group are either all equal or all distinct, fmode returns the first value instead.↩
If the list is unnamed, collap
uses all.vars(substitute(list(FUN1, FUN2, ...)))
to get the function names. Alternatively it is also possible to pass a character vector of function names.↩
BY.grouped_df
is probably only useful together with the expand.wide = TRUE
argument which dplyr does not have, because otherwise dplyr’s summarize
and mutate
are substantially faster on larger data.↩
Included as example data in collapse and summarized in section 1.↩
I noticed there is a panelvar package, but I am more familiar with vars and panelvar can be pretty slow in my experience. We also have about 50 years of data here, so dynamic panel-bias is not a big issue.↩
The vars package also provides convenient extractor functions for some quantities, but get_elem
of course works in a much broader range of contexts.↩