Use with dplyr

Davis Vaughan

2023-01-24

Package motivation

tibbletime on its own has useful functions for manipulating time-based tibbles, but one of its most useful features is how nicely it plays with dplyr. Traditionally, performing grouped analysis over a time period with dplyr (like quarterly / monthly summaries) is doable, but it could be easier and typically requires use of the lubridate package along with the creation of multiple columns to group on. Below is an example of a monthly summary from a daily dataset.

library(tibbletime)
library(dplyr)
library(lubridate)

series <- create_series('2013' ~ '2017', 'day', class = "Date") %>%
  mutate(var = rnorm(n()))

series
## # A time tibble: 1,826 × 2
## # Index:         date
##    date          var
##    <date>      <dbl>
##  1 2013-01-01 -0.442
##  2 2013-01-02 -0.600
##  3 2013-01-03 -0.333
##  4 2013-01-04  1.11 
##  5 2013-01-05  0.971
##  6 2013-01-06  1.33 
##  7 2013-01-07 -0.879
##  8 2013-01-08 -0.158
##  9 2013-01-09 -1.05 
## 10 2013-01-10 -2.50 
## # … with 1,816 more rows
series %>%
  mutate(year = year(date), month = month(date)) %>%
  group_by(year, month) %>%
  summarise(mean_var = mean(var))
## # A tibble: 60 × 3
##     year month mean_var
##    <dbl> <dbl>    <dbl>
##  1  2013     1  -0.148 
##  2  2013     2  -0.213 
##  3  2013     3   0.461 
##  4  2013     4  -0.0984
##  5  2013     5   0.155 
##  6  2013     6  -0.112 
##  7  2013     7  -0.0698
##  8  2013     8   0.167 
##  9  2013     9   0.294 
## 10  2013    10  -0.377 
## # … with 50 more rows

This gets more difficult the more granular you go. Getting 5-minute summaries from minute or second data requires grouping on year, month, day, hour and minute columns.

Index manipulation

With tibbletime, rather than creating new columns to group on, you manipulate your original date column into something that corresponds to the period you are summarising at. The tibbletime way to do this is with collapse_by().

series %>%
  collapse_by("month") %>%
  group_by(date) %>%
  summarise(mean_var = mean(var))
## # A time tibble: 60 × 2
## # Index:         date
##    date       mean_var
##    <date>        <dbl>
##  1 2013-01-31  -0.148 
##  2 2013-02-28  -0.213 
##  3 2013-03-31   0.461 
##  4 2013-04-30  -0.0984
##  5 2013-05-31   0.155 
##  6 2013-06-30  -0.112 
##  7 2013-07-31  -0.0698
##  8 2013-08-31   0.167 
##  9 2013-09-30   0.294 
## 10 2013-10-31  -0.377 
## # … with 50 more rows

While collapse_by() directly manipulates the index column, the lower level collapse_index() function can be used inside of a call to mutate() to modify the index column and then save it as a new column. This can be useful if you don’t want to lose the original index column.

This works for more granular series too. Below we aggregate 5 second level data up to hourly. This is working with a faily sizable ~19 million row data set.

second_series <- create_series('2013' ~ '2015', '5 second')

second_series %>%
  mutate(var = rnorm(n())) %>%
  collapse_by("hour") %>%
  group_by(date) %>%
  summarise(mean_var = mean(var))
## # A time tibble: 26,280 × 2
## # Index:         date
##    date                mean_var
##    <dttm>                 <dbl>
##  1 2013-01-01 00:59:55 -0.00458
##  2 2013-01-01 01:59:55 -0.0516 
##  3 2013-01-01 02:59:55  0.0288 
##  4 2013-01-01 03:59:55  0.00153
##  5 2013-01-01 04:59:55  0.0543 
##  6 2013-01-01 05:59:55 -0.0140 
##  7 2013-01-01 06:59:55 -0.0291 
##  8 2013-01-01 07:59:55  0.0122 
##  9 2013-01-01 08:59:55 -0.0251 
## 10 2013-01-01 09:59:55  0.0255 
## # … with 26,270 more rows

Multiple series

One really powerful benefit of working in the tidyverse is being able to manipulate multiple series at once. Essentially we can create multiple layers of groupings, one for the stocks we are working with (like Facebook and Apple), and one for the period you want to summarise your data at (daily, yearly, etc).

Below we use create_series() to create two dummy hourly price series, combine them, and calculate the OHLC (Open, High, Low, Close) prices per day by first collapsing to daily with collapse_by() to have something to group on.

set.seed(123)

# Create price series of hourly movements for apple and facebook stock.
apple <- create_series('2014' ~ '2016', period = '1 hour') %>%
  mutate(price = 100 + cumsum(rnorm(n(), mean = 0, sd = .5)))

facebook <- create_series('2014' ~ '2016', period = '1 hour') %>%
  mutate(price = 150 + cumsum(rnorm(n(), mean = 0, sd = .5)))

# Bind them together and create a symbol column to group on
price_series <- bind_rows(list(apple = apple, facebook = facebook), .id = "symbol") %>%
  as_tbl_time(date) %>%
  group_by(symbol)

# Collapse to daily and transform to OHLC (Open, High, Low, Close), a 
# common financial transformation
price_series %>%
  collapse_by("day") %>%
  group_by(symbol, date) %>%
  summarise(
    open  = first(price),
    high  = max(price),
    low   = min(price),
    close = last(price)
  ) %>%
  slice(1:5)
## `summarise()` has grouped output by 'symbol'. You can override using the
## `.groups` argument.
## # A time tibble: 10 × 6
## # Index:         date
## # Groups:        symbol [2]
##    symbol   date                 open  high   low close
##    <chr>    <dttm>              <dbl> <dbl> <dbl> <dbl>
##  1 apple    2014-01-01 23:00:00  99.7  102.  99.6  99.9
##  2 apple    2014-01-02 23:00:00  99.6  102.  98.7 101. 
##  3 apple    2014-01-03 23:00:00 101.   103. 101.  101. 
##  4 apple    2014-01-04 23:00:00 102.   104. 101.  103. 
##  5 apple    2014-01-05 23:00:00 104.   105. 101.  101. 
##  6 facebook 2014-01-01 23:00:00 150.   152. 149.  149. 
##  7 facebook 2014-01-02 23:00:00 149.   151. 148.  149. 
##  8 facebook 2014-01-03 23:00:00 150.   150. 147.  148. 
##  9 facebook 2014-01-04 23:00:00 148.   150. 144.  145. 
## 10 facebook 2014-01-05 23:00:00 145.   146. 144.  146.