Designed to be used and scaled with the tidyverse
The greatest benefit to tidyquant is the ability to easily scale your financial analysis. Scaling is the process of creating an analysis for one security and then extending it to multiple groups. This idea of scaling is incredibly useful to financial analysts because one typically wants to compare many securities to make informed decisions. Fortunately, the tidyquant package integrates with the tidyverse, making scaling super simple!
All tidyquant functions return data in the tibble (tidy data frame) format, which allows for interaction within the tidyverse. This means we can use:
The pipe (%>%) for chaining operations
dplyr and tidyr: select, filter, group_by, nest/unnest, spread/gather, etc.
purrr: mapping functions with map
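As a rough illustration of what this tidyverse interaction looks like, the sketch below chains a few dplyr verbs with the pipe. It does not call tidyquant at all: the tibble, the symbols "AAA" and "BBB", and the prices are made up, and are only shaped like the output of tq_get.

```r
library(dplyr)
library(tibble)

# Toy prices shaped like tq_get() output (symbols and values are made up)
prices <- tibble(
    symbol   = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
    date     = as.Date(rep(c("2016-01-04", "2016-01-05", "2016-01-06"), times = 2)),
    adjusted = c(100, 102, 101, 50, 49, 52)
)

# Chain operations with the pipe: filter rows, group by symbol, then summarise
price_summary <- prices %>%
    filter(date >= as.Date("2016-01-05")) %>%
    group_by(symbol) %>%
    summarise(mean.adjusted = mean(adjusted), n.days = n())

price_summary
```

Because the result is again a tibble, it can be piped straight into further dplyr, tidyr, or purrr steps.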
We’ll go through some useful scaling techniques for getting and manipulating groups of data.
Load the tidyquant package to get started.
# Loads tidyquant, tidyverse, lubridate, xts, quantmod, TTR
library(tidyquant)
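A very basic example is retrieving the stock prices for multiple stocks. There are three primary ways to do this: pipe a character vector of symbols to tq_get, pipe a data frame with the symbols in the first column to tq_get, or use mutate with purrr's map to apply tq_get to each symbol.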
c("AAPL", "GOOG", "FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 756 × 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2016-01-04 102.61 105.37 102.00 105.35 67649400 102.61218
## 2 AAPL 2016-01-05 105.75 105.85 102.41 102.71 55791000 100.04079
## 3 AAPL 2016-01-06 100.56 102.37 99.87 100.70 68457400 98.08303
## 4 AAPL 2016-01-07 98.68 100.13 96.43 96.45 81094400 93.94347
## 5 AAPL 2016-01-08 98.55 99.11 96.76 96.96 70798000 94.44022
## 6 AAPL 2016-01-11 98.97 99.06 97.34 98.53 49739400 95.96942
## 7 AAPL 2016-01-12 100.55 100.69 98.84 99.96 49154200 97.36226
## 8 AAPL 2016-01-13 100.32 101.19 97.30 97.39 62439600 94.85905
## 9 AAPL 2016-01-14 97.96 100.48 95.74 99.52 63170100 96.93369
## 10 AAPL 2016-01-15 96.20 97.71 95.36 97.13 79010000 94.60580
## # ... with 746 more rows
The output is a single-level tibble with all of the stock prices in one tibble. The auto-generated column name is “symbol”, which can be pre-emptively renamed by giving the vector a name (e.g. stocks <- c("AAPL", "GOOG", "FB")) and then piping to tq_get.
First, get a stock list in data frame format either by making the tibble or retrieving from tq_index / tq_exchange. The stock symbols must be in the first column.
stock_list <- tibble(stocks = c("AAPL", "JPM", "CVX"),
industry = c("Technology", "Financial", "Energy"))
stock_list
## # A tibble: 3 × 2
## stocks industry
## <chr> <chr>
## 1 AAPL Technology
## 2 JPM Financial
## 3 CVX Energy
Second, send the stock list to tq_get. Notice how the symbol and industry columns are automatically expanded to the length of the stock prices.
stock_list %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 756 × 9
## stocks industry date open high low close volume
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL Technology 2016-01-04 102.61 105.37 102.00 105.35 67649400
## 2 AAPL Technology 2016-01-05 105.75 105.85 102.41 102.71 55791000
## 3 AAPL Technology 2016-01-06 100.56 102.37 99.87 100.70 68457400
## 4 AAPL Technology 2016-01-07 98.68 100.13 96.43 96.45 81094400
## 5 AAPL Technology 2016-01-08 98.55 99.11 96.76 96.96 70798000
## 6 AAPL Technology 2016-01-11 98.97 99.06 97.34 98.53 49739400
## 7 AAPL Technology 2016-01-12 100.55 100.69 98.84 99.96 49154200
## 8 AAPL Technology 2016-01-13 100.32 101.19 97.30 97.39 62439600
## 9 AAPL Technology 2016-01-14 97.96 100.48 95.74 99.52 63170100
## 10 AAPL Technology 2016-01-15 96.20 97.71 95.36 97.13 79010000
## # ... with 746 more rows, and 1 more variables: adjusted <dbl>
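One way to picture this column expansion is as unnesting a list-column: each stock's descriptive columns are repeated once per price row. The sketch below uses tidyr::unnest on made-up data to show the effect. It is an analogy only, not a description of how tq_get is implemented, and the symbols and values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stock list with one nested price tibble per symbol (made-up values)
nested_list <- tibble(
    stocks   = c("AAA", "BBB"),
    industry = c("Technology", "Energy"),
    prices   = list(
        tibble(date  = as.Date(c("2016-01-04", "2016-01-05")),
               close = c(10, 11)),
        tibble(date  = as.Date(c("2016-01-04", "2016-01-05", "2016-01-06")),
               close = c(20, 21, 19))
    )
)

# Unnesting repeats stocks and industry once for every price row
expanded <- nested_list %>%
    unnest(cols = prices)

expanded
```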
Get an index…
tq_index("DOWJONES")
## # A tibble: 65 × 2
## symbol company
## <chr> <chr>
## 1 MMM 3M
## 2 ALK ALASKA AIR GROUP
## 3 AAL AMERICAN AIRLINES GROUP INC.
## 4 AEP AMERICAN ELECTRIC POWER
## 5 AXP AMERICAN EXPRESS
## 6 AWK AMERICAN WATER WORKS
## 7 AAPL APPLE
## 8 CAR AVIS BUDGET GROUP
## 9 CAT CATERPILLAR
## 10 CNP CENTERPOINT ENERGY
## # ... with 55 more rows
…or, get an exchange.
tq_exchange("NYSE")
## # A tibble: 3,161 × 7
## symbol company last.sale.price market.cap ipo.year
## <chr> <chr> <dbl> <chr> <dbl>
## 1 DDD 3D Systems Corporation 16.92 $1.9B NA
## 2 MMM 3M Company 183.41 $109.35B NA
## 3 WBAI 500.com Limited 13.14 $545.28M 2013
## 4 WUBA 58.com Inc. 33.70 $4.88B 2013
## 5 AHC A.H. Belo Corporation 6.40 $138.73M NA
## 6 ATEN A10 Networks, Inc. 9.76 $655.93M 2014
## 7 AAC AAC Holdings, Inc. 8.06 $191.08M 2014
## 8 AIR AAR Corp. 33.95 $1.17B NA
## 9 AAN Aaron's, Inc. 29.52 $2.14B NA
## 10 ABB ABB Ltd 23.01 $49.15B NA
## # ... with 3,151 more rows, and 2 more variables: sector <chr>,
## # industry <chr>
Send the index or exchange to tq_get. Important Note: This can take several minutes depending on the size of the index or exchange, which is why only the first three stocks are evaluated in the vignette.
tq_index("DOWJONES") %>%
slice(1:3) %>%
tq_get(get = "stock.prices")
## # A tibble: 7,650 × 9
## symbol company date open high low close volume adjusted
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MMM 3M 2007-01-03 77.53 78.85 77.38 78.26 3781500 59.92042
## 2 MMM 3M 2007-01-04 78.40 78.41 77.45 77.95 2968400 59.68306
## 3 MMM 3M 2007-01-05 77.89 77.90 77.01 77.42 2765200 59.27726
## 4 MMM 3M 2007-01-08 77.42 78.04 76.97 77.59 2434500 59.40742
## 5 MMM 3M 2007-01-09 78.00 78.23 77.44 77.68 1896800 59.47633
## 6 MMM 3M 2007-01-10 77.31 77.96 77.04 77.85 1787500 59.60649
## 7 MMM 3M 2007-01-11 78.05 79.03 77.88 78.65 2372500 60.21902
## 8 MMM 3M 2007-01-12 78.41 79.50 78.22 79.36 2582200 60.76264
## 9 MMM 3M 2007-01-16 79.48 79.62 78.92 79.56 2526600 60.91577
## 10 MMM 3M 2007-01-17 79.33 79.51 78.75 78.91 2711300 60.41810
## # ... with 7,640 more rows
You can use any applicable “getter” to get data for every stock in an index or an exchange! This includes: “stock.prices”, “key.ratios”, “key.stats”, “financials”, and more.
We can pipe a tibble of stock symbols to a mutation that maps the tq_get(get = "stock.prices") function. The result is all of the stock prices in nested format.
tibble(symbol = c("AAPL", "GOOG", "AMZN", "FB")) %>%
mutate(stock.prices = map(.x = symbol, ~ tq_get(.x, get = "stock.prices")))
## # A tibble: 4 × 2
## symbol stock.prices
## <chr> <list>
## 1 AAPL <tibble [2,550 × 7]>
## 2 GOOG <tibble [2,550 × 7]>
## 3 AMZN <tibble [2,550 × 7]>
## 4 FB <tibble [1,195 × 7]>
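Once prices sit in a list-column like this, purrr's typed map functions can pull one value per nested tibble. The following sketch assumes a small hand-built nested tibble in place of real tq_get output; the symbols and prices are invented for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hand-built nested tibble standing in for nested tq_get() output (made-up values)
nested_prices <- tibble(
    symbol = c("AAA", "BBB"),
    stock.prices = list(
        tibble(close = c(10, 11, 12)),
        tibble(close = c(20, 19))
    )
)

# Map over the list-column: one integer and one double per nested tibble
nested_summary <- nested_prices %>%
    mutate(
        n.rows     = map_int(stock.prices, nrow),
        last.close = map_dbl(stock.prices, ~ dplyr::last(.x$close))
    )

nested_summary
```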
In financial analysis, it’s very common to need data from various sources to combine in an analysis. For this reason multiple get options (“compound getters”) can be used to return a “compound get”. A quick example:
c("AAPL", "GOOG") %>%
tq_get(get = c("stock.prices", "financials"))
## # A tibble: 2 × 3
## symbol stock.prices financials
## <chr> <list> <list>
## 1 AAPL <tibble [2,550 × 7]> <tibble [3 × 3]>
## 2 GOOG <tibble [2,550 × 7]> <tibble [3 × 3]>
This returns the stock prices and financials for each stock as one nested data frame! Any of the get options that accept stock symbols can be used in this manner: "stock.prices", "financials", "key.ratios", "key.stats", "dividends", and "splits".
This capability becomes incredibly useful when combined with purrr function mapping, which is discussed in Manipulating Financial Data with purrr.
Once you get the data, you typically want to do something with it. You can easily do this at scale. Let’s get the yearly returns for multiple stocks using tq_transform. First, get the prices. Second, use group_by to group by stock symbol. Third, apply the transformation. We can do this in one easy workflow:
c("AAPL", "GOOG", "FB") %>%
    tq_get(get = "stock.prices", from = "2012-01-01", to = "2017-01-01") %>%
    group_by(symbol) %>%
    tq_transform(Ad, transform_fun = periodReturn, period = "yearly",
                 col_rename = "yearly.returns") %>%
    ggplot(aes(x = year(date), y = yearly.returns, fill = symbol)) +
    geom_bar(position = "dodge", stat = "identity") +
    scale_y_continuous(labels = scales::percent) +
    scale_x_continuous(breaks = seq(2008, 2017, by = 1)) +
    labs(title = "AAPL, GOOG, FB: Annual Returns",
         subtitle = "Transforming using quantmod functions is easy!",
         x = "") +
    theme(legend.position = "bottom")
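To make the grouped transformation concrete, here is a hand-rolled sketch of arithmetic yearly returns using only dplyr. It assumes made-up year-end adjusted prices for two hypothetical symbols, and it is a simplification: quantmod's periodReturn may treat the first period differently, whereas this version simply leaves it as NA.

```r
library(dplyr)
library(tibble)

# Made-up year-end adjusted prices for two hypothetical symbols
year_end <- tibble(
    symbol   = rep(c("AAA", "BBB"), each = 3),
    year     = rep(2014:2016, times = 2),
    adjusted = c(100, 110, 99, 50, 60, 75)
)

# Within each symbol, return = this year's price / last year's price - 1
yearly_returns <- year_end %>%
    group_by(symbol) %>%
    arrange(year, .by_group = TRUE) %>%
    mutate(yearly.returns = adjusted / lag(adjusted) - 1) %>%
    ungroup()

yearly_returns
```

Grouping before the mutate is what keeps the lag from crossing from one symbol into the next.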
Eventually you will want to begin modeling at scale! One of the best features of the tidyverse is the ability to map functions to nested tibbles using purrr. From the Many Models chapter of “R for Data Science”, we can apply the same modeling workflow to financial analysis, using a two-step workflow: analyze a single stock, then scale to many stocks.
Let’s go through an example to illustrate. In our hypothetical situation, we want to compare the mean monthly log returns (MMLR).
First, let’s come up with a function to help us collect log returns. The function below performs three operations internally. It first gets the stock prices using tq_get(). Then, it transforms the stock prices to period returns using tq_transform(). We add the type = "log" and period = "monthly" arguments to ensure we retrieve a tibble of monthly log returns. Last, we take the mean of the monthly returns to get MMLR.
my_stock_analysis_fun <- function(stock.symbol) {
    period.returns <- stock.symbol %>%
        tq_get(get = "stock.prices") %>%
        tq_transform(ohlc_fun = Ad, transform_fun = periodReturn,
                     type = "log", period = "monthly")
    mean(period.returns$monthly.returns)
}
And, let’s test it out. We now have the mean monthly log returns over the past ten years.
my_stock_analysis_fun("AAPL")
## [1] 0.0206807
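For intuition about what the MMLR measures, the base R sketch below computes it by hand on a short vector of made-up month-end prices. The numbers are illustrative only and are not AAPL data.

```r
# Made-up month-end adjusted prices for one hypothetical stock
adjusted <- c(100, 105, 103, 110, 121)

# Monthly log returns: log(P_t / P_{t-1})
log_returns <- diff(log(adjusted))

# Mean monthly log return (MMLR)
mmlr <- mean(log_returns)

# Log returns add up over time, so the mean also equals
# log(last price / first price) divided by the number of periods
mmlr_check <- log(adjusted[length(adjusted)] / adjusted[1]) / length(log_returns)

c(mmlr = mmlr, mmlr_check = mmlr_check)
```

The second calculation shows why the MMLR depends only on the first and last price and the number of periods, which makes it a convenient single number for comparing stocks.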
Now that we have one stock down, we can scale to many stocks. For brevity, we’ll randomly sample ten stocks from the S&P500 with a call to dplyr::sample_n().
set.seed(100)
stocks <- tq_index("SP500") %>%
sample_n(10)
stocks
## # A tibble: 10 × 2
## symbol company
## <chr> <chr>
## 1 EMC EMC
## 2 DVN DEVON ENERGY
## 3 MNK MALLINCKRODT PLC
## 4 AIG AMERICAN INTL
## 5 INTC INTEL
## 6 IVZ INVESCO
## 7 SWN SOUTHWESTERN ENERGY
## 8 FLS FLOWSERVE
## 9 LMT LOCKHEED MARTIN
## 10 CNP CENTERPOINT ENERGY
We can now apply our analysis function to the stocks using dplyr::mutate and purrr::map_dbl. The mutate() function adds a column to our tibble, and the map_dbl() function maps our my_stock_analysis_fun to our tibble of stocks using the symbol column.
stocks <- stocks %>%
mutate(mmlr = map_dbl(symbol, my_stock_analysis_fun)) %>%
arrange(desc(mmlr))
stocks
## # A tibble: 10 × 3
## symbol company mmlr
## <chr> <chr> <dbl>
## 1 LMT LOCKHEED MARTIN 0.011325008
## 2 FLS FLOWSERVE 0.009926539
## 3 CNP CENTERPOINT ENERGY 0.007445493
## 4 INTC INTEL 0.007381266
## 5 EMC EMC 0.007208655
## 6 IVZ INVESCO 0.004901501
## 7 MNK MALLINCKRODT PLC 0.003425571
## 8 DVN DEVON ENERGY -0.002154825
## 9 SWN SOUTHWESTERN ENERGY -0.005299089
## 10 AIG AMERICAN INTL -0.023569531
And, we’re done! We now have the MMLR for 10 years of stock data for 10 stocks. And, we can easily extend this to larger lists or stock indexes. For example, the entire S&P500 could be analyzed by removing the sample_n() following the call to tq_index("SP500").
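The same mutate plus map_dbl pattern can be tried offline. In the sketch below, toy_prices and toy_analysis_fun are hypothetical stand-ins for tq_get and my_stock_analysis_fun, and all prices are made up.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up price histories keyed by hypothetical symbol, standing in for tq_get()
toy_prices <- list(
    AAA = c(100, 105, 103, 110),
    BBB = c(50, 48, 47, 45),
    CCC = c(20, 22, 25, 30)
)

# Single-stock analysis: mean log return for one symbol
toy_analysis_fun <- function(stock.symbol) {
    mean(diff(log(toy_prices[[stock.symbol]])))
}

# Scale to many symbols: one double per symbol, sorted from best to worst
toy_stocks <- tibble(symbol = names(toy_prices)) %>%
    mutate(mmlr = map_dbl(symbol, toy_analysis_fun)) %>%
    arrange(desc(mmlr))

toy_stocks
```

map_dbl is a reasonable choice here because the analysis function returns exactly one number per symbol; it raises an error if any call returns something else.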
Eventually you will run into a stock index, stock symbol, FRED data code, etc that cannot be retrieved. Possible reasons are:
This becomes painful when scaling if the functions return errors. So, the tq_get() function is designed to handle errors gracefully. What this means is that an NA value is returned, along with a gentle warning, when an error is generated.
tq_get("XYZ", "stock.prices")
## [1] NA
There are pros and cons to this approach. You may not agree with it, but I believe it helps in the long run. Just be aware of what happens:
Pros: Long running scripts are not interrupted because of one error
Cons: Errors can be inadvertently handled or flow downstream if the user does not read the warnings
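The general pattern of turning an error into a warning plus an NA can be sketched in base R with tryCatch. This is not tidyquant's actual implementation: toy_getter and safe_get are hypothetical names, and the "data source" is a hard-coded list of made-up prices.

```r
# Hypothetical getter that fails for unknown symbols (stands in for a real data source)
toy_getter <- function(symbol) {
    known <- list(AAA = c(100, 101), BBB = c(50, 49))
    if (!symbol %in% names(known)) stop("symbol not found: ", symbol)
    known[[symbol]]
}

# Wrap the getter so an error becomes a warning plus an NA return value
safe_get <- function(symbol) {
    tryCatch(
        toy_getter(symbol),
        error = function(e) {
            warning("Error at ", symbol, ": ", conditionMessage(e),
                    ". Returning NA.", call. = FALSE)
            NA
        }
    )
}

good <- safe_get("AAA")
# suppressWarnings() is used here only to keep the example output quiet
bad <- suppressWarnings(safe_get("XYZ"))
```

A long loop over many symbols keeps running with this kind of wrapper, which is the "pro" above. The "con" is equally visible: if the warning is ignored, the NA silently flows into later steps.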
Let’s see an example when using tq_get() to get the stock prices for a long list of stocks with one BAD APPLE. The argument complete_cases comes in handy. The default is TRUE, which removes “bad apples” so future analyses have complete cases to compute on. Note the gentle warning stating that an error occurred and was dealt with by removing the rows from the results.
c("AAPL", "GOOG", "BAD APPLE") %>%
tq_get(get = "stock.prices", complete_cases = TRUE)
## Warning in value[[3L]](cond): Error at BAD APPLE during call to get =
## 'stock.prices'. Removing BAD APPLE.
## # A tibble: 5,100 × 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2007-01-03 86.29 86.58 81.90 83.80 309579900 10.85709
## 2 AAPL 2007-01-04 84.05 85.95 83.82 85.66 211815100 11.09807
## 3 AAPL 2007-01-05 85.77 86.20 84.40 85.05 208685400 11.01904
## 4 AAPL 2007-01-08 85.96 86.53 85.28 85.47 199276700 11.07345
## 5 AAPL 2007-01-09 86.45 92.98 85.15 92.57 837324600 11.99333
## 6 AAPL 2007-01-10 94.75 97.80 93.45 97.00 738220000 12.56728
## 7 AAPL 2007-01-11 95.94 96.78 95.10 95.80 360063200 12.41180
## 8 AAPL 2007-01-12 94.59 95.06 93.23 94.62 328172600 12.25892
## 9 AAPL 2007-01-16 95.68 97.25 95.45 97.10 311019100 12.58023
## 10 AAPL 2007-01-17 97.56 97.60 94.82 94.95 411565000 12.30168
## # ... with 5,090 more rows
Now, switching to complete_cases = FALSE will retain any errors as NA values in a nested data frame. Notice that the error message and output change. The error message now states that the NA values exist in the output and the return is a “nested” data structure.
c("AAPL", "GOOG", "BAD APPLE") %>%
tq_get(get = "stock.prices", complete_cases = FALSE)
## Warning in value[[3L]](cond): Error at BAD APPLE during call to get =
## 'stock.prices'.
## Warning in value[[3L]](cond): Returning as nested data frame.
## # A tibble: 3 × 2
## symbol stock.prices
## <chr> <list>
## 1 AAPL <tibble [2,550 × 7]>
## 2 GOOG <tibble [2,550 × 7]>
## 3 BAD APPLE <lgl [1]>
In both cases, the prudent user will review the warnings to determine what happened and whether or not this is acceptable. In the complete_cases = FALSE example, if the user attempts to perform downstream computations at scale, the computations will likely fail, grinding the analysis to a halt. But, the advantage is that the user will more easily be able to filter to the problem children to determine what happened and decide whether this is acceptable or not.
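Filtering to the problem children in a nested result could look like the sketch below. It assumes a hand-built tibble shaped like the complete_cases = FALSE output, with an NA where retrieval failed; the symbols and prices are made up.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hand-built stand-in for a nested result with one failed retrieval (made-up values)
nested_result <- tibble(
    symbol = c("AAA", "BBB", "BAD APPLE"),
    stock.prices = list(
        tibble(close = c(10, 11)),
        tibble(close = c(20, 19)),
        NA
    )
)

# Rows whose list-column entry is not a data frame are the failed retrievals
problems <- nested_result %>%
    filter(!map_lgl(stock.prices, is.data.frame))

# Keep only the rows that are safe to pass to downstream computations
clean <- nested_result %>%
    filter(map_lgl(stock.prices, is.data.frame))

problems$symbol
```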