Parallel Processing

Local Machine

When parallel_processing within forecast_time_series() is set to “local_machine”, each time series (including training models on the entire data set) is run in parallel on the user's local machine, with each time series running on a separate core. Hyperparameter tuning, model refitting, and model averaging run sequentially, since a parallel process is already occupying the machine's cores, one per time series. This works well for data that contains many time series where you only want to run a few simpler models, and in scenarios where cloud computing is not available.
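As a minimal sketch, a local-machine run might look like the example below; the experiment and run names are placeholders, and the sample data preparation mirrors the Spark example later in this article.

# load libraries
library(finnts)
library(dplyr) # provides the pipe used below

# prep sample data
hist_data <- timetk::m4_monthly %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))

run_info <- set_run_info(
  experiment_name = "finn_fcst", 
  run_name = "local_machine_run" # placeholder names
)

# call Finn with local machine parallel processing
forecast_time_series(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3, 
  parallel_processing = "local_machine" # one time series per core
)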

If parallel_processing is set to NULL and inner_parallel is set to TRUE within forecast_time_series(), then each time series runs sequentially, but the hyperparameter tuning, model refitting, and model averaging within each time series run in parallel. This works well for data with a limited number of time series where you want to run extensive back testing and build dozens of models within Finn.
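For example, reusing the run_info and hist_data from the sketch above, the inner-parallel setup is just a change of arguments:

# call Finn with inner parallel processing
forecast_time_series(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3, 
  parallel_processing = NULL, # run each time series sequentially
  inner_parallel = TRUE       # parallelize tuning, refitting, and averaging
)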

Within Azure Using Spark

To leverage the full power of Finn, running within Azure is the best choice for building production-ready forecasts that can easily scale. The most efficient way to run Finn is to set parallel_processing to “spark” within forecast_time_series(). This will run each time series in parallel across a Spark compute cluster.

sparklyr is an R package that lets you run R code across a Spark cluster. A user simply has to connect to a Spark cluster and then run Finn. Below is an example of how you can run Finn using Spark on Azure Databricks. Also check out the growing R support for Spark on Azure Synapse.

# load CRAN libraries
library(finnts)
library(sparklyr)

install.packages("qs")
library(qs)

# connect to spark cluster
options(sparklyr.log.console = TRUE)
options(sparklyr.spark_apply.serializer = "qs") # uses the qs package to improve data serialization before sending to spark cluster

sc <- sparklyr::spark_connect(method = "databricks")

# call Finn with spark parallel processing
hist_data <- timetk::m4_monthly %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))

data_sdf <- sparklyr::copy_to(sc, hist_data, "data_sdf", overwrite = TRUE)

run_info <- set_run_info(
  experiment_name = "finn_fcst", 
  run_name = "spark_run_1", 
  path = "/dbfs/mnt/example/folder" # important that you mount an ADLS path
)

forecast_time_series(
  run_info = run_info,
  input_data = data_sdf,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3, 
  parallel_processing = "spark", 
  return_data = FALSE
)

# return the outputs as a spark data frame
finn_output_tbl <- get_forecast_data(run_info = run_info, 
                                     return_type = "sdf")

The above example runs each time series on a separate core of a Spark cluster. You can also have each time series run on a separate Spark executor (VM) and then leverage all of the cores on that executor to run things like hyperparameter tuning or model refitting in parallel. This creates two levels of parallelization: one at the time series level, and another within a specific time series when doing things like hyperparameter tuning. To do that, set inner_parallel to TRUE in forecast_time_series(). Also make sure you set the number of Spark executor cores to 1, which ensures that only one time series runs on an executor at a time; use the “spark.executor.cores” argument when configuring your Spark connection, either through sparklyr or within the cluster manager of the Azure resource. Finally, use the “num_cores” argument in forecast_time_series() to control how many cores should be used within an executor when running things like hyperparameter tuning.
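As a sketch of that setup, the connection and Finn call might look like the following; the num_cores value is illustrative, and on Databricks the executor core setting may need to be applied through the cluster configuration instead of sparklyr.

# limit each executor to one Spark task so only one time series
# runs on an executor at a time
conf <- sparklyr::spark_config()
conf$spark.executor.cores <- 1

sc <- sparklyr::spark_connect(method = "databricks", config = conf)

# call Finn with two levels of parallelization
forecast_time_series(
  run_info = run_info,
  input_data = data_sdf,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3, 
  parallel_processing = "spark", 
  inner_parallel = TRUE, # second level of parallelism within each executor
  num_cores = 8,         # illustrative; match the core count of your executor VMs
  return_data = FALSE
)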

forecast_time_series() will look for a variable called “sc” to use when submitting tasks to the Spark cluster, so make sure you use that as the variable name when connecting to Spark. It's also important to mount your Spark session to an Azure Data Lake Storage (ADLS) account and provide the mounted path, via set_run_info(), to where you'd like your Finn results to be written.