Using parabar with foreach

Mihai Constantin

16 December, 2024

Introduction

The goal of this article is to provide a minimal example of how to use the parabar and foreach packages together. The foreach package is a popular package that provides syntactic sugar for executing tasks sequentially (i.e., via the %do% operator) or in parallel (i.e., via the %dopar% operator). In this article, I will provide a brief introduction to the foreach package and show how it can be used to run tasks in parallel with the parabar package. If you are not yet familiar with the parabar package, make sure to check out the documentation for information on how to get started.

Overview

In a nutshell, the foreach package provides a way to iterate over a collection of elements. For iterating over the respective collection sequentially, one can use the %do% operator as follows:

# Load the library.
library(foreach)

# For each element.
foreach(i = 1:5) %do% {
    # Do something.
    i * 2
}
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 6
#>
#> [[4]]
#> [1] 8
#>
#> [[5]]
#> [1] 10

In this example, the line

# Load the library.
library(foreach)

loads the foreach package, making all of its functions and operators available in the main session. More interestingly, the call

foreach(i = 1:5)

takes the named argument i = 1:5 provided as input and returns an iterator object of class foreach. Then, the %do% operator is used to execute the expression on the right-hand side of the operator

{
    # Do something.
    i * 2
}

for each element of the iterator object.
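To make the two steps concrete, the sketch below stores the object returned by foreach::foreach in a variable, inspects its class, and only then hands it to the %do% operator. The variable names iteration and doubled are illustrative. For this simple case, the result should match a plain lapply call.

```r
# Load the library.
library(foreach)

# Store the object returned by `foreach` (illustrative name).
iteration <- foreach(i = 1:5)

# Inspect the class of the stored object.
class(iteration)
#> [1] "foreach"

# Apply the expression to each element via `%do%`.
doubled <- iteration %do% {
    i * 2
}

# Compare with the equivalent `lapply` call.
all.equal(doubled, lapply(1:5, function(i) i * 2))
#> [1] TRUE
```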

Note. The foreach::foreach function may take additional arguments that control the behavior of the iteration process, the accumulation of the results, and the task execution. For example, by default, the foreach::foreach function returns the accumulated results as a list. However, foreach::foreach can take a .combine argument that specifies how the results of each iteration should be combined into a single object. Specifying, for instance, .combine = c for the example above instructs foreach::foreach that we expect the results back as a vector instead of a list:

# For each element.
foreach(i = 1:5, .combine = c) %do% {
    # Do something.
    i * 2
}
#> [1]  2  4  6  8 10

Moreover, using the .final argument, we can provide a function that acts on the accumulated results right before they are returned to the user. This is useful when we want to perform a final operation on the results. For example, suppose we want to sum the results of the iterations. We can do this as follows:

# For each element.
foreach(i = 1:5, .combine = c, .final = sum) %do% {
    # Do something.
    i * 2
}
#> [1] 30

As you may have noticed, the arguments that pertain to the behavior of the foreach::foreach function are prefixed with a dot. There are more arguments available. For a complete list, see the documentation for foreach::foreach and the vignette Using the foreach package.
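As one illustration of these additional arguments, the sketch below uses .errorhandling, which the foreach documentation lists as accepting "stop" (the default), "remove", or "pass". The failure on the third iteration is made up purely for the example. With "remove", the failed result is dropped instead of aborting the whole loop.

```r
# Load the library.
library(foreach)

# For each element, dropping the results of failed iterations.
results <- foreach(
    i = 1:5, .combine = c, .errorhandling = "remove"
) %do% {
    # Simulate a failure for one of the tasks (hypothetical condition).
    if (i == 3) stop("Task 3 failed.")

    # Do something.
    i * 2
}

# Show the results.
results
#> [1]  2  4  8 10
```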

Running In Parallel

If we want to run a task in parallel, we need to provide a backend that supports parallelizing the task. Since the foreach package is not a parallelization package per se, it does not provide a backend for parallelizing tasks by default. Instead, it provides a flexible mechanism to register any parallelization backend with it, as long as that backend supports the %dopar% operator.
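A quick way to see this, assuming a fresh R session in which no backend has been registered yet: the %dopar% operator still runs, but foreach falls back to sequential execution and issues a warning saying so.

```r
# Load the library.
library(foreach)

# Check whether a parallel backend has been registered.
getDoParRegistered()
#> [1] FALSE

# Without a backend, `%dopar%` warns and runs the tasks sequentially.
results <- foreach(i = 1:3, .combine = c) %dopar% {
    i * 2
}

# Show the results.
results
#> [1] 2 4 6
```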

The workflow for running a task in parallel with the foreach package involves:

  1. Obtaining a parallelization backend.
  2. Registering the backend with the foreach package.
  3. Running the task in parallel using the %dopar% operator.

While the parabar package provides synchronous and asynchronous parallelization backends, it does not work out of the box with the foreach package. This is where the doParabar package comes into play. The doParabar package encapsulates the necessary logic to adapt parabar backends to work seamlessly with the foreach package.

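At a high level, the doParabar package consists of two main functions:

  1. registerDoParabar, which registers a parabar backend with the foreach package.
  2. doPar, which implements the %dopar% operator for parabar backends.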

Note. Two particularly relevant foreach::foreach arguments in the context of parallelizing R code are .export and .packages. The .export argument specifies the variables that need to be exported to the backend, while the .packages argument specifies the packages that need to be loaded on the backend.

Using doParabar

Unlike other foreach adapter packages out there (e.g., doParallel), the doParabar package does not automatically load other packages. Instead, I recommend explicitly loading the necessary packages in your scripts. In a similar vein, R package developers should add the necessary packages to the Imports field in the DESCRIPTION file of their package. Therefore, the first step in using parabar with foreach is to load the necessary packages:

# Load the packages.
library(doParabar)
library(parabar)
library(foreach)

Next, we proceed by using parabar to create an asynchronous parallelization backend that supports progress tracking as follows:

# Create an asynchronous `parabar` backend.
backend <- start_backend(
    cores = 2, cluster_type = "psock", backend_type = "async"
)

At this point, we have a parallelization backend that we can register with the foreach package. We do this via the registerDoParabar function:

# Register the backend with the `foreach` package.
registerDoParabar(backend)

To verify that the backend has been registered successfully, we can use some of the functions provided by the foreach package to query information about the backend:

# Get the parallel backend name.
getDoParName()
#> [1] "doParabar (AsyncBackend)"
# Check that the parallel backend has been registered.
getDoParRegistered()
#> [1] TRUE
# Get the current version of backend registration.
getDoParVersion()
#> [1] "1.0.0"
# Get the number of cores used by the backend.
getDoParWorkers()
#> [1] 2

Now, we can use the %dopar% operator to run tasks in parallel. For example:

# Define some variables that are unknown to the backend.
x <- 10
y <- 100
z <- "Not to be exported."

# Use the registered backend to run a task in parallel via `foreach`.
results <- foreach(
    i = 1:300, .export = c("x", "y"), .combine = c
) %dopar% {
    # Sleep a bit to simulate a long-running task.
    Sys.sleep(0.01)

    # Compute and return.
    i + x + y
}
#> completed 0 out of 300 tasks [ 0%] [ 0s]
#> ...
#> completed 60 out of 300 tasks [ 20%] [ 1s]
#> ...
#> completed 300 out of 300 tasks [100%] [ 2s]
# Show a few results.
head(results, n = 10)
#>  [1] 111 112 113 114 115 116 117 118 119 120
tail(results, n = 10)
#>  [1] 401 402 403 404 405 406 407 408 409 410

Note. The doParabar package does not automatically export objects (or packages, for that matter) to the backend. While this breaks “tradition” with other foreach adapter packages, it is a deliberate design choice made to encourage users to keep their scripts tidy and be mindful of what they export to the backend (see the .export, .noexport, and .packages arguments of the foreach function).

We can verify that objects are not automatically exported to the backend by checking the value of the z variable on the backend. We expect this call to throw an error, since z was never exported to the backend:

# Verify that the variable `z` was not exported.
try(evaluate(backend, z))
#> Error : ! in callr subprocess.
#> Caused by error in `checkForRemoteErrors(lapply(cl, recvResult))`:
#> ! 2 nodes produced errors; first error: object 'z' not found
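The .packages argument follows the same logic as .export. The sketch below is a self-contained illustration, under the assumption that doParabar attaches the packages listed in .packages on the backend, as the notes in this article describe. It relies on the tools package, which ships with R but is not attached by default, so file_ext should only be found if the package was loaded on the workers. The sketch_backend variable and the file names are made up for the example.

```r
# Load the packages.
library(doParabar)
library(parabar)
library(foreach)

# Create and register a separate backend for this sketch.
sketch_backend <- start_backend(
    cores = 2, cluster_type = "psock", backend_type = "async"
)
registerDoParabar(sketch_backend)

# Some made-up file names.
paths <- c("data.csv", "notes.txt", "model.rds")

# Request that `tools` be loaded on the backend via `.packages`.
extensions <- foreach(
    path = paths, .packages = "tools", .combine = c
) %dopar% {
    # The `file_ext` function comes from the `tools` package.
    file_ext(path)
}

# Stop the backend used for this sketch.
stop_backend(sketch_backend)
```

Calling tools::file_ext with an explicit namespace would also work without .packages, so the argument matters mostly for code that uses unqualified function names.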

Finally, we can stop the backend when we are done with it, as we would normally do:

# Stop the backend.
stop_backend(backend)
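One caveat worth sketching: stop_backend is a parabar function, so it presumably leaves the foreach registration untouched (this is an assumption, not documented behavior quoted here). If later %dopar% calls in the same session should run sequentially rather than target the stopped backend, the foreach::registerDoSEQ function can be used to register the sequential backend that ships with foreach:

```r
# Load the library.
library(foreach)

# Register the sequential backend that ships with `foreach`.
registerDoSEQ()

# Confirm the registration.
getDoParName()
#> [1] "doSEQ"
getDoParWorkers()
#> [1] 1

# The `%dopar%` operator now runs sequentially, without a warning.
results <- foreach(i = 1:3, .combine = c) %dopar% {
    i * 2
}

# Show the results.
results
#> [1] 2 4 6
```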

Conclusion

In this article, I provided a short introduction on how to run tasks in parallel on parabar backends using foreach semantics. This integration is possible via the doParabar package, which provides an implementation for the %dopar% operator (i.e., the doPar function) and a function to register the implementation with the foreach package (i.e., the registerDoParabar function). The source code for the doParabar package can be consulted on GitHub at github.com/mihaiconstantin/doParabar. I kindly welcome any feedback or contributions to improving parabar or doParabar.