Sometimes it is not sufficient to parallelize on a single computer - it cannot provide all of the compute power we are looking for. When we hit this limit, a natural next step is to look at other computers near us, e.g. desktops in an office or other computers we have access to remotely. In this vignette, we will cover how to run parallel R workers on other machines.

Sometimes we distinguish between local and remote machines, where local machines are machines on the same local area network (LAN) that might share a common file system. Remote machines are machines on a different network that do not share a common file system with the machine running the main R session. In most cases the distinction between local and remote machines does not matter, but in some cases we can take advantage of workers being local.

Regardless of whether we run parallel workers on local or remote machines, we need a way to connect to the machines and launch R on them.
The most common approach to connect to another machine is via Secure Shell (SSH). Linux, macOS, and MS Windows all have a built-in SSH client called ssh. Assume we have another Linux machine called n1.remote.org, that it can be accessed via SSH, and that we have an account alice on that machine. For these instructions, it does not matter whether n1.remote.org is on our local network (LAN) or a remote machine on the internet. Also, to make it clear that we do not have to have the same username on n1.remote.org and on our local machine, we will use ally as the username on our local machine.

To access the alice user account on n1.remote.org from our local computer, we open a terminal on the local computer and then SSH to the other machine as:
{ally@local}$ ssh alice@n1.remote.org
alice@n1.remote.org's password: *************
{alice@n1}$
The commands to call are what follows after the prompt. The prompt on our local machine is {ally@local}$, which tells us that our username is ally and the name of the local machine is local. The prompt on the n1.remote.org machine is {alice@n1}$, which tells us that our username on that machine is alice and that the machine is called n1 on that system.
To return to our local machine, exit the SSH shell by typing exit;
{alice@n1}$ exit
{ally@local}$
If we get this far, we have confirmed that we have SSH access to this machine.
Launching parallel R workers is typically done automatically in the background, which means it is cumbersome, or even impossible, to enter the SSH password for each machine we wish to connect to. The solution is to configure SSH to connect with public-private keys, which pre-establish SSH authentication between the main machine and the machine to connect to. As this is common practice when working with SSH, there are numerous online tutorials explaining how to configure public-private SSH key pairs. Please consult one of them for the details, but the gist is to use (i) ssh-keygen to generate the public-private SSH keys on your local machine, and then (ii) ssh-copy-id to deploy the public key on the machine you want to connect to.
Step 1: Generate public-private SSH keys locally
{ally@local}$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ally/.ssh/id_rsa):
Created directory '/home/ally/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/ally/.ssh/id_rsa
Your public key has been saved in /home/ally/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:Sx48uXZTUL12SKKUzWB77e/Pm3TifqrDIbOnJ0pEWHY ally@local
The key's randomart image is:
+---[RSA 3072]----+
| o E=.. |
| + ooo+.o |
| . ..o..o.o |
| o ..o .+ .|
| S .... |
| + =o.. . |
| * o= ...o|
| o .o.=..++|
| ...=.++=*|
+----[SHA256]-----+
Step 2: Copy the public SSH key to the other machine
{ally@local}$ ssh-copy-id alice@n1.remote.org
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ally/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
alice@n1.remote.org's password: *************
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'alice@n1.remote.org'"
and check to make sure that only the key(s) you wanted were added.
At this point, we should be able to SSH to the other machine without having to enter a password;
{ally@local}$ ssh alice@n1.remote.org
{alice@n1}$
Type exit to return to your local machine.
Note, if you later want to connect to other machines, e.g. n2.remote.org or hpc.my-university.edu, you may re-use the above generated keys for those systems too. In other words, you do not have to use ssh-keygen to generate new keys for those machines.
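For instance, assuming we also have an account alice on n2.remote.org, we can deploy the already-generated public key there in the same way, entering that machine's password one last time:

{ally@local}$ ssh-copy-id alice@n2.remote.org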
In order to run parallel R workers on another machine, R (i) needs to be installed on that machine, and (ii) should ideally be readily available by calling Rscript. Parallel R workers are launched via Rscript, instead of the more commonly known R command - both come with all R installations, i.e. if you have one of them, you have the other too.

To verify that R is installed on the other machine, SSH to the machine and call Rscript --version;
{ally@local}$ ssh alice@n1.remote.org
{alice@n1}$ Rscript --version
Rscript (R) version 4.4.2 (2024-10-31)
If you get:
{alice@n1}$ Rscript --version
Rscript: command not found
then R is either not installed on that machine, or it cannot be found. If it is installed, but cannot be found, make sure that the environment variable PATH is configured properly on that machine.
With password-less SSH access and R being available on the other machine, we should be able to SSH into the other machine and query the R version in a single call:
{ally@local}$ ssh alice@n1.remote.org Rscript --version
Rscript (R) version 4.4.2 (2024-10-31)
{ally@local}$
This is all that is needed to launch one or more parallel R workers on machine n1.remote.org running under user alice. We can test this from within R with the parallelly package using:
{ally@local}$ R --quiet
library(parallelly)
cl <- makeClusterPSOCK("n1.remote.org", user = "alice")
print(cl)
#> Socket cluster with 1 nodes where 1 node is on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu)
parallel::stopCluster(cl)
If you want to run parallel workers on other machines, repeat the above for each machine. After this, you will be able to launch parallel R workers on these machines with little effort.
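For instance, assuming our SSH key has also been deployed to n2.remote.org, where we likewise have the account alice, we can verify that machine the same way; the version reported will of course depend on what is installed there:

{ally@local}$ ssh alice@n2.remote.org Rscript --version
Rscript (R) version 4.4.2 (2024-10-31)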
Some machines do not use the default port 22 to answer SSH connection requests. If the machine uses another port, say, port 2201, then we can specify that via option -p port when we connect to it, e.g.
{ally@local}$ ssh -p 2201 alice@n1.remote.org
In R, we can specify argument port = port as in:
cl <- makeClusterPSOCK("n1.remote.org", port = 2201, user = "alice")
Now, it can be tedious to have to remember custom SSH ports and usernames when setting up remote workers in R. Having such details in the R script also adds noise and distraction, not to mention that hardcoding a specific username into the code makes the script less reproducible for other users - they would need to change the code to match their username. One way to avoid having to give specific SSH options when calling ssh in the terminal, or makeClusterPSOCK() in R, is to configure these settings in SSH itself. This can be done via a file called ~/.ssh/config on your local machine. This file does not exist by default, so you would have to create it yourself, if missing. It is a plain text file, so you should use a plain text editor to create and edit it.
To configure SSH to use port 2201 and username alice whenever connecting to n1.remote.org, the ~/.ssh/config file should contain the following entry:
Host n1.remote.org
  User alice
  Port 2201
With this, we can connect to n1.remote.org by just using:
{ally@local}$ ssh n1.remote.org
{alice@n1}$
SSH will then connect to the machine as if we had also specified -p 2201 and -l alice. These settings will also be picked up when we connect via R, meaning the following will also work:
cl <- makeClusterPSOCK("n1.remote.org")
To achieve the same for other machines, add another entry for them, e.g.
Host n1.remote.org
  User alice
  Port 2201

Host n2.remote.org
  User alice
  Port 2201

Host hpc.my-university.edu
  User alice.bobson
When hosts on the same system share the same settings, one can use globbing to configure them the same way. For instance, the above can be shortened to:
Host n?.remote.org
  User alice
  Port 2201

Host hpc.my-university.edu
  User alice.bobson
Being able to connect to remote machines by just specifying their hostnames is convenient and also simplifies the R code. Because of this, we recommend also setting up ~/.ssh/config.
Our first example sets up two parallel workers on the remote machine n1.remote.org. For this to work, we need SSH access to the machine, and it must have R installed, as explained in the above section. Contrary to local parallel workers, the number of parallel workers on a remote machine is specified by repeating the machine name that number of times;
library(parallelly)
workers <- c("n1.remote.org", "n1.remote.org")
cl <- makeClusterPSOCK(workers, user = "alice")
print(cl)
#> Socket cluster with 2 nodes where 2 nodes are on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu).
Comment: In the parallel package, a parallel worker is referred to as a parallel node, or node for short, which is why we use the same term in the parallelly package.
Note, contrary to parallel workers running on the local machine, parallel workers on remote machines are launched sequentially, that is, one after the other. Because of this, the setup time for a remote parallel cluster will increase linearly with the number of remote parallel workers.
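To get a feel for how long the setup takes, a minimal sketch, assuming the same n1.remote.org setup as above, is to wrap the call in system.time():

library(parallelly)
## Two workers on the same remote machine; they are launched one at a time
workers <- rep("n1.remote.org", 2)
system.time(cl <- makeClusterPSOCK(workers, user = "alice"))
parallel::stopCluster(cl)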
Technical details: If we were to add verbose = TRUE to makeClusterPSOCK(), we would learn that the parallel workers are launched in the background by R using something like:
'/usr/bin/ssh' -R 11058:localhost:11058 -l alice n1.remote.org Rscript ...
'/usr/bin/ssh' -R 11059:localhost:11059 -l alice n1.remote.org Rscript ...
This tells us that there is one active SSH connection per parallel worker. It also reveals that each of these connections uses a so-called reverse tunnel, which is used to establish a unique communication channel between the main R process and the corresponding parallel worker. It is also this use of reverse tunneling that avoids having to configure dynamic DNS (DDNS) and port-forwarding in our local firewalls, which is cumbersome and requires administrative rights. When using parallelly, there is no need for administrative rights - any non-privileged user can launch remote parallel R workers.
This example sets up a parallel worker on each of two remote machines, n1.remote.org and n2.remote.org. It works very similarly to the previous example, but now the two SSH connections go to two different machines rather than the same machine.
library(parallelly)
workers <- c("n1.remote.org", "n2.remote.org")
cl <- makeClusterPSOCK(workers, user = "alice")
print(cl)
#> Socket cluster with 2 nodes where 1 node is on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu),
#> 1 node is on host 'n2.remote.org' (R version 4.4.2 (2024-10-31),
#> platform x86_64-pc-linux-gnu)
Technical details: If we were to add verbose = TRUE also in this case, we would see:
'/usr/bin/ssh' -R 11464:localhost:11464 -l alice n1.remote.org Rscript ...
'/usr/bin/ssh' -R 11465:localhost:11464 -l alice n2.remote.org Rscript ...
Recall, if we have configured SSH to pick up the username alice from ~/.ssh/config on our local machine, as shown in the previous section, we could have skipped the user argument, and just used:
workers <- c("n1.remote.org", "n2.remote.org")
cl <- makeClusterPSOCK(workers)
Note how these instructions for setting up a parallel cluster on these two machines would be identical for another user who has configured their personal ~/.ssh/config file.
Now that we understand that we control the number of parallel workers on a specific machine by replicating the machine name, we also know how to launch different numbers of parallel workers on different machines. From now on, we will also assume that the remote username no longer has to be specified, because it has already been configured via the ~/.ssh/config file. With this, we can set up two parallel workers on n1.remote.org and one on n2.remote.org, by:
library(parallelly)
workers <- c("n1.remote.org", "n1.remote.org", "n2.remote.org")
cl <- makeClusterPSOCK(workers)
print(cl)
#> Socket cluster with 3 nodes where 2 nodes are on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu),
#> 1 node is on host 'n2.remote.org' (R version 4.4.2 (2024-10-31),
#> platform x86_64-pc-linux-gnu)
Again, the user argument does not have to be specified, because it is configured in ~/.ssh/config.
To generalize to many workers, we can use the rep() function. For example,
workers <- c(rep("n1.remote.org", 3), rep("n2.remote.org", 4))
sets up three workers on n1.remote.org and four on n2.remote.org, totaling seven parallel workers.
As an alternative to makeClusterPSOCK(n), we can use makeClusterPSOCK(workers) to set up parallel workers running on the local machine. By convention, the name localhost is an alias for your local machine. This means we can use:
library(parallelly)
workers <- rep("localhost", 4)
cl_local <- makeClusterPSOCK(workers)
print(cl_local)
#> Socket cluster with 4 nodes where 4 nodes are on host 'localhost'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu)
to launch four local parallel workers. Note how we did not have to specify user = "ally". This is because the default username is always the local username. Next, assume we want to add another four parallel workers running on n1.remote.org. We already know we can set these up as:
library(parallelly)
workers <- rep("n1.remote.org", 4)
cl_remote <- makeClusterPSOCK(workers, user = "alice")
print(cl_remote)
#> Socket cluster with 4 nodes where 4 nodes are on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu).
At this point, we have two independent clusters of parallel workers: cl_local and cl_remote. We can combine them into a single cluster using:
cl <- c(cl_local, cl_remote)
print(cl)
#> Socket cluster with 8 nodes where 4 nodes are on host 'localhost'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu), 4
#> nodes are on host 'n1.remote.org' (R version 4.4.2 (2024-10-31),
#> platform x86_64-pc-linux-gnu)
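The combined cluster can be used like any other cluster object. As a quick sanity check, we can, for example, ask each of the eight workers to report the hostname of the machine it runs on; the exact names returned will depend on your machines:

## Query the hostname on each worker (node)
parallel::clusterEvalQ(cl, Sys.info()[["nodename"]])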
To emphasize the usefulness of customizing our SSH connections via ~/.ssh/config, if the remote username had already been configured there, we would be able to set up the full cluster in one single call, as in:
library(parallelly)
workers <- c(rep("localhost", 4), rep("n1.remote.org", 4)
cl <- makeClusterPSOCK(workers)
Sometimes a remote machine, where we want to run R, is only accessible via an intermediate login machine, which in SSH terms may also be referred to as a “jumphost”. For example, assume machine secret1.remote.org can only be accessed by first logging into login.remote.org as in:
{ally@local}$ ssh alice@login.remote.org
{alice@login}$ ssh alice@secret1.remote.org
{alice@secret1}$
To achieve the same in a single SSH call, we can specify the “jumphost” via the -J hostname option for SSH, as in:
{ally@local}$ ssh -J alice@login.remote.org alice@secret1.remote.org
{alice@secret1}$
We can use the rshopts argument of makeClusterPSOCK() to achieve the same when setting up parallel workers. To launch three parallel workers on secret1.remote.org, use:
workers <- rep("secret1.remote.org", 3)
cl <- makeClusterPSOCK(
  workers,
  rshopts = c("-J", "login.remote.org"),
  user = "alice"
)
A more convenient solution is to configure the jumphost in ~/.ssh/config, as in:
Host *.remote.org
  User alice

Host secret?.remote.org
  ProxyJump login.remote.org
This will cause any SSH connection to a machine on the remote.org network to use username alice. It will also cause any SSH connection to machines secret1.remote.org, secret2.remote.org, and so on, to use jumphost login.remote.org. You can verify that all of this works by:
{ally@local}$ ssh login.remote.org
{alice@login}$
and then:
{ally@local}$ ssh secret1.remote.org
{alice@secret1}$
If the above works, then the following will work from within R:
library(parallelly)
workers <- rep("secret1.remote.org", 3)
cl <- makeClusterPSOCK(workers)
The above sections cover the most common use cases for setting up a parallel cluster from a local Linux, macOS, or MS Windows machine. However, there are cases where the above does not work, or where you prefer to use another solution. This section aims to cover such alternatives.
To launch parallel workers skipping any ~/.Rprofile settings on the remote machines, we can pass option --no-init-file to Rscript via argument rscript_args. For example,
workers <- rep("n1.remote.org", 2)
cl <- makeClusterPSOCK(workers, rscript_args = "--no-init-file")
will launch two parallel workers on n1.remote.org ignoring any .Rprofile files.
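Relatedly, if Rscript is installed on the remote machine but not on its PATH, as discussed earlier, one option is to point makeClusterPSOCK() to the installation explicitly via its rscript argument. This is only a sketch; the path below is a placeholder that needs to be adjusted to wherever R actually lives on the remote machine:

workers <- rep("n1.remote.org", 2)
## '/path/to/R/bin/Rscript' is a placeholder - use the actual remote location
cl <- makeClusterPSOCK(workers, rscript = "/path/to/R/bin/Rscript")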
If you run on an MS Windows machine and prefer to use PuTTY to manage your SSH connections, or for other reasons cannot use the built-in ssh client, you can tell makeClusterPSOCK() to use PuTTY and your PuTTY settings via various arguments.
Here is an example that launches a parallel worker on n1.remote.org running under user alice, connecting via SSH port 2201 using PuTTY and the public-private SSH keys in file C:/Users/ally/.ssh/putty.ppk:
workers <- "n1.remote.org"
cl <- makeClusterPSOCK(
  workers,
  user = "alice",
  rshcmd = "<putty-plink>",
  rshopts = c("-P", 2201, "-i", "C:/Users/ally/.ssh/putty.ppk")
)
Thus far we have considered our remote machines to run a Unix-like operating system, e.g. Linux or macOS. If your remote machines run MS Windows, you can use similar techniques to launch parallel workers there as well. For this to work, the remote MS Windows machines must accept incoming SSH connections, which is something most Windows machines are not configured to do by default. If you do not know how to set that up, or if you do not have the system permissions to do so, please reach out to the system administrator of those machines.
Assuming we have SSH access to two MS Windows machines, mswin1.remote.org and mswin2.remote.org, everything works the same as before, except that we also need to specify argument rscript_sh = "cmd";
workers <- c("mswin1.remote.org", "mswin2.remote.org")
cl <- makeClusterPSOCK(workers, rscript_sh = "cmd")
That argument specifies that the parallel R workers should be launched on the remote machines via MS Windows’ cmd.exe shell.
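To confirm that the parallel workers indeed run on MS Windows, we can, for example, query the operating system of each worker using the cluster created above:

## Query the operating system name on each worker
parallel::clusterEvalQ(cl, Sys.info()[["sysname"]])

Each worker should report "Windows".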