New parallel computing frameworks
Yesterday I met with Bernd Bischl’s research group at LMU in Munich. We talked about various R packages for machine learning and high performance parallel computing.
Parallel computing
rush
Marc Becker gave a talk about rush, a new parallel computing framework in R. Whereas the older batchtools package is a parallel implementation of map/reduce that can be run on clusters (no dependencies between jobs), rush can be used when each worker needs to compute something that depends on the results from other workers. The typical application is Bayesian hyper-parameter optimization, which requires choosing a new hyper-parameter combination based on all the previous evaluations.
This would have been useful for my PeakSegPath project, in which I used a postgres database (rush uses Redis). I computed a large number of optimal change-point models, for some large epigenomic data sets. Each model is a different penalty value, and the choice of new penalty values depends on the previously chosen values.
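To see why this is not embarrassingly parallel, here is a minimal sketch of the sequential dependence (choose_next_penalty() and fit_model() are hypothetical placeholders, not the actual PeakSegPath code):

results_list <- list()
for (iteration in 1:10) {
  prev_results <- do.call(rbind, results_list)   # all previous (penalty, loss) rows
  penalty <- choose_next_penalty(prev_results)   # new penalty depends on every earlier result
  fit <- fit_model(penalty)                      # expensive optimal change-point computation
  results_list[[iteration]] <- data.frame(penalty = penalty, loss = fit$loss)
}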
Apparently rush can be run inside of SLURM. Each rush worker needs to query a central redis database. So we would need to ask for one SLURM job that runs the redis database, and maybe saves results to disk periodically. And then we would also run a bunch of other SLURM jobs that are rush workers, communicating with the database. Do we have redis on Alliance Canada clusters? Two Alliance Canada clusters provide SQL databases, but not redis. Also redis is not listed on Available Software, so we would need to install it. Available Python Wheels lists redis, but I think this is the python client (not the server).
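For example, each worker job could connect to the central database using the redux R package (this is only an illustration of the Redis connectivity that would be needed; rush manages its own connections internally). REDIS_HOST is an assumed environment variable pointing at the SLURM job that runs the server:

library(redux)
## Connect to the Redis server started by a separate SLURM job.
redis_con <- hiredis(host = Sys.getenv("REDIS_HOST"), port = 6379)
redis_con$PING()   # should return "PONG" if the server is reachable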
Parallel for loops
I have written about future, which provides async and parallel computing in R:
- future.batchtools
- user2019 debrief
- R batchtools on Monsoon
- Fast parameter exploration
- Comparing machine learning frameworks in R
- Generalization to new subsets in R
- Interpretable learning algorithms with built-in feature selection
- Cross-validation with variable size train sets
- When is it useful to train with combined subsets?
- The importance of hyper-parameter tuning
- New code for various kinds of cross-validation
Most of the articles above are about machine learning benchmark experiments, which are embarrassingly parallel on data sets, algorithms, and train/test splits.
- Small experiments can be computed in sequence (no need for parallel because it is so fast); ?mlr3resampling::pvalue has an example benchmark which is fast enough to compute on CRAN.
- Medium experiments can be computed in a few minutes/hours by using all the CPUs in parallel on your PC (for example my laptop has 14 CPUs, which could be used for reasonable computation times for most of the blogs above); see the sketch after this list.
- Large experiments are not feasible to compute on your PC, and instead should be computed on a cluster (100–10,000 parallel jobs).
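For the medium case, here is a minimal sketch using future.apply to run an embarrassingly parallel grid on all local CPUs (param_grid and ML_benchmark() are hypothetical placeholders for the combinations and for the code that computes one result):

library(future.apply)
plan(multisession, workers = parallel::detectCores())  # one R worker per CPU
param_grid <- expand.grid(
  data_set = c("A", "B"),
  algorithm = c("featureless", "rpart"),
  test_fold = 1:3)
result_list <- future_lapply(seq_len(nrow(param_grid)), function(combo_i) {
  ML_benchmark(param_grid[combo_i, ])  # one (data set, algorithm, split) result
})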
A newer implementation of the same concept is mirai. The promises futures vignette explains the differences:
- future infers variables which are needed in your parallel code (easier); mirai does not (more explicit).
- mirai is faster.
- mirai supports task cancellation/interruption.
It seems mirai has been developed for faster shiny apps, and the mirai promises vignette explains how to use it for that.
| purpose         | future       | mirai                    |
|-----------------|--------------|--------------------------|
| start computing | future()     | m <- mirai()             |
| how to compute  | plan()       | daemons(.compute="name") |
| check if done   | resolved()   | unresolved()             |
| loop            | future_map() | mirai_map()              |
| get value       | value()      | m[]                      |
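A minimal sketch of the same computation in both frameworks, on a local machine with two workers (the variable names are arbitrary):

## future: plan() configures workers, and global variables are inferred.
library(future)
plan(multisession, workers = 2)
x <- 10
f <- future(x + 1)   # x is found and exported automatically
resolved(f)          # FALSE while the worker is still computing
value(f)             # 11

## mirai: daemons() configures workers, and variables are passed explicitly.
library(mirai)
daemons(2)
m <- mirai(x + 1, x = 10)   # x must be passed explicitly
unresolved(m)               # TRUE while the worker is still computing
m[]                         # 11 (waits for the result)
daemons(0)                  # shut down the workers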
mirai uses nanonext, which is an R package interface to the nng C library (nanomsg next generation). The previous generations were nanomsg and ZeroMQ, a C++ asynchronous messaging library, with corresponding R packages rzmq and clustermq (1000x less overhead compared to batchtools, so useful for running many small jobs).
foreach is an older implementation of the parallel for loop (but it does not support async evaluation of a single expression).
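For completeness, a minimal foreach sketch with the doParallel backend:

library(doParallel)
registerDoParallel(cores = 2)  # register a parallel backend for %dopar%
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2  # each iteration may run on a different worker
}
squares  # 1 4 9 16
stopImplicitCluster()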
crew
crew is an R package which imports mirai. The Introduction to crew vignette explains that it has “centralized interface and auto-scaling.” I think the centralized interface refers to the fact that there is a controller object which is used to send jobs and retrieve results (push and pop, like rush). The auto-scaling is explained: “crew automatically raises and lowers the number of workers in response to fluctuations in the task workload.” You can specify idle/wall time limits for workers, or a limit on the number of jobs each worker can process. There is a nice motivation for auto-scaling: “The two extremes of auto-scaling are clustermq-like persistent workers and future-like transient workers, and each is problematic in its own way.”
- clustermq launches many workers, which stay running/idle while waiting for tasks to be submitted via Q() (multiple tasks per worker, idle time bad).
- future launches one worker per task, and the worker terminates after finishing the task (worker launch overhead time bad).
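Here is a minimal sketch of crew auto-scaling on a local machine; the task is a placeholder that just returns the worker process ID:

library(crew)
controller <- crew_controller_local(
  workers = 4,       # upper bound on the number of workers
  seconds_idle = 10) # idle workers exit after 10 seconds (scaling down)
controller$start()
for (task_i in 1:8) {
  controller$push(
    name = paste0("task", task_i),
    command = { Sys.sleep(1); Sys.getpid() })  # placeholder computation
}
controller$wait()  # block until all tasks are done
while (!is.null(one_result <- controller$pop())) {
  print(one_result$result)  # list column containing the worker PID
}
controller$terminate()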
The crew.cluster package apparently has support for SLURM, but ?crew.cluster::crew_controller_slurm says it has not been tested. I guess in theory we should be able to use it like batchtools::submitJobs.
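A hedged sketch of what that could look like (untested; argument names beyond name, workers, and seconds_idle may differ between crew.cluster versions, so check ?crew.cluster::crew_controller_slurm):

library(crew.cluster)
slurm_controller <- crew_controller_slurm(
  name = "ml_benchmark",  # used to name the submitted SLURM jobs
  workers = 20,           # at most 20 SLURM jobs at a time
  seconds_idle = 300)     # idle workers exit after 5 minutes
slurm_controller$start()
slurm_controller$push(name = "hello", command = Sys.info()[["nodename"]])
slurm_controller$wait()
slurm_controller$pop()   # one-row table with the compute node name
slurm_controller$terminate()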
Dirk told me about doRedis, which has a vignette that explains it uses a central database to assign work to a variable number of workers. Therefore it seems very similar to mirai, or crew.cluster + targets, or clustermq.
Job pipelines
Make is the classic build system; drake was the first popular R implementation of the same idea, and is now considered superseded in favor of targets. The crew page in the targets book explains that we can run targets on the cluster by setting targets::tar_option_set(controller=crew.cluster::crew_controller_slurm(...)).
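A minimal _targets.R sketch using that setting (get_data() and fit_model() are hypothetical placeholders defined elsewhere in the project):

## _targets.R (sketch)
library(targets)
tar_option_set(
  controller = crew.cluster::crew_controller_slurm(
    workers = 10,
    seconds_idle = 300))
list(
  tar_target(data_set, get_data()),           # first pipeline step
  tar_target(model_fit, fit_model(data_set))  # depends on data_set
)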
Conclusions and future work
Like with most things in R, there are a lot of different packages for
parallel computing. In machine learning experiments, some algorithms
can take a lot of time (deep neural networks), whereas others can be
very fast (featureless baseline or linear model). Also there can be
different data sets, with very different sizes in terms of numbers of
rows and columns. For example Table 1 of our SOAK
paper shows 20 data sets, with
~1,000 to ~1,000,000 rows, and 10 to ~10,000 features (1000x
differences). So it does not really make sense to schedule these
different algorithms and data sets with the same time limits (as is
required by SLURM job arrays). A more efficient alternative would be
to use crew.cluster, which could start a certain number of SLURM
jobs, then keep sending more ML experiments to compute in them, until
there is no more work to do, after which the SLURM jobs terminate due
to the idle time limit.
This idea would also be useful in any other context with heterogeneous
run-times between jobs. Another example is data.table
nightly
revdep
checker,
which is currently implemented using SLURM shell scripts with job
dependencies (setup job, check each package in parallel, analyze
results). The dependencies between jobs could be modeled with
targets
, which could be used with crew.cluster
to compute on
SLURM. One added bit of complexity is that there are some packages
which require less memory than others, and the Heterogeneous workers page explains that this can apparently be handled using crew controller groups: targets::tar_target(resources=targets::tar_resources(crew=targets::tar_resources_crew(controller="large_mem"))).
We would have to start two different crew.cluster::crew_controller_slurm() controllers (large_mem and small_mem).
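A hedged sketch of that setup (the worker counts come from the table below; check_one_package() is a hypothetical helper, and the memory request arguments of crew_controller_slurm() are omitted because their names vary between crew.cluster versions):

library(targets)
small_mem <- crew.cluster::crew_controller_slurm(
  name = "small_mem", workers = 28, seconds_idle = 300)
large_mem <- crew.cluster::crew_controller_slurm(
  name = "large_mem", workers = 2, seconds_idle = 300)
tar_option_set(controller = crew::crew_controller_group(small_mem, large_mem))
## A check that needs the large memory controller:
tar_target(
  check_ctmm,
  check_one_package("ctmm"),  # hypothetical helper that runs R CMD check
  resources = tar_resources(crew = tar_resources_crew(controller = "large_mem")))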
To investigate: how many workers would we need? The two slowest
packages, which always take more than 9 hours, are haldensify (7 hours
per R CMD check) and ctmm (4 hours per R CMD check). There are at
least two checks required (master and CRAN versions of data.table
),
perhaps more if differences are found and git bisect is invoked.
Currently checks are run using both R-devel and R-release in the same
SLURM job, and that could perhaps be separated into two targets.
Using the most recent check times, I get the following estimated
number of workers, if we say that we want a result in 16 hours.
## Read the most recent table of revdep check jobs and their times.
> jobs_dt <- data.table::fread("https://rcdata.nau.edu/genomic-ml/data.table-revdeps/analyze/2025-05-15/full_list_of_jobs.csv")
## Parse the HH:MM:SS check.time column into integer columns, then convert to hours.
> ipat=list("[0-9]+", as.integer);nc::capture_first_df(jobs_dt, check.time=list(hours_only=ipat,":",minutes_only=ipat,":",seconds_only=ipat))
> jobs_dt[, hours := hours_only+minutes_only/60+seconds_only/60/60]
## Packages listed in large_memory.csv get 16 GB workers, the others 4 GB.
> large_dt <- data.table::fread("~/projects/data.table-revdeps/large_memory.csv")
> jobs_dt[,gigabytes := ifelse(Package %in% large_dt$Package, 16, 4)]
## Total hours per memory class, divided by the 16 hour budget, gives the number of workers.
> jobs_dt[, .(hours=sum(hours)), by=gigabytes][, workers := ceiling(hours/16)][]
   gigabytes    hours workers
       <num>    <num>   <num>
1:         4 442.2181      28
2:        16  22.8375       2
We would have to start the long jobs right away, if we want them to finish before the 16 hour time limit, because the largest job would take about 14 hours (assuming no git bisect).
Overall it seems like rush/clustermq/crew.cluster would be worth
investigating for ML benchmark experiments, and targets for the
data.table
revdep checker. In comparison to crew, rush seems to be more flexible (there is only a centralized database, with no need for a central controller that sends jobs to workers). In detail, rush workers can
each decide what they want to work on, then register that key with the
central database, so the other workers can avoid working on the same
problem. This architecture is a better fit for some problems such as
hyper-parameter search. However crew seems easier to set up (no need
for redis database), and sufficient for simpler problems like ML
benchmark experiments (where all of the combinations can be enumerated
in advance).