New parallel computing frameworks
Yesterday I met with Bernd Bischl’s research group at LMU in Munich. We talked about various R packages for machine learning and high performance parallel computing.
Parallel computing
rush
Marc Becker gave a talk about rush, a new parallel computing framework in R. Whereas the older batchtools package is a parallel implementation of map/reduce that can be run on clusters (no dependencies between jobs), rush can be used when each worker needs to compute something that depends on the results from other workers. The typical application is Bayesian hyper-parameter optimization, which requires choosing a new hyper-parameter combination based on all the previous evaluations.
This would have been useful for my PeakSegPath project, in which I used a postgres database (rush uses Redis). I computed a large number of optimal change-point models, for some large epigenomic data sets. Each model is a different penalty value, and the choice of new penalty values depends on the previously chosen values.
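To see why this is not embarrassingly parallel, here is a minimal sketch of the sequential dependence (choose_next_penalty() and fit_model() are hypothetical placeholders, not the actual PeakSegPath code):

results_list <- list()
for (iteration in 1:10) {
  prev_results <- do.call(rbind, results_list)   # all previous (penalty, loss) rows
  penalty <- choose_next_penalty(prev_results)   # new penalty depends on every earlier result
  fit <- fit_model(penalty)                      # expensive optimal change-point computation
  results_list[[iteration]] <- data.frame(penalty = penalty, loss = fit$loss)
}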
Apparently rush can be run inside of SLURM. Each rush worker needs to query a central redis database. So we would need to ask for one SLURM job that runs the redis database, and maybe saves results to disk periodically. And then we would also run a bunch of other SLURM jobs that are rush workers, communicating with the database. Do we have redis on Alliance Canada clusters? Two Alliance Canada clusters provide SQL databases, but not redis. Also redis is not listed on Available Software, so we would need to install it. Available Python Wheels lists redis, but I think this is the python client (not the server).
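For example, each worker job could connect to the central database using the redux R package (this is only an illustration of the Redis connectivity that would be needed; rush manages its own connections internally). REDIS_HOST is an assumed environment variable pointing at the SLURM job that runs the server:

library(redux)
## Connect to the Redis server started by a separate SLURM job.
redis_con <- hiredis(host = Sys.getenv("REDIS_HOST"), port = 6379)
redis_con$PING()   # should return "PONG" if the server is reachable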
Parallel for loops
I have written about future, which provides async and parallel computing in R:
- future.batchtools
- user2019 debrief
- R batchtools on Monsoon
- Fast parameter exploration
- Comparing machine learning frameworks in R
- Generalization to new subsets in R
- Interpretable learning algorithms with built-in feature selection
- Cross-validation with variable size train sets
- When is it useful to train with combined subsets?
- The importance of hyper-parameter tuning
- New code for various kinds of cross-validation
Most of the articles above are about machine learning benchmark experiments, which are embarrassingly parallel on data sets, algorithms, and train/test splits.
- Small experiments can be computed in sequence (no need for parallel because it is so fast); ?mlr3resampling::pvalue has an example benchmark which is fast enough to compute on CRAN.
- Medium experiments can be computed in a few minutes/hours by using all the CPUs in parallel on your PC (for example my laptop has 14 CPUs, which could be used for reasonable computation times for most of the blogs above); see the sketch after this list.
- Large experiments are not feasible to compute on your PC, and instead should be computed on a cluster (100–10,000 parallel jobs).
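For the medium case, here is a minimal sketch using future.apply to run an embarrassingly parallel grid on all local CPUs (param_grid and ML_benchmark() are hypothetical placeholders for the combinations and for the code that computes one result):

library(future.apply)
plan(multisession, workers = parallel::detectCores())  # one R worker per CPU
param_grid <- expand.grid(
  data_set = c("A", "B"),
  algorithm = c("featureless", "rpart"),
  test_fold = 1:3)
result_list <- future_lapply(seq_len(nrow(param_grid)), function(combo_i) {
  ML_benchmark(param_grid[combo_i, ])  # one (data set, algorithm, split) result
})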
A newer implementation of the same concept is mirai. The promises futures vignette explains the differences:
- future infers variables which are needed in your parallel code (easier); mirai does not (more explicit).
- mirai is faster.
- mirai supports task cancellation/interruption.
It seems mirai has been developed for faster shiny apps, and the mirai promises vignette explains how to use it for that.
| purpose         | future       | mirai                    |
|-----------------|--------------|--------------------------|
| start computing | future()     | m <- mirai()             |
| how to compute  | plan()       | daemons(.compute="name") |
| check if done   | resolved()   | unresolved()             |
| loop            | future_map() | mirai_map()              |
| get value       | value()      | m[]                      |
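A minimal sketch of the same computation in both frameworks, on a local machine with two workers (the variable names are arbitrary):

## future: plan() configures workers, and global variables are inferred.
library(future)
plan(multisession, workers = 2)
x <- 10
f <- future(x + 1)   # x is found and exported automatically
resolved(f)          # FALSE while the worker is still computing
value(f)             # 11

## mirai: daemons() configures workers, and variables are passed explicitly.
library(mirai)
daemons(2)
m <- mirai(x + 1, x = 10)   # x must be passed explicitly
unresolved(m)               # TRUE while the worker is still computing
m[]                         # 11 (waits for the result)
daemons(0)                  # shut down the workers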
mirai uses nanonext, which is an R package interface to the nng C library (nanomsg next generation). The previous generations were nanomsg and ZeroMQ, a C++ asynchronous messaging library, with corresponding R packages rzmq and clustermq (1000x less overhead compared to batchtools, so useful for running many small jobs).
foreach is an older implementation of the parallel for loop (but it does not support async evaluation of a single expression).
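For completeness, a minimal foreach sketch with the doParallel backend:

library(doParallel)
registerDoParallel(cores = 2)  # register a parallel backend for %dopar%
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2  # each iteration may run on a different worker
}
squares  # 1 4 9 16
stopImplicitCluster()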
crew
crew is an R package which imports mirai. The Introduction to crew vignette explains that it has “centralized interface and auto-scaling.” I think the centralized interface refers to the fact that there is a controller object which is used to send jobs and retrieve results (push and pop, like rush). The auto-scaling is explained: “crew automatically raises and lowers the number of workers in response to fluctuations in the task workload.” You can specify idle/wall time limits for workers, or a limit on the number of jobs each worker can process. There is a nice motivation for auto-scaling: “The two extremes of auto-scaling are clustermq-like persistent workers and future-like transient workers, and each is problematic in its own way.”
- clustermq launches many workers, which stay running/idle while waiting for tasks to be submitted via Q() (multiple tasks per worker, idle time bad).
- future launches one worker per task, and the worker terminates after finishing the task (worker launch overhead time bad).
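Here is a minimal sketch of crew auto-scaling on a local machine; the task is a placeholder that just returns the worker process ID:

library(crew)
controller <- crew_controller_local(
  workers = 4,       # upper bound on the number of workers
  seconds_idle = 10) # idle workers exit after 10 seconds (scaling down)
controller$start()
for (task_i in 1:8) {
  controller$push(
    name = paste0("task", task_i),
    command = { Sys.sleep(1); Sys.getpid() })  # placeholder computation
}
controller$wait()  # block until all tasks are done
while (!is.null(one_result <- controller$pop())) {
  print(one_result$result)  # list column containing the worker PID
}
controller$terminate()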
The crew.cluster package apparently has support for SLURM, but ?crew.cluster::crew_controller_slurm says it has not been tested. I guess in theory we should be able to use it like batchtools::submitJobs.
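A hedged sketch of what that could look like (untested; argument names beyond name, workers, and seconds_idle may differ between crew.cluster versions, so check ?crew.cluster::crew_controller_slurm):

library(crew.cluster)
slurm_controller <- crew_controller_slurm(
  name = "ml_benchmark",  # used to name the submitted SLURM jobs
  workers = 20,           # at most 20 SLURM jobs at a time
  seconds_idle = 300)     # idle workers exit after 5 minutes
slurm_controller$start()
slurm_controller$push(name = "hello", command = Sys.info()[["nodename"]])
slurm_controller$wait()
slurm_controller$pop()   # one-row table with the compute node name
slurm_controller$terminate()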
Dirk told me about doRedis, which has a vignette that explains it uses a central database to assign work to a variable number of workers. Therefore it seems very similar to mirai, or crew.cluster + targets, or clustermq.
Job pipelines
Make is the classic build system; drake was the first popular R implementation of the same idea, and is now considered superseded in favor of targets. The crew page in the targets book explains that we can run targets on the cluster by setting targets::tar_option_set(controller=crew.cluster::crew_controller_slurm(...)).
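A minimal _targets.R sketch using that setting (get_data() and fit_model() are hypothetical placeholders defined elsewhere in the project):

## _targets.R (sketch)
library(targets)
tar_option_set(
  controller = crew.cluster::crew_controller_slurm(
    workers = 10,
    seconds_idle = 300))
list(
  tar_target(data_set, get_data()),           # first pipeline step
  tar_target(model_fit, fit_model(data_set))  # depends on data_set
)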
Conclusions and future work
Like with most things in R, there are a lot of different packages for
parallel computing. In machine learning experiments, some algorithms
can take a lot of time (deep neural networks), whereas others can be
very fast (featureless baseline or linear model). Also there can be
different data sets, with very different sizes in terms of numbers of
rows and columns. For example Table 1 of our SOAK
paper shows 20 data sets, with
~1,000 to ~1,000,000 rows, and 10 to ~10,000 features (1000x
differences). So it does not really make sense to schedule these
different algorithms and data sets with the same time limits (as is
required by SLURM job arrays). A more efficient alternative would be
to use crew.cluster, which could start a certain number of SLURM
jobs, then keep sending more ML experiments to compute in them, until
there is no more work to do, after which the SLURM jobs terminate due
to the idle time limit.
This idea would also be useful in any other context with heterogeneous
run-times between jobs. Another example is data.table
nightly
revdep
checker,
which is currently implemented using SLURM shell scripts with job
dependencies (setup job, check each package in parallel, analyze
results). The dependencies between jobs could be modeled with
targets
, which could be used with crew.cluster
to compute on
SLURM. One added bit of complexity is that there are some packages
which require less memory than others, and the Heterogeneous workers page explains that this can apparently be handled using crew controller groups: targets::tar_target(resources=targets::tar_resources(crew=targets::tar_resources_crew(controller="large_mem"))).
We would have to start two different crew.cluster::crew_controller_slurm() controllers (large_mem and small_mem).
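A hedged sketch of that setup (the worker counts come from the table below; check_one_package() is a hypothetical helper, and the memory request arguments of crew_controller_slurm() are omitted because their names vary between crew.cluster versions):

library(targets)
small_mem <- crew.cluster::crew_controller_slurm(
  name = "small_mem", workers = 28, seconds_idle = 300)
large_mem <- crew.cluster::crew_controller_slurm(
  name = "large_mem", workers = 2, seconds_idle = 300)
tar_option_set(controller = crew::crew_controller_group(small_mem, large_mem))
## A check that needs the large memory controller:
tar_target(
  check_ctmm,
  check_one_package("ctmm"),  # hypothetical helper that runs R CMD check
  resources = tar_resources(crew = tar_resources_crew(controller = "large_mem")))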
To investigate: how many workers would we need? The two slowest
packages, which always take more than 9 hours, are haldensify (7 hours
per R CMD check) and ctmm (4 hours per R CMD check). There are at
least two checks required (master and CRAN versions of data.table
),
perhaps more if differences are found and git bisect is invoked.
Currently checks are run using both R-devel and R-release in the same
SLURM job, and that could perhaps be separated into two targets.
Using the most recent check times, I get the following estimated
number of workers, if we say that we want a result in 16 hours.
## Read the most recent table of revdep check jobs and their times.
> jobs_dt <- data.table::fread("https://rcdata.nau.edu/genomic-ml/data.table-revdeps/analyze/2025-05-15/full_list_of_jobs.csv")
## Parse the HH:MM:SS check.time column into integer columns, then convert to hours.
> ipat=list("[0-9]+", as.integer);nc::capture_first_df(jobs_dt, check.time=list(hours_only=ipat,":",minutes_only=ipat,":",seconds_only=ipat))
> jobs_dt[, hours := hours_only+minutes_only/60+seconds_only/60/60]
## Packages listed in large_memory.csv get 16 GB workers, the others 4 GB.
> large_dt <- data.table::fread("~/projects/data.table-revdeps/large_memory.csv")
> jobs_dt[,gigabytes := ifelse(Package %in% large_dt$Package, 16, 4)]
## Total hours per memory class, divided by the 16 hour budget, gives the number of workers.
> jobs_dt[, .(hours=sum(hours)), by=gigabytes][, workers := ceiling(hours/16)][]
   gigabytes    hours workers
       <num>    <num>   <num>
1:         4 442.2181      28
2:        16  22.8375       2
We would have to start the long jobs right away, if we want them to finish before the 16 hour time limit, because the largest job would take about 14 hours (assuming no git bisect).
Overall it seems like rush/clustermq/crew.cluster would be worth
investigating for ML benchmark experiments, and targets for the
data.table
revdep checker. In comparison to crew, rush seems to be more flexible (there is only a centralized database, with no need for a central controller that sends jobs to workers). In detail, rush workers can
each decide what they want to work on, then register that key with the
central database, so the other workers can avoid working on the same
problem. This architecture is a better fit for some problems such as
hyper-parameter search. However crew seems easier to set up (no need
for redis database), and sufficient for simpler problems like ML
benchmark experiments (where all of the combinations can be enumerated
in advance).