Toby Dylan Hocking | Testing PeakSegPipeline on Travis with SLURM

I originally developed my PeakSegPipeline R package to work with PBS/qsub, which is the batch/cluster system that was running on Compute Canada’s Guillimin (I was using that cluster during my postdoc at McGill). Now I am at NAU where the Monsoon cluster is running SLURM. In this note I describe how I got PeakSegPipeline running on SLURM.

First of all, I wrote the new jobs_create function, which returns a data.table with one row for every job and three columns (step, fun, arg). Each step is an integer from 1 to 6, and smaller numbers must be run first; fun is the name of a PeakSegPipeline function to be run with the argument arg. For example, step 1 computes the target interval for labeled samples/contigs via problem.target. Step 2 should be run after all step 1 jobs finish, because it is the model training, which uses the target intervals to learn a regression model. Step 3 requires the trained model to do prediction on all samples/contigs, etc.
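The shape of that table can be sketched as follows. This is a minimal illustration, not the actual output of jobs_create: it uses a plain data.frame rather than a data.table to stay dependency-free, only problem.target comes from the text above, and the other function names and all arg paths are hypothetical placeholders.

```r
# Toy job table with the three columns described above (step, fun, arg).
# Only "problem.target" is named in the text; the other function names
# and every path below are hypothetical placeholders.
jobs <- data.frame(
  step = c(1L, 1L, 2L, 3L, 3L),
  fun = c(
    "problem.target", "problem.target",
    "train.model.placeholder",
    "predict.placeholder", "predict.placeholder"),
  arg = c(
    "samples/s1/problems/chr1", "samples/s2/problems/chr1",
    "data.dir",
    "samples/s1/problems/chr1", "samples/s2/problems/chr1"),
  stringsAsFactors = FALSE)

# Group the jobs by step: jobs within one step are independent of each
# other, but a step may only start after the previous step has finished.
jobs.by.step <- split(jobs, jobs$step)
for (step.name in names(jobs.by.step)) {
  step.jobs <- jobs.by.step[[step.name]]
  cat(sprintf(
    "step %s: %d job(s) running %s\n",
    step.name, nrow(step.jobs),
    paste(unique(step.jobs$fun), collapse = ", ")))
}
```

Grouping by step is what makes the dependency structure simple: a whole step only needs to wait for the whole previous step, rather than tracking dependencies job by job.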

Then I wrote the jobs_submit_batchtools function, which takes the job data.table described above and launches the jobs on a SLURM cluster. I decided to use the batchtools R package because it has great documentation. Even though batchtools does not directly support dependencies between jobs at this time, I managed to launch jobs with dependencies by creating a new template. The idea is that batchtools uses the template to generate a shell script that is run via sbatch (the SLURM command to launch jobs). My template adds a line like #SBATCH --depend=afterok:PREV_JOB_ID, where PREV_JOB_ID is the job ID of the previous step. The only trick is that for each step I had to create a “registry”, which is a directory that contains metadata about the jobs (shell scripts, logs, etc). The registry directories I created are in data.dir/registry/STEP, where STEP is the step number (from 1 to 6). To tell batchtools to use the special template, I put the following code in my ~/.batchtools.conf.R file:

cluster.functions = makeClusterFunctionsSlurm(system.file(
  file.path("templates", "slurm-afterok.tmpl"),
  package="PeakSegPipeline",
  mustWork=TRUE))
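
To make the dependency mechanism concrete, here is a minimal sketch of the two pieces of bookkeeping described above: the per-step registry path and the extra sbatch header line. The helper names registry_dir and afterok_line are hypothetical and are not part of PeakSegPipeline or batchtools; the real template may well build the line differently.

```r
# Hypothetical helpers, for illustration only.

# Registry directory for one step: data.dir/registry/STEP.
registry_dir <- function(data.dir, step) {
  file.path(data.dir, "registry", as.character(step))
}

# Header line that makes a job wait until the given jobs have finished
# successfully. Several IDs are joined with ":" (afterok:ID1:ID2).
# The first step has no previous jobs, so no line is emitted for it.
afterok_line <- function(prev.job.ids) {
  if (length(prev.job.ids) == 0) {
    return(character(0))
  }
  paste0(
    "#SBATCH --depend=afterok:",
    paste(prev.job.ids, collapse = ":"))
}

registry_dir("my_data", 2)      # "my_data/registry/2"
afterok_line(character(0))      # character(0)
afterok_line(c("101", "102"))   # "#SBATCH --depend=afterok:101:102"
```

Since afterok only releases a job when the jobs it depends on exit successfully, a failure in an early step should keep the later steps from running on incomplete results.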

Next, to test that these new functions work properly, I had to install SLURM on Travis by adding the following commands to my .travis.yml config file:

before_install:
  - sudo mkdir /etc/slurm-llnl
  - sudo cp slurm.conf /etc/slurm-llnl
  - cp batchtools.conf.R tests/testthat
  - sudo apt-get install -y slurm-llnl
  - sudo /usr/sbin/create-munge-key
  - sudo service munge start

The first two lines install a basic SLURM config file, which I generated on my laptop using /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html. Note that Travis runs Ubuntu 14.04, which has the slurm-llnl package, whereas my laptop runs Ubuntu 18.04, which has the slurm-wlm package. The last two lines were required to avoid the following error:

── 1. Error: index.html is created via batchtools (@test-pipeline-noinput.R#149)
Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=R,S,CG --noheader --format=%i -r'
output:
squeue: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
squeue: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
squeue: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
squeue: error: authentication: Socket communication error
slurm_load_jobs error: Protocol authentication error
1: jobs_submit_batchtools(jobs, res.list) at testthat/test-pipeline-noinput.R:149
2: batchtools::submitJobs(chunks, resources = resources, reg = reg)
3: .findOnSystem(reg = reg, cols = c("job.id", "batch.id"))
4: getBatchIds(reg, status = "all")
5: unique(cf$listJobsRunning(reg))
6: cf$listJobsRunning(reg)
7: listJobs(reg, args)
8: OSError("Listing of jobs failed", res)
9: stopf("%s (exit code %i);\ncmd: '%s'\noutput:\n%s", msg, res$exit.code, stri_flatten(c(res$sys.cmd, 
       res$sys.args), collapse = " "), stri_flatten(res$output, "\n"))

In the testing code that launches the SLURM jobs, I used the following code to wait until the last step of the pipeline has finished:

  reg.dir <- file.path(data.dir, "registry", "6")
  reg <- batchtools::loadRegistry(reg.dir)
  result <- batchtools::waitForJobs(reg=reg, sleep=function(i){
    system("squeue")
    10
  })

Note that squeue is called to display the job progress every 10 seconds, in order to avoid falsely failing builds (Travis kills builds that produce no output for 10 minutes).
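
The same idea, stripped of batchtools and SLURM, is a polling loop that prints something on every iteration so that the CI log never goes quiet. The sketch below is only an illustration: wait_with_heartbeat and its arguments are hypothetical names, and the is.done callback stands in for whatever check reports that the jobs have finished.

```r
# Hypothetical helper: poll is.done() until it returns TRUE, printing a
# heartbeat line before each sleep so the build log keeps growing.
# Returns TRUE if the jobs finished, FALSE if max.iterations ran out.
wait_with_heartbeat <- function(
    is.done, sleep.seconds = 10, max.iterations = 360) {
  for (i in seq_len(max.iterations)) {
    if (is.done()) {
      return(TRUE)
    }
    cat(sprintf(
      "[%s] still waiting (check %d)\n",
      format(Sys.time(), "%H:%M:%S"), i))
    Sys.sleep(sleep.seconds)
  }
  FALSE
}

# Toy usage: pretend the jobs finish on the third check.
checks <- 0
finished <- wait_with_heartbeat(
  is.done = function() {
    checks <<- checks + 1
    checks >= 3
  },
  sleep.seconds = 0.1)
```

The iteration cap is a design choice of this sketch: without it, a job that never finishes would keep the build alive until the CI service's overall time limit.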

Finally, I had to apt-get install texlive texlive-fonts-extra texinfo to avoid the following WARNING about incorrectly generated R documentation:

* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.