The purpose of this page is to compare the efficiency of different methods for parsing text.

Problem

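I recently wrote the test code below, which checks that slurm::sacct_fields returns the same fields as a base R computation using gsub and strsplit: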

sacct.txt <- system.file(
  package="slurm", "extdata", "sacct-e-rorqual-2026-03-20.txt", mustWork=TRUE)
computed <- slurm::sacct_fields(paste("cat", sacct.txt))
sacct.lines <- readLines(sacct.txt)
expected <- strsplit(gsub(" +", " ", paste(sacct.lines, collapse=" ")), " ")[[1]]
identical(computed, expected)
## [1] TRUE
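
To make the computation of `expected` concrete, here is a minimal sketch that runs the same three steps (paste, gsub, strsplit) one at a time. The `toy.lines` vector is made up for illustration, in the style of space-padded field names; it is not the contents of the real data file.

```r
## Hypothetical input: two lines of fields separated by runs of spaces.
toy.lines <- c("Account  AdminComment", "AllocCPUS AllocNodes   AllocTRES")
## Step 1: join all lines into one string, with a space between lines.
pasted <- paste(toy.lines, collapse=" ")
## Step 2: squeeze each run of spaces down to a single space.
squeezed <- gsub(" +", " ", pasted)
## Step 3: split on single spaces to get one field per element.
fields <- strsplit(squeezed, " ")[[1]]
fields
## [1] "Account"      "AdminComment" "AllocCPUS"    "AllocNodes"   "AllocTRES"
## Splitting directly on runs of spaces skips the gsub step.
identical(fields, strsplit(pasted, " +")[[1]])
## [1] TRUE
```

The last line shows that steps 2 and 3 can be merged by using `" +"` as the split pattern, at least for input like this with no leading spaces.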

Are the two methods comparable in speed?

Test

The atime package allows us to see differences in computation time as a function of data size.

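ares <- atime::atime(
  setup={
    ## Repeat the input lines N times to get data of increasing size.
    Nlines <- rep(sacct.lines, N)
  },
  ## Stop trying larger N for an expression once it exceeds this time.
  seconds.limit = 0.1,
  ## Split each line, then combine the pieces with c().
  do.call=do.call(c, strsplit(Nlines, " +")),
  ## Split each line, then flatten the list with unlist().
  unlist=unlist(strsplit(Nlines, " +")),
  ## Paste all lines into one string, then split once.
  strsplit=strsplit(paste(Nlines, collapse=" "), " +")[[1]],
  ## Extract every run of word characters with a regex.
  capture_all_str=nc::capture_all_str(Nlines, field="\\w+")$field)
plot(ares)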

[Figure: atime plot of computation time and memory as a function of N]

The plot above shows that the methods differ by constant factors in time and memory, which suggests that they share the same asymptotic complexity.
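
A timing comparison like this is only meaningful if the expressions compute the same result. The sketch below checks that for the three base R expressions, using made-up `toy.lines` and a small fixed `N` (both are assumptions for illustration); the nc expression is left out so that the code needs no extra packages.

```r
## Hypothetical input with no leading spaces on any line.
toy.lines <- c("Account  AdminComment", "AllocCPUS AllocNodes   AllocTRES")
N <- 3
Nlines <- rep(toy.lines, N)
## Evaluate the same three base R expressions used in the benchmark.
res.list <- list(
  do.call=do.call(c, strsplit(Nlines, " +")),
  unlist=unlist(strsplit(Nlines, " +")),
  strsplit=strsplit(paste(Nlines, collapse=" "), " +")[[1]])
## Compare every result to the first one.
all.same <- all(sapply(res.list[-1], identical, res.list[[1]]))
all.same
## [1] TRUE
```

Note that this agreement depends on the input: a line that starts with spaces would give an empty string as its first element when split on its own, so the results could differ on such data.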

Session info

sessionInfo()
## R Under development (unstable) (2026-02-07 r89380)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Toronto
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] directlabels_2025.6.24 vctrs_0.7.1            cli_3.6.5              knitr_1.51            
##  [5] rlang_1.1.7            xfun_0.56              otel_0.2.0             bench_1.1.4           
##  [9] generics_0.1.4         S7_0.2.1               data.table_1.18.2.1    glue_1.8.0            
## [13] nc_2026.2.20           scales_1.4.0           quadprog_1.5-8         grid_4.6.0            
## [17] evaluate_1.0.5         tibble_3.3.1           profmem_0.7.0          lifecycle_1.0.5       
## [21] compiler_4.6.0         dplyr_1.2.0            RColorBrewer_1.1-3     pkgconfig_2.0.3       
## [25] atime_2025.9.30        farver_2.1.2           lattice_0.22-9         R6_2.6.1              
## [29] tidyselect_1.2.1       pillar_1.11.1          slurm_2026.3.20        magrittr_2.0.4        
## [33] withr_3.0.2            tools_4.6.0            gtable_0.3.6           ggplot2_4.0.2