Comparing asymptotic timings of CSV read/write functions
Recently I submitted a project proposal to the NSF POSE program, about expanding the open-source ecosystem around the R data.table package. Part of the project involves developing new performance testing tools that allow developers to see whether their PRs cause any differences in time/memory usage. To implement this part of the project, I created the atime R package for asymptotic performance testing, which means measuring and plotting time/memory usage as a function of data size.
Regular expressions in base R
This kind of analysis is useful for empirically determining the asymptotic complexity class (linear, quadratic, etc.) of a given function. For example, the R function gregexpr() implements regular expression matching, and was quadratic time until I submitted a linear time patch in 2019. You can see the difference on the log-log plot (time vs data size); quadratic time functions appear as lines with larger slopes, as in the R-3.5.2 panel. So asymptotic complexity testing can be useful for identifying parts of the code with sub-optimal performance, which could then be improved.
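To give an idea of what such a measurement looks like in code, here is a minimal sketch using atime; the pattern, subject sizes, and time limit below are illustrative choices, not the exact code from the gregexpr analysis.

library(atime)
## Sketch: measure gregexpr() time/memory as a function of subject size N.
atime.gregexpr <- atime::atime(
  N = 2^seq(2, 20),
  setup = {
    subject <- paste(rep("ab", N), collapse = "")
  },
  gregexpr = gregexpr("a", subject),
  seconds.limit = 0.1)
plot(atime.gregexpr)  # log-log plot of time/memory versus N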
Reading and Writing CSV
The data.table package provides the fread() and fwrite() functions, which implement reading and writing of CSV data files. I used atime to compare the asymptotic complexity of these functions with their analogs in base R and the tidyverse. The code I used is in the compare-data.table-tidyverse.Rmd vignette, which was rendered on a NAU Monsoon compute node (up to 64 cores/threads), and on my personal MacBook Pro laptop (up to 2 cores/threads), both running R-4.2.3 (more reproducibility details below).
Reading CSV, variable number of rows
We begin with an example comparison where there is not much difference between the various methods. Below we plot the memory (top) and computation time (bottom) for reading CSV files with N rows, where N is on the X/horizontal axis, and the different methods are shown in different colors (darker shades of red/purple/green mean more threads).
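The full code is in the vignette linked above; the following is a condensed sketch of how such a comparison can be set up with atime (the column contents, sizes, and time limit are illustrative).

library(atime)
## Sketch: time/memory for reading a CSV with N rows, one expression per method.
atime.read.rows <- atime::atime(
  N = 10^seq(1, 6),
  setup = {
    csv.file <- tempfile(fileext = ".csv")
    data.table::fwrite(data.frame(x = rnorm(N), y = rnorm(N)), csv.file)
  },
  seconds.limit = 1,
  "data.table::fread" = data.table::fread(csv.file),
  "readr::read_csv" = readr::read_csv(csv.file, show_col_types = FALSE),
  "utils::read.csv" = utils::read.csv(csv.file))
plot(atime.read.rows)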
The results above show that in terms of time, all methods are asymptotically equivalent (same slope). It is clear that data.table::fread is fastest for small numbers of rows, and readr::read_csv is slightly faster (by constant factors) for larger numbers of rows. It is also clear that using two threads instead of one is faster (by constant factors). In terms of memory usage, readr::read_csv has a clear advantage in this comparison (constant versus linear asymptotic memory usage for the others), because it does not read all of the data into memory until the data are actually used in R for computation. The plot above was from my laptop, and the plot below is the same comparison, run on a compute node of the NAU Monsoon cluster.
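To vary the number of threads as in these plots, one possibility is to define one timed expression per thread count, which can be passed to atime via its expr.list argument; a sketch, with illustrative expression names:

## Sketch: one expression per thread count, passed via atime's expr.list argument.
expr.list <- list(
  "fread.1" = quote({
    data.table::setDTthreads(1)
    data.table::fread(csv.file)
  }),
  "fread.2" = quote({
    data.table::setDTthreads(2)
    data.table::fread(csv.file)
  }),
  "read_csv.2" = quote(
    readr::read_csv(csv.file, num_threads = 2, show_col_types = FALSE)))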
The plot above shows essentially the same results as the previous one, except that there are three different values for the number of threads (1, 32, 64). For data.table::fread and for readr::read_csv(lazy=FALSE), the number of threads does not make much difference. For readr::read_csv(lazy=TRUE), there are speedups when reading larger numbers of rows with larger numbers of threads. Below we do another comparison, in which we collapse each row after reading it into memory, as sketched next.
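A sketch of what such a "read then collapse" operation might look like (the helper name is hypothetical, not from the vignette):

## Hypothetical helper: read a CSV, then touch every value by collapsing rows,
## which defeats lazy reading because all columns must be materialized.
read.then.collapse <- function(read.fun, csv.file){
  df <- read.fun(csv.file)
  apply(df, 1, paste, collapse = ",")
}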
The plot above shows little difference between the methods. In particular, the asymptotic difference in memory usage has disappeared, indicating that there is no advantage to lazy data loading if all of the data are actually used in downstream R computations.
Reading CSV, variable number of columns
In the comparison below, the data size that we vary on the X/horizontal axis is the number of columns (whereas in the previous section it was the number of rows).
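A sketch of the setup for this comparison, writing a small CSV with N columns inside atime's setup argument (contents illustrative):

## Sketch: vary the number of columns N instead of the number of rows.
atime.read.cols <- atime::atime(
  N = 2^seq(1, 12),
  setup = {
    csv.file <- tempfile(fileext = ".csv")
    one.row <- as.data.frame(matrix(rnorm(N), nrow = 1))  # 1 row, N columns
    data.table::fwrite(one.row, csv.file)
  },
  "data.table::fread" = data.table::fread(csv.file),
  "readr::read_csv" = readr::read_csv(csv.file, show_col_types = FALSE),
  "utils::read.csv" = utils::read.csv(csv.file))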
The plot above shows that both time and memory are linear for most methods, except that time is super-linear for utils::read.csv(). There is no advantage to using multiple threads. The plot above was for the NAU Monsoon compute node; the same trends are also observable when run on my laptop, as shown in the figure below.
Writing CSV, variable number of columns
Below we show results for writing CSV files (the previous sections were about reading rather than writing), with different numbers of character columns (displayed on the X/horizontal axis).
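A sketch of how this write comparison can be set up (the row count and cell contents are illustrative):

## Sketch: time/memory for writing a table with N character columns.
atime.write.cols <- atime::atime(
  N = 2^seq(1, 12),
  setup = {
    chr.df <- data.frame(matrix("some text", nrow = 10, ncol = N))
    csv.file <- tempfile(fileext = ".csv")
  },
  "data.table::fwrite" = data.table::fwrite(chr.df, csv.file),
  "readr::write_csv" = readr::write_csv(chr.df, csv.file),
  "utils::write.csv" = utils::write.csv(chr.df, csv.file, row.names = FALSE))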
There is an interesting trend in the plot above: data.table::fwrite() is asymptotically more efficient (smaller slope) than both of the other methods. To more clearly see the asymptotic complexity class of each method, we can add reference lines, as shown below.
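In atime, reference lines can be computed with the references_best() function; a minimal sketch, assuming the result of the write comparison above is stored in atime.write.cols:

## Sketch: overlay the closest asymptotic references (N, N log N, N^2, ...).
refs.write <- atime::references_best(atime.write.cols)
plot(refs.write)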
The plot above shows that data.table::fwrite() is linear (O(N)) in the number of columns, whereas the other methods are quadratic (O(N^2)). The plot below shows similar trends for writing real (numeric) columns.
Finally, the plot below shows that all methods are asymptotically similar (linear time and constant memory), except that utils::write.csv() has much larger memory usage.
Conclusion
Asymptotic performance testing is useful for identifying parts of code which can be improved in terms of time or memory usage. We have identified sub-optimal asymptotic complexity in the following functions:
- utils::read.csv() could improve time usage, for large numbers of columns, https://github.com/tdhock/atime/issues/8.
- utils::write.csv() and readr::write_csv() could improve both time and memory usage, for large numbers of columns, https://github.com/tdhock/atime/issues/9.
- utils::write.csv() could improve memory usage, for large numbers of rows, https://github.com/tdhock/atime/issues/10.
We observed that data.table::fread()
and data.table::fwrite()
provide implementations of CSV reading/writing that are asymptotically
optimal, in terms of both time and memory usage. For future work, we
would like to create examples that more clearly show how
multi-threading could be used for speed improvements.
UPDATE 26 June 2023: Sebastian Meyer and Gabe Becker have helped to fix some of the issues in write.csv (and a related issue with replacing columns of data frames); the fixes are available starting with R-4.3.0 (released 21 April 2023). I have computed the benchmarks above using several versions of R, for reproducibility:
- R-4.2.3, which was the version of R used to create the figures above, and the last release with a write.csv that is quadratic in the number of columns.
- R-4.3.0, which was the first release of R with a write.csv that is linear in the number of columns.
- the most recent release of R, to check when/if the other issues are solved.
To re-run the benchmarks we can use the code below:
th798@wind:~$ srun -t 24:00:00 --mem=32GB --cpus-per-task=4 --pty bash
srun: job 64668716 queued and waiting for resources
srun: job 64668716 has been allocated resources
th798@cn63:~$ module load anaconda3
th798@cn63:~$ conda activate emacs1
(emacs1) th798@cn63:~$ R -e 'rmarkdown::render("~/R/atime/vignettes/compare-data.table-tidyverse.Rmd")'