Toby Dylan Hocking | New packages for data storage and reshaping

This morning I was doing some reading on statistical software that may be worth mentioning as related work in my R Journal submission about data reshaping using regular expressions. For that paper I performed computational experiments in which I recorded the timings of various R functions for data reshaping, as a function of data set size. Recently, I computed results for the tidyfast::dt_pivot_longer function. Since it is a wrapper on top of data.table::melt, I expected it to be about as fast, and it is in the case of 0 capture columns (no conversion of reshaped variable names). However, in the case of 4 capture columns it is actually a bit slower, because these capture columns currently must be created in a post-processing step. The author plans to eventually support additional features that should make this computation “fast”, as the package name suggests.

On the data.table Articles wiki page I found a few other interesting related works. One article describes a comparison with Stata data reshaping tools. Another article describes the tidyfst package (yes, the “a” in “fast” is missing, and it has no relation to the other “tidyfast” package), which supports some basic reshaping using the tidyfst::longer_dt function. Again, this is just a wrapper on top of data.table::melt. The main difference/novelty of my proposed nc::capture_melt_* functions is that a concise/non-repetitive regular expression syntax is used to define both the set of input columns to reshape and the names/types of the output capture columns.
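To make the idea of regex-driven reshaping concrete, here is a minimal sketch in plain Python (standard library only). It is not the nc API and not how data.table::melt works internally; the function name capture_melt, the example column names, and the group names part and dim are all hypothetical choices for illustration. One pattern does both jobs: columns that match it are reshaped, and its named groups become the output capture columns.

```python
import re

def capture_melt(rows, pattern, value_name="value"):
    """Reshape wide rows (a list of dicts) into long rows.

    Hypothetical helper for illustration only. Columns whose names fully
    match `pattern` are reshaped; each named group in the pattern becomes
    an output capture column. All other columns are copied as identifiers.
    """
    regex = re.compile(pattern)
    long_rows = []
    for row in rows:
        # Columns that do not match the pattern are kept as id columns.
        id_part = {k: v for k, v in row.items() if regex.fullmatch(k) is None}
        for name, value in row.items():
            match = regex.fullmatch(name)
            if match is None:
                continue
            out = dict(id_part)
            out.update(match.groupdict())  # capture columns, e.g. part and dim
            out[value_name] = value
            long_rows.append(out)
    return long_rows

# One wide row with four measurement columns (made-up example data).
wide = [{
    "Species": "setosa",
    "Sepal.Length": 5.1, "Sepal.Width": 3.5,
    "Petal.Length": 1.4, "Petal.Width": 0.2,
}]

# The same regex selects the input columns and names the capture columns.
long = capture_melt(wide, r"(?P<part>[^.]+)[.](?P<dim>[^.]+)")
for r in long:
    print(r)
```

In this sketch the capture columns are always strings; converting them to other types would be an extra step, which is the kind of post-processing that a capture-aware melt can avoid.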

So what does the tidyfst package name mean? Well, the “tidy” is a reference to the tidyverse, which provides some popular packages such as dplyr for data manipulation (tidyfst mimics dplyr syntax). On the topic of the tidyverse, there is a funny post by Holger K. von Jouanne-Diedrich about why he does not use it, which has an even funnier comment about how one recruiter views (unfavorably) tidyverse fanboys. I’m not so dogmatic, and I actually think the tidyverse is really great for the R community. Because of its ease of use and quality of documentation, it makes R much easier for newbies to learn.

And actually the “fst” in “tidyfst” is a reference to its support for the fst package for data table serialization. The fst benchmarks show that it is apparently very fast, and it supports “random access”, which means that you don’t have to read the whole file into memory; instead you can specify row start/end indices and column names to read. Some other new serialization formats include feather and parquet, which can be quickly read into memory as either data table or arrow format. That may be useful for big data analysis, but for now using CSV (plain text rather than binary files) for most of my projects is simpler, and fast enough thanks to data.table::fread.
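For readers unfamiliar with what “random access” buys you, here is a toy sketch in Python (standard library only). It is emphatically not the fst file format, which I have not reproduced here; the layout (uncompressed 8-byte doubles stored column by column, with a separate in-memory index) and the names write_columns and read_slice are assumptions made up for this illustration. The point is only that with a known layout, a reader can seek straight to the requested rows of the requested column.

```python
import io
import struct

DOUBLE = struct.Struct("<d")  # one 8-byte little-endian float per value

def write_columns(buffer, columns):
    """Write each column contiguously; return {name: (byte offset, n rows)}.

    Toy layout assumed for illustration, not a real file format.
    """
    index = {}
    for name, values in columns.items():
        index[name] = (buffer.tell(), len(values))
        for v in values:
            buffer.write(DOUBLE.pack(v))
    return index

def read_slice(buffer, index, name, start, end):
    """Read rows start..end-1 of one column, skipping all other bytes."""
    offset, n_rows = index[name]
    end = min(end, n_rows)
    # Jump directly to the first requested value of this column.
    buffer.seek(offset + start * DOUBLE.size)
    raw = buffer.read((end - start) * DOUBLE.size)
    return [x[0] for x in DOUBLE.iter_unpack(raw)]

buf = io.BytesIO()  # stands in for a file on disk
idx = write_columns(buf, {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})
print(read_slice(buf, idx, "y", 1, 3))  # prints [20.0, 30.0]
```

A plain-text CSV cannot be read this way in general, because rows have variable byte lengths, so a reader has to scan from the start to find a given row.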

One exception is the RcppDeepState project, which has been graciously funded by the R Consortium: there we use qs to serialize arbitrary R objects that are randomly generated during fuzz testing. Actually, we would have used readRDS for simplicity (to minimize package dependencies), but it did not work from within RInside in a DeepState test harness.