Toby Dylan Hocking | Historical reverse imports

I recently read Norm Matloff’s TidyverseSkeptic essay, which describes the advantages of teaching base R rather than the tidyverse (a collection of R packages that re-implements some base R functionality using a different API). I think that two properties of the tidyverse make it difficult to use and teach: (1) there are so many different packages and functions, and (2) the packages change (and functions are deprecated) so frequently. You could argue that base R also suffers from (1), since there are a lot of functions in base R. But base R has a clear win in (2), due to its focus on backwards compatibility.

Another package which re-implements some base R functionality is data.table, which provides some functions for reading/writing CSV files (fread/fwrite), reshaping data (melt/dcast), and aggregation by group (single square bracket with by argument). Base R has similar functions for each purpose (read.csv, write.csv, reshape, tapply). Similar tidyverse functions are provided in three different packages (readr, tidyr, dplyr). There is an older/deprecated version of data reshaping in the reshape2 package, and an older/deprecated version of aggregation by group in the plyr package. I studied some of these functions in my recent R Journal research paper on data reshaping.

I use data.table for research and teaching, and I find that it is quite easy to use for both purposes, because of its terse SQL-like syntax, and the relatively small number of functions and concepts that need to be learned. Its efficiency is a major reason that I keep using R for most of my research. Norm Matloff described data.table as “a technically superior competitor to dplyr.” So I would expect that there would be a large number of people using data.table. But Norm Matloff claims that there are even more people using tidyverse due to RStudio’s marketing. To what extent is that true?

We can tell how many other R package developers are using a given R package by looking at the reverse dependencies on the corresponding CRAN pages. For example, the CRAN page for data.table currently lists about one full browser screen of other packages under Reverse imports, meaning packages which import and use at least one function from data.table. That seems like a lot, but is it? I looked at the corresponding page for dplyr and its list was at least twice as long. Has that always been the case? Did data.table ever have more reverse dependencies than dplyr? I set about to quickly answer these questions using the R code below.

The reverse dependencies for each CRAN package are stored in the packages.rds file (currently a 7.3MB binary file that can be read into R using the readRDS function). But that file just has the most recent data. What about historical data?

Lucky for us, the Microsoft R Application Network (MRAN) keeps a time machine of daily CRAN snapshots, so we can access historical packages.rds files going back to its inception on September 17th, 2014. For example here is how we would download the data from that first day:

library(data.table)
get_packages <- function(date){
  date.str <- paste(date)
  # Cache each snapshot in its own directory, so it is downloaded only once.
  date.dir <- file.path("~/R/dt-deps-time", date.str)
  dir.create(date.dir,showWarnings=FALSE,recursive=TRUE)
  packages.rds <- file.path(date.dir, "packages.rds")
  if(!file.exists(packages.rds)){
    # Build the MRAN snapshot URL for the requested date.
    u <- paste0(
      "https://cran.microsoft.com/snapshot/",
      date.str,
      "/web/packages/packages.rds")
    print(packages.rds)
    download.file(u, packages.rds)
  }
  # Read the cached file and convert it to a data.table, one row per package.
  packages <- readRDS(packages.rds)
  data.table(packages)
}
pkg.dt <- get_packages("2014-09-17")
names(pkg.dt)

##  [1] "Package"                 "Version"                
##  [3] "Priority"                "Depends"                
##  [5] "Imports"                 "LinkingTo"              
##  [7] "Suggests"                "Enhances"               
##  [9] "License"                 "License_is_FOSS"        
## [11] "License_restricts_use"   "OS_type"                
## [13] "Archs"                   "MD5sum"                 
## [15] "NeedsCompilation"        "Authors@R"              
## [17] "Author"                  "BugReports"             
## [19] "Contact"                 "Copyright"              
## [21] "Description"             "Encoding"               
## [23] "Language"                "Maintainer"             
## [25] "Title"                   "URL"                    
## [27] "SystemRequirements"      "Type"                   
## [29] "Path"                    "Classification/ACM"     
## [31] "Classification/JEL"      "Classification/MSC"     
## [33] "Published"               "VignetteBuilder"        
## [35] "Additional_repositories" "Reverse depends"        
## [37] "Reverse imports"         "Reverse linking to"     
## [39] "Reverse suggests"        "Reverse enhances"       
## [41] "MD5sum"

(rev.imports.str <- pkg.dt["data.table", on="Package"][["Reverse imports"]])

## [1] "aLFQ, benford.analysis, Causata, DataCombine, eeptools, FAOSTAT, freqweights, gems, IAT, Kmisc, lar, lllcrc, miscset, optiRum, pxweb, qdapTools, randomNames, RAPIDR, RbioRXN, rbison, rfisheries, rgauges, rgbif, rlist, rnoaa, rplos, SGP, simPH, spocc, sweSCB, taxize, treebase, treemap"

We can see from the output above that the Reverse imports are stored as a text string, each package separated by a comma and space. We can get the reverse imports via a regular expression,

nc::capture_all_str(rev.imports.str, dep.pkg="[^, ]+")

##              dep.pkg
##               <char>
##  1:             aLFQ
##  2: benford.analysis
##  3:          Causata
##  4:      DataCombine
##  5:         eeptools
##  6:          FAOSTAT
##  7:      freqweights
##  8:             gems
##  9:              IAT
## 10:            Kmisc
## 11:              lar
## 12:           lllcrc
## 13:          miscset
## 14:          optiRum
## 15:            pxweb
## 16:        qdapTools
## 17:      randomNames
## 18:           RAPIDR
## 19:          RbioRXN
## 20:           rbison
## 21:       rfisheries
## 22:          rgauges
## 23:            rgbif
## 24:            rlist
## 25:            rnoaa
## 26:            rplos
## 27:              SGP
## 28:            simPH
## 29:            spocc
## 30:           sweSCB
## 31:           taxize
## 32:         treebase
## 33:          treemap
##              dep.pkg

or by splitting on the delimiter,

strsplit(rev.imports.str, ", ")[[1]]

##  [1] "aLFQ"             "benford.analysis" "Causata"          "DataCombine"     
##  [5] "eeptools"         "FAOSTAT"          "freqweights"      "gems"            
##  [9] "IAT"              "Kmisc"            "lar"              "lllcrc"          
## [13] "miscset"          "optiRum"          "pxweb"            "qdapTools"       
## [17] "randomNames"      "RAPIDR"           "RbioRXN"          "rbison"          
## [21] "rfisheries"       "rgauges"          "rgbif"            "rlist"           
## [25] "rnoaa"            "rplos"            "SGP"              "simPH"           
## [29] "spocc"            "sweSCB"           "taxize"           "treebase"        
## [33] "treemap"
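As a rough sketch for readers who do not have the nc package installed, the same regular expression can be applied with base R's regmatches and gregexpr. The input string below is shortened from the output above, and the names by.regex and by.split are only illustrative.

```r
# Reverse imports string, shortened from the output above.
rev.imports.str <- "aLFQ, benford.analysis, Causata, DataCombine"

# Regex approach: every maximal run of characters that are neither
# a comma nor a space is taken as one package name.
by.regex <- regmatches(
  rev.imports.str, gregexpr("[^, ]+", rev.imports.str))[[1]]

# Delimiter approach: split on the literal comma-and-space separator.
by.split <- strsplit(rev.imports.str, ", ", fixed = TRUE)[[1]]

# Both approaches give the same character vector of package names.
stopifnot(identical(by.regex, by.split))
print(by.regex)
```

The two approaches agree here because package names contain neither commas nor spaces. If the separator were ever irregular, the regex version would probably be the more forgiving of the two.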

Here is a function for returning the number of reverse imports for selected packages, for a given packages table:

get_num_rev_imports <- function(packages.dt, some.pkg.names){
  packages.dt[some.pkg.names, .(
    Package,
    n.rev.imports=sapply(strsplit(`Reverse imports`, ", "), length)
  ), on="Package"]
}
compare.pkgs <- c("data.table","dplyr","plyr","tidyr","readr","reshape2")
get_num_rev_imports(pkg.dt, compare.pkgs)

##       Package n.rev.imports
##        <char>         <int>
## 1: data.table            33
## 2:      dplyr            11
## 3:       plyr           167
## 4:      tidyr             1
## 5:      readr             1
## 6:   reshape2            87
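One caveat about the counts above: a value of 1 does not necessarily mean one reverse import. If a package is absent from a snapshot, or its Reverse imports entry is NA, the join returns a missing value, and strsplit on a missing value returns a single NA, which has length 1. This may explain some of the 1s for the newest packages in the earliest snapshots. The sketch below shows the effect, along with a hypothetical NA-aware counter (count_rev_imports is not part of the original analysis).

```r
# strsplit() on a missing value returns a single NA, which has length 1.
missing.count <- sapply(strsplit(NA_character_, ", "), length)

# Hypothetical alternative: count only the non-missing package names.
count_rev_imports <- function(rev.str){
  sapply(
    strsplit(rev.str, ", ", fixed = TRUE),
    function(x) sum(!is.na(x)))
}
counts <- count_rev_imports(c("aLFQ, gems, IAT", NA, "treemap"))
print(missing.count) # 1
print(counts)        # 3 0 1
```

Note that counts of zero would need extra care on a log scale, since log10(0) is not finite.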

Let’s look at the first day of every month since the start of 2015.

date.vec <- seq(as.IDate("2015-01-01"), as.IDate("2022-01-01"), by="month")
(rev.imp.counts <- data.table(date=date.vec)[, {
  date.pkgs <- get_packages(date)
  get_num_rev_imports(date.pkgs, compare.pkgs)
}, by=date])

##            date    Package n.rev.imports
##          <IDat>     <char>         <int>
##   1: 2015-01-01 data.table            39
##   2: 2015-01-01      dplyr            27
##   3: 2015-01-01       plyr           198
##   4: 2015-01-01      tidyr             2
##   5: 2015-01-01      readr             1
##  ---                                    
## 506: 2022-01-01      dplyr          2750
## 507: 2022-01-01       plyr           822
## 508: 2022-01-01      tidyr          1296
## 509: 2022-01-01      readr           473
## 510: 2022-01-01   reshape2           745
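As a quick sanity check on the size of this result, the arithmetic can be sketched as follows. It uses base R's as.Date rather than data.table's as.IDate, only to keep the sketch free of dependencies, and the variable names n.dates, n.pkgs and n.rows are illustrative.

```r
# First day of each month from January 2015 through January 2022.
date.vec <- seq(as.Date("2015-01-01"), as.Date("2022-01-01"), by = "month")
n.dates <- length(date.vec)  # 7 years * 12 months + 1 = 85 dates
n.pkgs <- 6                  # number of packages in compare.pkgs
n.rows <- n.dates * n.pkgs   # 510, the last row number printed above
print(c(n.dates = n.dates, n.pkgs = n.pkgs, n.rows = n.rows))
```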

Finally we can plug these count data into a ggplot,

library(ggplot2)
expand.days <- 11*30
gg <- ggplot()+
  theme_bw()+
  scale_color_manual(values=c(
    plyr="grey50",
    reshape2="grey50",
    data.table="blue",
    tidyr="red",
    readr="red",
    dplyr="red"))+
  scale_x_date(
    breaks=seq(min(date.vec), max(date.vec), by="year"),
    limits=as.IDate(c(
      min(date.vec)-expand.days,
      max(date.vec)+expand.days
    )))
gg.imports <- gg+
  scale_y_log10()+
  geom_line(aes(
    date, n.rev.imports, color=Package),
    size=1,
    data=rev.imp.counts)
directlabels::direct.label(
  gg.imports,
  directlabels::dl.combine("left.polygons", "right.polygons"))

[Figure: number of reverse imports (log scale) versus date, one line per package, labeled at both ends]

We can see in the figure above that at the beginning of this time period, data.table had more reverse imports than readr/dplyr/tidyr (red), but fewer than plyr/reshape2 (grey). At the end of this time period the pattern is largely reversed: data.table has more reverse imports than plyr/reshape2, but fewer than dplyr and tidyr (readr, with 473 reverse imports in January 2022, is still below plyr and reshape2). That makes sense because plyr/reshape2 are now deprecated, although it is actually surprising to see their reverse imports still increasing over time. The number of reverse imports of dplyr surpassed that of data.table in 2015, and the same happened for tidyr in 2021.
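The dates at which one package overtakes another can be read off the plot, but they could also be computed from a table shaped like rev.imp.counts. The following is a minimal base R sketch under that assumption. The function first_surpass_date and the toy counts are hypothetical, and the numbers are made up for illustration rather than taken from CRAN.

```r
# Hypothetical counts in the same long format as rev.imp.counts
# (made-up numbers for illustration, not CRAN data).
toy.counts <- data.frame(
  date = rep(as.Date(c("2015-01-01", "2015-02-01", "2015-03-01")), times = 2),
  Package = rep(c("A", "B"), each = 3),
  n.rev.imports = c(
    10, 12, 14,  # package A
    5, 12, 20))  # package B

# First date at which `challenger` has strictly more reverse imports
# than `incumbent`; NA if that never happens.
first_surpass_date <- function(counts, challenger, incumbent){
  ch <- counts[counts$Package == challenger, c("date", "n.rev.imports")]
  inc <- counts[counts$Package == incumbent, c("date", "n.rev.imports")]
  # One row per date, with one count column per package.
  both <- merge(ch, inc, by = "date", suffixes = c(".ch", ".inc"))
  both <- both[order(both$date), ]
  ahead <- both$date[both$n.rev.imports.ch > both$n.rev.imports.inc]
  if(length(ahead) == 0) as.Date(NA) else ahead[1]
}
print(first_surpass_date(toy.counts, "B", "A"))
```

With the real data, the call would presumably look like first_surpass_date(rev.imp.counts, "dplyr", "data.table"), since a data.table also supports the data.frame-style subsetting used here.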

What about the rate of change each month? Overall each is clearly increasing, but which is increasing the most?

(rev.imp.diff <- rev.imp.counts[, .(
  diff.rev.imports=diff(n.rev.imports),
  date=date[-1]
), by=Package])

##         Package diff.rev.imports       date
##          <char>            <int>     <IDat>
##   1: data.table                4 2015-02-01
##   2: data.table                1 2015-03-01
##   3: data.table                1 2015-04-01
##   4: data.table                6 2015-05-01
##   5: data.table                9 2015-06-01
##  ---                                       
## 500:   reshape2                7 2021-09-01
## 501:   reshape2               12 2021-10-01
## 502:   reshape2               12 2021-11-01
## 503:   reshape2                9 2021-12-01
## 504:   reshape2                4 2022-01-01
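To see how diff relates to the counts, here is a small worked example using only numbers printed above: the data.table count on 2015-01-01 (39) and its first five monthly changes. The variable names are illustrative.

```r
# data.table count on 2015-01-01, and the five monthly changes printed above.
first.count <- 39
monthly.change <- c(4, 1, 1, 6, 9)

# cumsum() undoes diff(): implied counts for 2015-01-01 through 2015-06-01.
counts <- cumsum(c(first.count, monthly.change))
print(counts)

# diff() returns one fewer value than its input, which is why the code
# above pairs the differences with date[-1].
stopifnot(identical(diff(counts), monthly.change))
stopifnot(length(diff(counts)) == length(counts) - 1)
```

The same reasoning accounts for the row count: each of the 6 packages loses one row, so the 510 rows of counts become 504 rows of differences.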

gg.diff <- gg+
  geom_line(aes(
    date, diff.rev.imports, color=Package),
    size=1,
    data=rev.imp.diff)+
  scale_y_continuous(
    "Monthly change in number of reverse imports",
    breaks=seq(0, 300, by=20))+
  coord_cartesian(ylim=c(-5, 100))
directlabels::direct.label(gg.diff, "right.polygons")

[Figure: monthly change in number of reverse imports versus date, one line per package]

The figure above shows that the number of new reverse imports per month ranges from a handful to about 100, with some variation over packages and over time.

Overall this analysis shows two interesting trends. First, all six packages, even the deprecated ones (plyr and reshape2), gained reverse imports over time. Second, the tidyverse packages had fewer reverse imports than data.table at the start of 2015, but dplyr (in 2015) and tidyr (in 2021) have since overtaken it.

Exercise for the reader: modify the code above to perform the same analysis for package properties other than Reverse imports (for example, Reverse depends and Reverse suggests). Plot each time series in a different panel/facet of a ggplot. Hint: use .SDcols with sapply as below!

some.cols <- c("Reverse imports", "Reverse depends", "Reverse suggests")
pkg.dt[compare.pkgs, .(
  dep.type=some.cols,
  n.rev.deps=sapply(.SD, function(x)length(strsplit(x, ", ")[[1]]))
), .SDcols=some.cols, by=.EACHI, on="Package"]

##        Package         dep.type n.rev.deps
##         <char>           <char>      <int>
##  1: data.table  Reverse imports         33
##  2: data.table  Reverse depends         13
##  3: data.table Reverse suggests          6
##  4:      dplyr  Reverse imports         11
##  5:      dplyr  Reverse depends          3
##  6:      dplyr Reverse suggests          5
##  7:       plyr  Reverse imports        167
##  8:       plyr  Reverse depends         50
##  9:       plyr Reverse suggests         40
## 10:      tidyr  Reverse imports          1
## 11:      tidyr  Reverse depends          1
## 12:      tidyr Reverse suggests          1
## 13:      readr  Reverse imports          1
## 14:      readr  Reverse depends          1
## 15:      readr Reverse suggests          1
## 16:   reshape2  Reverse imports         87
## 17:   reshape2  Reverse depends         20
## 18:   reshape2 Reverse suggests         26