data.table CRAN diffs
As a part of my NSF POSE funded project about expanding the ecosystem
of users and contributors around the data.table
package in R, it
will be important to create tools that increase developer confidence
in the release process. One tool that could be useful is monitoring
the CRAN code to see if it is consistent with the corresponding tag in
the github repo. In this post I aim to create a proof of concept.
Getting old versions from CRAN archive
The data.table cran archive is
https://cran.r-project.org/src/contrib/Archive/data.table/
A previous post explained how to parse the Archive, which we do using the code below:
R.dir <- "~/R"
Archive.dir <- file.path(R.dir, "Archive")
dir.create(Archive.dir, showWarnings = FALSE)
Archive.url <- "https://cloud.r-project.org/src/contrib/Archive/"
get_Archive <- function(Package){
pkg.html <- file.path(Archive.dir, paste0(Package, ".html"))
if(!file.exists(pkg.html)){
u <- paste0(Archive.url, Package)
download.file(u, pkg.html)
}
nc::capture_all_str(
pkg.html,
tar.gz=list(
paste0(Package, "_"),
version="[0-9.-]+",
"[.]tar[.]gz"),
"</a>\\s+",
date.str=".*?",
"\\s"
)[order(date.str)]
}
(Archive.dt <- get_Archive("data.table"))
## tar.gz version date.str
## 1: data.table_1.0.tar.gz 1.0 2006-04-14
## 2: data.table_1.1.tar.gz 1.1 2008-08-27
## 3: data.table_1.2.tar.gz 1.2 2008-09-01
## 4: data.table_1.4.1.tar.gz 1.4.1 2010-05-03
## 5: data.table_1.5.tar.gz 1.5 2010-09-14
## 6: data.table_1.5.1.tar.gz 1.5.1 2011-01-08
## 7: data.table_1.5.2.tar.gz 1.5.2 2011-01-21
## 8: data.table_1.5.3.tar.gz 1.5.3 2011-02-11
## 9: data.table_1.6.tar.gz 1.6 2011-04-24
## 10: data.table_1.6.1.tar.gz 1.6.1 2011-06-29
## 11: data.table_1.6.2.tar.gz 1.6.2 2011-07-02
## 12: data.table_1.6.3.tar.gz 1.6.3 2011-08-04
## 13: data.table_1.6.4.tar.gz 1.6.4 2011-08-10
## 14: data.table_1.6.5.tar.gz 1.6.5 2011-08-25
## 15: data.table_1.6.6.tar.gz 1.6.6 2011-08-25
## 16: data.table_1.7.1.tar.gz 1.7.1 2011-10-22
## 17: data.table_1.7.2.tar.gz 1.7.2 2011-11-07
## 18: data.table_1.7.3.tar.gz 1.7.3 2011-11-25
## 19: data.table_1.7.4.tar.gz 1.7.4 2011-11-29
## 20: data.table_1.7.5.tar.gz 1.7.5 2011-12-04
## 21: data.table_1.7.6.tar.gz 1.7.6 2011-12-13
## 22: data.table_1.7.7.tar.gz 1.7.7 2011-12-15
## 23: data.table_1.7.8.tar.gz 1.7.8 2012-01-25
## 24: data.table_1.7.9.tar.gz 1.7.9 2012-01-31
## 25: data.table_1.7.10.tar.gz 1.7.10 2012-02-07
## 26: data.table_1.8.0.tar.gz 1.8.0 2012-07-16
## 27: data.table_1.8.2.tar.gz 1.8.2 2012-07-17
## 28: data.table_1.8.4.tar.gz 1.8.4 2012-11-09
## 29: data.table_1.8.6.tar.gz 1.8.6 2012-11-13
## 30: data.table_1.8.8.tar.gz 1.8.8 2013-03-06
## 31: data.table_1.8.10.tar.gz 1.8.10 2013-09-03
## 32: data.table_1.9.2.tar.gz 1.9.2 2014-02-27
## 33: data.table_1.9.4.tar.gz 1.9.4 2014-10-02
## 34: data.table_1.9.6.tar.gz 1.9.6 2015-09-19
## 35: data.table_1.9.8.tar.gz 1.9.8 2016-11-25
## 36: data.table_1.10.0.tar.gz 1.10.0 2016-12-03
## 37: data.table_1.10.2.tar.gz 1.10.2 2017-01-31
## 38: data.table_1.10.4.tar.gz 1.10.4 2017-02-01
## 39: data.table_1.10.4-1.tar.gz 1.10.4-1 2017-10-09
## 40: data.table_1.10.4-2.tar.gz 1.10.4-2 2017-10-12
## 41: data.table_1.10.4-3.tar.gz 1.10.4-3 2017-10-27
## 42: data.table_1.11.0.tar.gz 1.11.0 2018-05-01
## 43: data.table_1.11.2.tar.gz 1.11.2 2018-05-08
## 44: data.table_1.11.4.tar.gz 1.11.4 2018-05-27
## 45: data.table_1.11.6.tar.gz 1.11.6 2018-09-19
## 46: data.table_1.11.8.tar.gz 1.11.8 2018-09-30
## 47: data.table_1.12.0.tar.gz 1.12.0 2019-01-13
## 48: data.table_1.12.2.tar.gz 1.12.2 2019-04-07
## 49: data.table_1.12.4.tar.gz 1.12.4 2019-10-03
## 50: data.table_1.12.6.tar.gz 1.12.6 2019-10-18
## 51: data.table_1.12.8.tar.gz 1.12.8 2019-12-09
## 52: data.table_1.13.0.tar.gz 1.13.0 2020-07-24
## 53: data.table_1.13.2.tar.gz 1.13.2 2020-10-19
## 54: data.table_1.13.4.tar.gz 1.13.4 2020-12-08
## 55: data.table_1.13.6.tar.gz 1.13.6 2020-12-30
## 56: data.table_1.14.0.tar.gz 1.14.0 2021-02-21
## 57: data.table_1.14.2.tar.gz 1.14.2 2021-09-27
## 58: data.table_1.14.4.tar.gz 1.14.4 2022-10-17
## 59: data.table_1.14.6.tar.gz 1.14.6 2022-11-16
## tar.gz version date.str
The table above contains one row for every CRAN release of
data.table
.
github repo tags
Which of the versions above has a corresponding tag on github? First we get all the tags in the github repo:
dt.dir <- file.path(R.dir, "data.table")
(tag.tib <- gert::git_tag_list(repo=dt.dir))
## # A tibble: 61 × 3
## name ref commit
## * <chr> <chr> <chr>
## 1 1.10.0 refs/tags/1.10.0 81cf17e3c28d22dad818db22dafea3f7a830be2d
## 2 1.10.2 refs/tags/1.10.2 b40ec30b61a54040acfcd6ce62b17b43ca8cb272
## 3 1.10.4 refs/tags/1.10.4 8b201fd28f5d4afcc4be026a5d9eb4bb6dd62955
## 4 1.11.0 refs/tags/1.11.0 d59c486d80b1afb57fa898755f04b5f0809cb07f
## 5 1.11.2 refs/tags/1.11.2 18ebbf60f3f8d1d136bfccc53b993003eb757608
## 6 1.11.4 refs/tags/1.11.4 06d055b7295838925890a958c30eb76be9510895
## 7 1.11.6 refs/tags/1.11.6 a4e26b50beaf0bb2aac40bbf47f9d1745579154a
## 8 1.11.8 refs/tags/1.11.8 76bb569fd7736b5f89471a35357e6a971ae1d424
## 9 1.12.0 refs/tags/1.12.0 34796cd1524828df9bf13a174265cb68a09fcd77
## 10 1.12.2 refs/tags/1.12.2 86034855f9b305e948d83014af89352fc42e27f2
## # ℹ 51 more rows
intersection of tags and CRAN versions
The table below shows what versions are different between the github tags and CRAN:
library(data.table)
tag.dt <- data.table(tag.tib)[, .(version=name)]
setkey(tag.dt, version)
setkey(Archive.dt, version)
tag.dt[!Archive.dt]
## version
## 1: 1.14.8
## 2: 1.3
## 3: 1.4
## 4: 1.5.4
## 5: 1.6.7
## 6: 1.7.0
## 7: 1.9.0
Archive.dt[!tag.dt]
## tar.gz version date.str
## 1: data.table_1.0.tar.gz 1.0 2006-04-14
## 2: data.table_1.1.tar.gz 1.1 2008-08-27
## 3: data.table_1.10.4-1.tar.gz 1.10.4-1 2017-10-09
## 4: data.table_1.10.4-2.tar.gz 1.10.4-2 2017-10-12
## 5: data.table_1.10.4-3.tar.gz 1.10.4-3 2017-10-27
The tables above show that there are currently 7 tags which were never submitted to CRAN, and there are 5 CRAN versions which never got a tag. For this analysis, I am interested in the intersection (versions with both a tag on github, and a release on CRAN):
(both.dt <- tag.dt[Archive.dt, nomatch=0L])
## version tar.gz date.str
## 1: 1.10.0 data.table_1.10.0.tar.gz 2016-12-03
## 2: 1.10.2 data.table_1.10.2.tar.gz 2017-01-31
## 3: 1.10.4 data.table_1.10.4.tar.gz 2017-02-01
## 4: 1.11.0 data.table_1.11.0.tar.gz 2018-05-01
## 5: 1.11.2 data.table_1.11.2.tar.gz 2018-05-08
## 6: 1.11.4 data.table_1.11.4.tar.gz 2018-05-27
## 7: 1.11.6 data.table_1.11.6.tar.gz 2018-09-19
## 8: 1.11.8 data.table_1.11.8.tar.gz 2018-09-30
## 9: 1.12.0 data.table_1.12.0.tar.gz 2019-01-13
## 10: 1.12.2 data.table_1.12.2.tar.gz 2019-04-07
## 11: 1.12.4 data.table_1.12.4.tar.gz 2019-10-03
## 12: 1.12.6 data.table_1.12.6.tar.gz 2019-10-18
## 13: 1.12.8 data.table_1.12.8.tar.gz 2019-12-09
## 14: 1.13.0 data.table_1.13.0.tar.gz 2020-07-24
## 15: 1.13.2 data.table_1.13.2.tar.gz 2020-10-19
## 16: 1.13.4 data.table_1.13.4.tar.gz 2020-12-08
## 17: 1.13.6 data.table_1.13.6.tar.gz 2020-12-30
## 18: 1.14.0 data.table_1.14.0.tar.gz 2021-02-21
## 19: 1.14.2 data.table_1.14.2.tar.gz 2021-09-27
## 20: 1.14.4 data.table_1.14.4.tar.gz 2022-10-17
## 21: 1.14.6 data.table_1.14.6.tar.gz 2022-11-16
## 22: 1.2 data.table_1.2.tar.gz 2008-09-01
## 23: 1.4.1 data.table_1.4.1.tar.gz 2010-05-03
## 24: 1.5 data.table_1.5.tar.gz 2010-09-14
## 25: 1.5.1 data.table_1.5.1.tar.gz 2011-01-08
## 26: 1.5.2 data.table_1.5.2.tar.gz 2011-01-21
## 27: 1.5.3 data.table_1.5.3.tar.gz 2011-02-11
## 28: 1.6 data.table_1.6.tar.gz 2011-04-24
## 29: 1.6.1 data.table_1.6.1.tar.gz 2011-06-29
## 30: 1.6.2 data.table_1.6.2.tar.gz 2011-07-02
## 31: 1.6.3 data.table_1.6.3.tar.gz 2011-08-04
## 32: 1.6.4 data.table_1.6.4.tar.gz 2011-08-10
## 33: 1.6.5 data.table_1.6.5.tar.gz 2011-08-25
## 34: 1.6.6 data.table_1.6.6.tar.gz 2011-08-25
## 35: 1.7.1 data.table_1.7.1.tar.gz 2011-10-22
## 36: 1.7.10 data.table_1.7.10.tar.gz 2012-02-07
## 37: 1.7.2 data.table_1.7.2.tar.gz 2011-11-07
## 38: 1.7.3 data.table_1.7.3.tar.gz 2011-11-25
## 39: 1.7.4 data.table_1.7.4.tar.gz 2011-11-29
## 40: 1.7.5 data.table_1.7.5.tar.gz 2011-12-04
## 41: 1.7.6 data.table_1.7.6.tar.gz 2011-12-13
## 42: 1.7.7 data.table_1.7.7.tar.gz 2011-12-15
## 43: 1.7.8 data.table_1.7.8.tar.gz 2012-01-25
## 44: 1.7.9 data.table_1.7.9.tar.gz 2012-01-31
## 45: 1.8.0 data.table_1.8.0.tar.gz 2012-07-16
## 46: 1.8.10 data.table_1.8.10.tar.gz 2013-09-03
## 47: 1.8.2 data.table_1.8.2.tar.gz 2012-07-17
## 48: 1.8.4 data.table_1.8.4.tar.gz 2012-11-09
## 49: 1.8.6 data.table_1.8.6.tar.gz 2012-11-13
## 50: 1.8.8 data.table_1.8.8.tar.gz 2013-03-06
## 51: 1.9.2 data.table_1.9.2.tar.gz 2014-02-27
## 52: 1.9.4 data.table_1.9.4.tar.gz 2014-10-02
## 53: 1.9.6 data.table_1.9.6.tar.gz 2015-09-19
## 54: 1.9.8 data.table_1.9.8.tar.gz 2016-11-25
## version tar.gz date.str
The table above shows the versions in the intersection.
git status and diff
Below we do the following steps for each version:
- clone github repo and checkout that version/tag,
- download source package from CRAN, then untar it so that the files over-write the files in the github clone,
- run git status and git diff to see what files have changed or have been added.
versions.dir <- file.path(R.dir,"data.table-versions")
versions.dt.dir <- file.path(versions.dir, "data.table")
diff.csv <- file.path(versions.dir,"diff.csv")
status.csv <- file.path(versions.dir,"status.csv")
dir.create(versions.dir, showWarnings = FALSE)
if(file.exists(diff.csv)){
diff.dt <- fread(diff.csv)
status.dt <- fread(status.csv)
}else{
diff.dt.list <- list()
status.dt.list <- list()
for(version.i in 1:nrow(both.dt)){
version.row <- both.dt[version.i]
system(paste("cd",versions.dir,"&& git clone",dt.dir))
system(paste("cd",versions.dt.dir,"&& git checkout",version.row$version))
version.url <- paste0(Archive.url,"data.table/",version.row$tar.gz)
version.tar.gz <- file.path(versions.dir,version.row$tar.gz)
if(!file.exists(version.tar.gz))download.file(version.url, version.tar.gz)
system(paste("cd",versions.dir,"&& tar xf",version.row$tar.gz))
meta.row <- version.row[,.(version)]
status.dt.list[[version.i]] <- data.table(
meta.row, gert::git_status(repo=versions.dt.dir))
diff.dt.list[[version.i]] <- data.table(
meta.row, gert::git_diff(repo=versions.dt.dir))
dest.dir <- file.path(versions.dir,version.row$version)
unlink(dest.dir,recursive = TRUE)
file.rename(versions.dt.dir,dest.dir)
}
diff.dt <- rbindlist(diff.dt.list)
status.dt <- rbindlist(status.dt.list)
fwrite(diff.dt, diff.csv)
fwrite(status.dt, status.csv)
}
Below we look at the status of the most recent dot zero versions:
status.dt[version=="1.14.0"]
## version file status staged
## 1: 1.14.0 build/ new FALSE
## 2: 1.14.0 DESCRIPTION modified FALSE
## 3: 1.14.0 inst/cc new FALSE
## 4: 1.14.0 inst/doc/ new FALSE
## 5: 1.14.0 inst/tests/benchmark.Rraw.bz2 new FALSE
## 6: 1.14.0 inst/tests/froll.Rraw.bz2 new FALSE
## 7: 1.14.0 inst/tests/nafill.Rraw.bz2 new FALSE
## 8: 1.14.0 inst/tests/other.Rraw.bz2 new FALSE
## 9: 1.14.0 inst/tests/tests.Rraw.bz2 new FALSE
## 10: 1.14.0 inst/tests/types.Rraw.bz2 new FALSE
## 11: 1.14.0 MD5 new FALSE
## 12: 1.14.0 NEWS.md modified FALSE
## 13: 1.14.0 src/init.c modified FALSE
status.dt[version=="1.13.0"]
## version file status staged
## 1: 1.13.0 build/ new FALSE
## 2: 1.13.0 DESCRIPTION modified FALSE
## 3: 1.13.0 inst/cc new FALSE
## 4: 1.13.0 inst/doc/ new FALSE
## 5: 1.13.0 inst/tests/benchmark.Rraw.bz2 new FALSE
## 6: 1.13.0 inst/tests/froll.Rraw.bz2 new FALSE
## 7: 1.13.0 inst/tests/nafill.Rraw.bz2 new FALSE
## 8: 1.13.0 inst/tests/other.Rraw.bz2 new FALSE
## 9: 1.13.0 inst/tests/tests.Rraw.bz2 new FALSE
## 10: 1.13.0 inst/tests/types.Rraw.bz2 new FALSE
## 11: 1.13.0 MD5 new FALSE
## 12: 1.13.0 NEWS.md modified FALSE
## 13: 1.13.0 src/init.c modified FALSE
status.dt[version=="1.12.0"]
## version file status staged
## 1: 1.12.0 build/ new FALSE
## 2: 1.12.0 DESCRIPTION modified FALSE
## 3: 1.12.0 inst/doc/ new FALSE
## 4: 1.12.0 MD5 new FALSE
## 5: 1.12.0 NEWS.md modified FALSE
## 6: 1.12.0 src/init.c modified FALSE
status.dt[version=="1.11.0"]
## version file status staged
## 1: 1.11.0 build/ new FALSE
## 2: 1.11.0 DESCRIPTION modified FALSE
## 3: 1.11.0 inst/doc/ new FALSE
## 4: 1.11.0 MD5 new FALSE
## 5: 1.11.0 NEWS.md modified FALSE
Among the files which have been modified, init.c
is the only
code. How many modifications to C and R files are there over all
versions?
status.dt[grepl("[cR]$",file)&status=="modified"]
## version file status staged
## 1: 1.2 R/between.R modified FALSE
## 2: 1.2 R/c.factor.R modified FALSE
## 3: 1.2 R/data.table.R modified FALSE
## 4: 1.2 R/getdots.R modified FALSE
## 5: 1.2 R/last.R modified FALSE
## 6: 1.2 R/like.R modified FALSE
## 7: 1.2 R/plus.R modified FALSE
## 8: 1.2 R/setkey.R modified FALSE
## 9: 1.2 R/tables.R modified FALSE
## 10: 1.2 R/take.R modified FALSE
## 11: 1.2 R/test.data.table.R modified FALSE
## 12: 1.2 R/time.taken.R modified FALSE
## 13: 1.2 R/trim.R modified FALSE
## 14: 1.2 R/which.first.R modified FALSE
## 15: 1.2 R/which.last.R modified FALSE
## 16: 1.2 R/zzz.R modified FALSE
## 17: 1.2 src/duplist.c modified FALSE
## 18: 1.2 src/sortedmatch.c modified FALSE
## 19: 1.2 src/vecref.c modified FALSE
## 20: 1.11.4 R/setkey.R modified FALSE
## 21: 1.11.8 R/fcast.R modified FALSE
## 22: 1.11.8 R/fmelt.R modified FALSE
## 23: 1.11.8 R/fread.R modified FALSE
## 24: 1.12.0 src/init.c modified FALSE
## 25: 1.12.2 src/init.c modified FALSE
## 26: 1.12.4 src/init.c modified FALSE
## 27: 1.12.6 src/init.c modified FALSE
## 28: 1.12.8 src/init.c modified FALSE
## 29: 1.13.0 src/init.c modified FALSE
## 30: 1.13.2 src/init.c modified FALSE
## 31: 1.13.4 src/init.c modified FALSE
## 32: 1.13.6 src/init.c modified FALSE
## 33: 1.14.0 src/init.c modified FALSE
## version file status staged
The table above has 33 rows for modified code files. 19 of them are in version 1.2 (very old). Four rows are 1.11.* R code files. The rest are init.c modifications in 1.12.0 and newer.
init.dt <- diff.dt[new=="src/init.c"]
cat(init.dt[1, patch])
## diff --git a/src/init.c b/src/init.c
## index a5f6d4a..0b6c9ed 100644
## --- a/src/init.c
## +++ b/src/init.c
## @@ -326,6 +326,6 @@ SEXP hasOpenMP() {
##
## SEXP dllVersion() {
## // .onLoad calls this and checks the same as packageVersion() to ensure no R/C version mismatch, #3056
## - return(ScalarString(mkChar(""1.12.1"")));
## + return(ScalarString(mkChar(""1.12.0"")));
## }
##
cat(init.dt[.N, patch])
## diff --git a/src/init.c b/src/init.c
## index 714608c..0b442f0 100644
## --- a/src/init.c
## +++ b/src/init.c
## @@ -414,6 +414,6 @@ SEXP initLastUpdated(SEXP var) {
##
## SEXP dllVersion() {
## // .onLoad calls this and checks the same as packageVersion() to ensure no R/C version mismatch, #3056
## - return(ScalarString(mkChar(""1.14.1"")));
## + return(ScalarString(mkChar(""1.14.0"")));
## }
##
The above output shows that the only difference is the version number
(+ 1.14.0 is from CRAN, -1.14.1 is from github). This is consistent
with what Matt wrote in CRAN_Release.cmd
:
“DO NOT commit or push to GitHub. Leave 4 files
(.dev/CRAN_Release.cmd, DESCRIPTION, NEWS and init.c) edited and not
committed. Include these in a single and final bump commit below. DO
NOT even use a PR. Because PRs build binaries and we don’t want any
binary versions of even release numbers available from anywhere other
than CRAN… Bump dllVersion() in init.c … Push to master … run
git tag 1.14.8 96c..sha..d77
then git push origin 1.14.8
”
Conclusions
This code can be used to automatically create a report of differences and files added in CRAN, with respect to the corresponding github tag. This analysis may be useful for encouraging developer confidence/trust in the CRAN maintainer.