Parsing check logs using regular expressions
The goal of this blog post is to explain how to use nc to parse CRAN check log files, as we did in data.table-revdeps issue #7, to support our NSF POSE-funded project about expanding the open-source ecosystem around R data.table.
Example data to parse
For our project we want to create a system that automatically checks reverse dependencies (revdeps) of data.table, meaning all the other R packages which depend on it (or import it, etc.). The system checks each revdep using two versions of data.table (GitHub master and CRAN release), and then reports any differences in check results. If there is a failure in the revdep check using data.table GitHub master, but not using the CRAN release, then we know there is some revdep issue which should be addressed before sending the current GitHub version to CRAN.
However, if a revdep does not have its dependencies available at check time, then the check will always fail with an ERROR, or skip some checks (for both versions of data.table), thereby creating the possibility of a false negative (there could be a significant difference due to the new code on GitHub, but we would not be able to detect it). So ideally the system should also detect and report these unavailable dependencies, which appear in check results as in the example below,
some.check.out <- "
* this is package 'scoper' version '1.3.0'
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... ERROR
Packages required but not available: 'alakazam', 'shazam'
See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.
* DONE
Status: 1 ERROR
* this is package 'margins' version '0.3.26'
* checking package namespace information ... OK
* checking package dependencies ... NOTE
Packages which this enhances but not available for checking:
'AER', 'betareg', 'ordinal', 'survey'
* checking if this is a source package ... OK
* checking package dependencies ... ERROR
Package required but not available: 'Rcmdr'
Package suggested but not available for checking: 'tkrplot'
See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.
* checking package dependencies ... ERROR
Packages required but not available:
'adjclust', 'BiocGenerics', 'csaw', 'InteractionSet', 'limma',
'SummarizedExperiment', 'HiCDOC'
* checking package dependencies ... NOTE
Package suggested but not available for checking: 'gWidgets2tcltk'
* checking if this is a source package ... OK
* checking package dependencies ... ERROR
Packages required but not available: 'maftools', 'NMF'
Packages suggested but not available for checking:
'Biobase', 'Biostrings', 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19',
'GenomicRanges', 'GenSA', 'IRanges'
See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.
* DONE
"
How can we parse the wide variety of data in the text above? We would like to extract
- the type of dependency (required/suggested/enhances), which always appears before the phrase “but not available”, and
- the names of the dependent packages, which always appear inside single quotes, after a colon (possibly spread over several lines).
Since the data are regularly structured text, we can parse them using regular expressions (regex).
Parsing using nc
nc is my R package for named capture regex, which we will use for this text parsing task. First, the code below defines a regex to capture the dependency type,
type.not.avail.pattern <- list(
  type='suggested|enhances|required',
  ' but not available')
When using the nc package, we define a regex as a list.
- Values in the list are concatenated to form the regex, and
- each named list element becomes a capture group, and the name used in R code becomes the column name in the resulting data table.
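For intuition, the pattern list above corresponds roughly to a single regex with one named capture group, something like the base R sketch below (an illustration of the idea only, not the exact string that nc builds internally),
# rough base R equivalent: one named capture group, concatenated with a literal.
flat.regex <- "(?<type>suggested|enhances|required) but not available"
regmatches(some.check.out, gregexpr(flat.regex, some.check.out, perl=TRUE))[[1]]
The advantage of nc is that it builds the regex and returns the captured groups as a data table, rather than leaving us to post-process match positions.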
We use the regex in the code below to parse the dependency types,
nc::capture_all_str(some.check.out, type.not.avail.pattern)
## type
## <char>
## 1: required
## 2: enhances
## 3: required
## 4: suggested
## 5: required
## 6: suggested
## 7: required
## 8: suggested
The output above is a data table with one row for every match, and one column for every capture group (here only one, type).
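Since the result is a data table, the usual data.table operations apply directly; for example, we could count how many messages of each type appear (an extra check, not part of the original analysis),
nc::capture_all_str(some.check.out, type.not.avail.pattern)[, .N, by=type]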
One advantage of the nc package is that it makes it easy to build complex regex from simple pieces.
For example, consider the code below, which starts with the previous regex, then adds another group, before.colon, and matches everything up to the colon:
up.to.colon.pattern <- list(
  type.not.avail.pattern,
  before.colon='.*?',
  ':')
nc::capture_all_str(some.check.out, up.to.colon.pattern)
## type before.colon
## <char> <char>
## 1: required
## 2: enhances for checking
## 3: required
## 4: suggested for checking
## 5: required
## 6: suggested for checking
## 7: required
## 8: suggested for checking
The output above has a new column, before.colon, with the text captured before the colon.
Below we define a new regex that captures the text after the colon: one or more lines (non-greedy), up to the next line that starts with a star or is empty.
one.or.more.lines.non.greedy <- '(?:.*\n)+?'
up.to.deps.pattern <- list(
  up.to.colon.pattern,
  deps=one.or.more.lines.non.greedy,
  "[*|\n]")
(some.check.dt <- nc::capture_all_str(some.check.out, up.to.deps.pattern))
## type before.colon
## <char> <char>
## 1: required
## 2: enhances for checking
## 3: required
## 4: suggested for checking
## 5: required
## 6: suggested for checking
## 7: required
## 8: suggested for checking
## deps
## <char>
## 1: 'alakazam', 'shazam'\n
## 2: \n 'AER', 'betareg', 'ordinal', 'survey'\n
## 3: 'Rcmdr'\n
## 4: 'tkrplot'\n
## 5: \n 'adjclust', 'BiocGenerics', 'csaw', 'InteractionSet', 'limma',\n 'SummarizedExperiment', 'HiCDOC'\n
## 6: 'gWidgets2tcltk'\n
## 7: 'maftools', 'NMF'\n
## 8: \n 'Biobase', 'Biostrings', 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19',\n 'GenomicRanges', 'GenSA', 'IRanges'\n
The output above contains a new column deps with all of the text (possibly spanning several lines) that contains the dependent package names. Another way to view the dependent packages is as a character vector, shown below,
some.check.dt[["deps"]]
## [1] " 'alakazam', 'shazam'\n"
## [2] "\n 'AER', 'betareg', 'ordinal', 'survey'\n"
## [3] " 'Rcmdr'\n"
## [4] " 'tkrplot'\n"
## [5] "\n 'adjclust', 'BiocGenerics', 'csaw', 'InteractionSet', 'limma',\n 'SummarizedExperiment', 'HiCDOC'\n"
## [6] " 'gWidgets2tcltk'\n"
## [7] " 'maftools', 'NMF'\n"
## [8] "\n 'Biobase', 'Biostrings', 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19',\n 'GenomicRanges', 'GenSA', 'IRanges'\n"
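As an aside, the non-greedy +? quantifier in the regex above is what makes each match stop at the first star line; a greedy + would run on to a later one. A tiny sketch on made-up text (not from the check logs) illustrates the difference,
tiny.txt <- "a\n* one\nb\n* two\n"
# non-greedy: each match should stop at the next "*" line.
nc::capture_all_str(tiny.txt, lines='(?:.*\n)+?', "[*]")$lines
# greedy: the first match should run on to the last "*" line.
nc::capture_all_str(tiny.txt, lines='(?:.*\n)+', "[*]")$lines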
Downloading log files
In this section, we download some log files which we can analyze using the approach in the previous section. First in the code below, we define a local directory to save the log files,
local.dir <- "~/teaching/regex-tutorial/cran-check-logs"
dir.create(local.dir, showWarnings = FALSE)
Note that the directory defined above is actually in a clone of my regex-tutorial repo (which has a copy of these data). Then in the code below, we download a CSV summary of checks from the revdep check server:
analyze.url <- "https://rcdata.nau.edu/genomic-ml/data.table-revdeps/analyze/"
remote.url.prefix <- paste0(
  analyze.url,
  "2024-01-22", #strftime(Sys.time(), "%Y-%m-%d"),
  "/")
remote.csv <- paste0(remote.url.prefix, "full_list_of_jobs.csv")
(jobs.dt <- data.table::fread(remote.csv))
## task time MB State Package sig.diffs not.avail config.fail dl.fail
## <int> <char> <int> <char> <char> <int> <int> <int> <int>
## 1: 1 00:03:32 291 COMPLETED Ac3net 0 0 0 0
## 2: 2 00:06:04 558 COMPLETED accessibility 0 0 0 0
## 3: 3 00:10:47 443 COMPLETED acdcR 0 0 0 0
## 4: 4 00:08:20 389 COMPLETED Achilles 0 0 0 0
## 5: 5 00:11:11 478 COMPLETED actel 0 1 1 0
## ---
## 1463: 1463 00:02:28 207 COMPLETED youngSwimmers 0 0 0 0
## 1464: 1464 00:05:08 549 COMPLETED zebu 0 0 0 0
## 1465: 1465 00:07:37 586 COMPLETED zeitgebr 0 0 0 0
## 1466: 1466 00:05:51 413 COMPLETED ZIprop 0 0 0 0
## 1467: 1467 00:57:27 1283 COMPLETED zoomGroupStats 0 0 0 0
The table above contains one row for each revdep of data.table. The not.avail column indicates the number of “not available” messages which were output while checking that revdep.
(not.avail.logs <- jobs.dt[not.avail>0])
## task time MB State Package sig.diffs not.avail config.fail dl.fail
## <int> <char> <int> <char> <char> <int> <int> <int> <int>
## 1: 5 00:11:11 478 COMPLETED actel 0 1 1 0
## 2: 16 00:11:27 489 COMPLETED agriutilities 0 1 0 0
## 3: 27 00:09:09 413 COMPLETED AMR 0 4 0 0
## 4: 29 00:08:31 585 COMPLETED Anaconda 0 1 0 0
## 5: 41 00:24:53 1280 COMPLETED aPEAR 0 4 0 0
## ---
## 149: 1434 00:03:33 354 COMPLETED WGScan 0 1 0 0
## 150: 1437 00:13:18 1461 COMPLETED wiad 0 1 0 0
## 151: 1439 00:10:54 598 COMPLETED wilson 0 1 0 0
## 152: 1449 00:02:53 534 COMPLETED wTO 0 1 0 0
## 153: 1461 00:05:31 577 COMPLETED xplorerr 0 1 0 0
The table above has one row for every package that depends on / imports / etc. data.table, and that has some dependencies which were not available at check time. The code block below is a for loop which downloads the log for each of those packages.
for(pkg.i in 1:nrow(not.avail.logs)){
  pkg.row <- not.avail.logs[pkg.i]
  pkg.txt <- paste0(pkg.row$Package, ".txt")
  local.txt <- file.path(local.dir, pkg.txt)
  if(!file.exists(local.txt)){
    # only download if we do not already have a local copy.
    remote.txt <- paste0(remote.url.prefix, pkg.txt)
    download.file(remote.txt, local.txt)
  }
  if(file.size(local.txt)>1024*1024){
    # delete any downloaded log larger than 1MB.
    unlink(local.txt)
  }
}
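After the loop, we can check what ended up on disk, for example by listing the local log files (a quick sanity check, not part of the original post),
local.logs <- dir(local.dir, pattern="[.]txt$", full.names=TRUE)
length(local.logs)
head(basename(local.logs))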
Another regex to parse package list strings
Next, we use the previous regex to parse one representative log file using the code below.
(one.log.txt <- file.path(local.dir, "CNVScope.txt"))
## [1] "~/teaching/regex-tutorial/cran-check-logs/CNVScope.txt"
(one.not.avail.dt <- nc::capture_all_str(one.log.txt, up.to.deps.pattern))
## type before.colon
## <char> <char>
## 1: required
## 2: suggested for checking
## 3: required
## 4: suggested for checking
## 5: required
## 6: suggested for checking
## 7: required
## 8: suggested for checking
## deps
## <char>
## 1: \n 'GenomicInteractions', 'biomaRt', 'rtracklayer'\n
## 2: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## 3: \n 'GenomicInteractions', 'biomaRt', 'rtracklayer'\n
## 4: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## 5: \n 'GenomicInteractions', 'biomaRt', 'rtracklayer'\n
## 6: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## 7: \n 'GenomicInteractions', 'biomaRt', 'rtracklayer'\n
## 8: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
The code above parses the log file using the same regex as in the previous section. It returns a data table with one row per message about dependent packages missing during checks. We can further parse the deps column into a new column with one dependent package name per row, using the code below. Note that we use a new regex, quoted.pattern, which can be interpreted as follows:
- first match a single quote,
- then capture zero or more (non-greedy) characters other than newline (this captures the missing dependent package name),
- then match another single quote.
quoted.pattern <- list("'", dep.pkg=".*?", "'")
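Before applying it to the whole table, we can sanity check quoted.pattern on a single string (an extra step, not in the original post),
nc::capture_all_str("'alakazam', 'shazam'", quoted.pattern)
# should give a two-row table, with dep.pkg values alakazam and shazam.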
In the code below, we use by=.(deps,type) so that the regex matching of quoted.pattern happens once for each unique combination of deps and type.
one.not.avail.dt[
, nc::capture_all_str(deps, quoted.pattern)
, by=.(deps,type)]
## deps
## <char>
## 1: \n 'GenomicInteractions', 'biomaRt', 'rtracklayer'\n
## 2: \n 'GenomicInteractions', 'biomaRt', 'rtracklayer'\n
## 3: \n 'GenomicInteractions', 'biomaRt', 'rtracklayer'\n
## 4: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## 5: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## 6: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## 7: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## 8: \n 'InteractionSet', 'GenomicRanges', 'GenomicFeatures', 'GenomeInfoDb',\n 'BSgenome.Hsapiens.UCSC.hg19'\n
## type dep.pkg
## <char> <char>
## 1: required GenomicInteractions
## 2: required biomaRt
## 3: required rtracklayer
## 4: suggested InteractionSet
## 5: suggested GenomicRanges
## 6: suggested GenomicFeatures
## 7: suggested GenomeInfoDb
## 8: suggested BSgenome.Hsapiens.UCSC.hg19
The output above is a table with one row per dependent package that was missing during the check. This second regex has successfully transformed each long character string of quoted package names into a data table with one row per package name.
Wrapping both regex operations in a function
To wrap the two regex operations above, we define the function below.
Note that since nc::capture_all_str returns a data table, and the data table square brackets also return a data table, we can use the square brackets to define a chain/pipeline of operations.
- First, use up.to.deps.pattern to get a table with one row for each message about dependent packages missing during the check,
- Second, use quoted.pattern on each value of deps, to get another table with one row for each missing dependent package,
- Finally, return only the type and dep.pkg columns.
read_log <- function(log.txt){
  nc::capture_all_str(
    log.txt, up.to.deps.pattern
  )[
  , nc::capture_all_str(deps, quoted.pattern)
  , by=.(type, deps)
  ][
  , .(type, dep.pkg)
  ]
}
read_log(one.log.txt)
## type dep.pkg
## <char> <char>
## 1: required GenomicInteractions
## 2: required biomaRt
## 3: required rtracklayer
## 4: suggested InteractionSet
## 5: suggested GenomicRanges
## 6: suggested GenomicFeatures
## 7: suggested GenomeInfoDb
## 8: suggested BSgenome.Hsapiens.UCSC.hg19
The output above includes one row for each dependent package which was missing during the package check (same as before, but now in a function, and omitting the deps column).
Parsing several log files
In this section we use the function that we created in the previous section to parse several check log files. The code below defines a glob that specifies the set of log files to analyze,
(log.glob <- file.path(local.dir, "*.txt"))
## [1] "~/teaching/regex-tutorial/cran-check-logs/*.txt"
The call to nc::capture_first_glob in the code below can be interpreted as follows:
- the first argument is the glob specifying the files to read,
- the argument named READ is a function used to read each file (it should return a data table),
- and the other arguments specify a regex which is matched to each file name: the Package name to capture is one or more characters other than slash, up to the final .txt.
(log.dt <- nc::capture_first_glob(
  log.glob,
  Package="[^/]+", "[.]txt$",
  READ=read_log))
## Package type dep.pkg
## <char> <char> <char>
## 1: AMR enhances cleaner
## 2: AMR enhances janitor
## 3: AMR enhances skimr
## 4: AMR enhances tsibble
## 5: Anaconda required DESeq2
## ---
## 244: vaersvax suggested vaersND
## 245: wTO required Rfast
## 246: wiad suggested rgdal
## 247: wilson required DESeq2
## 248: xplorerr suggested rbokeh
The output above is a data table which has one row for each missing dependent package reported in each log file. The Package column contains the name of the package which was checked, and the dep.pkg column contains the name of the dependent package which was missing during that check.
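Since everything is now in one table, it is easy to compute summaries with data.table; for example, we could count how many times each missing dependency is mentioned across all logs (an extra summary, not in the original post),
log.dt[, .(n.mentions=.N), by=dep.pkg][order(-n.mentions)]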
If we want to report only the first log which mentions each missing dependency, we can use the code below. Note that .SD is a data table meaning “Subset of Data,” containing the rows of the current by group, and keyby is used instead of by to ensure that the output is sorted by type, then by dep.pkg.
(first.missing <- log.dt[, .SD[1], keyby=.(type, dep.pkg)])
## Key: <type, dep.pkg>
## type dep.pkg Package
## <char> <char> <char>
## 1: enhances AER margins
## 2: enhances MNP prediction
## 3: enhances VGAM prediction
## 4: enhances WGCNA dendextend
## 5: enhances aod prediction
## ---
## 158: suggested toscaData tosca
## 159: suggested vaers vaersNDvax
## 160: suggested vaersND vaersNDvax
## 161: suggested wTO CoDiNA
## 162: suggested xcms LCMSQA
The table above has one row for each unique combination of type and dep.pkg, where the value of Package is the first log which reported that missing dependency.
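For comparison, a similar first-row-per-group table could also be obtained with unique() after sorting (an alternative sketch, not part of the original post; it relies on the sort keeping the original row order within each type/dep.pkg group),
unique(log.dt[order(type, dep.pkg)], by=c("type", "dep.pkg"))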
Conclusion
We have seen how to use the nc R package to parse CRAN check log files. More generally, nc is useful whenever you have regularly structured text data/files, from which you would like to extract a data table, with one column for each capture group in a regex.
- nc::capture_all_str matches a regex to a text string/file, and returns a data table with one row for each match.
- nc::capture_first_glob inputs a glob, gets the list of matching file names, matches a regex to each file name, and combines the capture groups from that match with whatever data table is returned by calling the READ function on each file.
Exercise for the reader
Imagine that the files were numbered (12.txt) instead of named (actel.txt). In that case, how could you create the Package column? It is reported in the package check log, so try to modify the regex code to parse it out of each log file. Hint: the relevant part of the log is below.
partial.log <- "
* using session charset: ASCII
* checking for file 'actel/DESCRIPTION' ... OK
* this is package 'actel' version '1.3.0'
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... NOTE
Package suggested but not available for checking: 'gWidgets2tcltk'
* checking if this is a source package ... OK
"
Session info
sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C LC_TIME=English_United States.utf8
##
## time zone: America/Phoenix
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.3.2 nc_2023.8.24 tools_4.3.2 data.table_1.14.99 knitr_1.45 xfun_0.41
## [7] evaluate_0.23