Toby Dylan Hocking | Capturing regular expressions

The goal of this blog post is to explain a couple of interesting regular expression parsing techniques that were useful in my recent work.

nc intro

nc is my R package for named capture regular expressions (regex). It provides functions that make it easy to capture data from regularly structured text, and output a data table. For example, we can consider the problem of converting a typical text representation of elapsed time, 98:76:54 (98 hours, 76 minutes, 54 seconds) to a numeric variable. Below we define the text data to parse, which is called a “subject” in the context of regex matching:

elapsed.subject <- c("98:76:54","01:23:45")

Next we define a regex sub-pattern to match an integer,

int.pattern <- list("[0-9]+", as.integer)

Note in the code above that we define the regex sub-pattern as a list in R, which combines the pattern string with a conversion function. Next we use that sub-pattern list three times to create an overall pattern for matching to the subject/time text,

elapsed.pattern <- list(
  hours=int.pattern, ":",
  minutes=int.pattern, ":",
  seconds=int.pattern)

The pattern is defined in the code above as a list with five elements. Each of the three named elements becomes a capture group, and the name is used as a column name in the output (see below). The two un-named elements (both colons) define text to match between the capture groups. All elements of the pattern list are concatenated to obtain the final regex which is matched to each subject, using the code below.

(elapsed.dt <- nc::capture_first_vec(elapsed.subject, elapsed.pattern))

##    hours minutes seconds
##    <int>   <int>   <int>
## 1:    98      76      54
## 2:     1      23      45

The output above is a data table with one row per subject, and one column per capture group. Each column is an integer, because as.integer was specified as the type conversion for int.pattern in the code above. To compute the overall time, we can use the code below,

elapsed.dt[
  , overall.minutes := seconds/60+minutes+hours*60
][]

##    hours minutes seconds overall.minutes
##    <int>   <int>   <int>           <num>
## 1:    98      76      54         5956.90
## 2:     1      23      45           83.75
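
To make the concatenation step concrete, here is a minimal base R sketch of roughly what the pattern list amounts to: a single regex with three capture groups, followed by a type conversion of each group. This is only an illustration under that assumption, not a description of how nc is implemented; the names elapsed.regex, match.mat and elapsed.df are hypothetical.

```r
# Base R sketch (not nc internals): one regex with three capture groups.
elapsed.subject <- c("98:76:54", "01:23:45")
elapsed.regex <- "([0-9]+):([0-9]+):([0-9]+)"
# regexec finds the first match in each subject, with capture group positions.
match.list <- regmatches(elapsed.subject, regexec(elapsed.regex, elapsed.subject))
# Each list element is c(full match, group 1, group 2, group 3);
# bind into a matrix and drop the full match column.
match.mat <- do.call(rbind, match.list)[, -1, drop=FALSE]
# Convert each captured string to integer, as as.integer does in int.pattern.
elapsed.df <- data.frame(
  hours=as.integer(match.mat[, 1]),
  minutes=as.integer(match.mat[, 2]),
  seconds=as.integer(match.mat[, 3]))
elapsed.df$overall.minutes <- with(elapsed.df, seconds/60 + minutes + hours*60)
print(elapsed.df)
```

Compared with this sketch, the nc pattern list keeps each group name and conversion function next to the sub-pattern it belongs to, so there are no group indices to keep in sync.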

Parsing log files

Now suppose we have several log files, each created by running an R command under the time command, as below.

R.commands <- c('R.version','library(nc);example(capture_all_str)')
log.subject <- character()
for(cmd.i in seq_along(R.commands)){
  R.cmd <- R.commands[[cmd.i]]
  time.cmd <- sprintf("time R -e '%s' 2>&1", R.cmd)
  log.lines <- system(time.cmd, intern=TRUE)
  print(tail(log.lines))
  log.subject[[R.cmd]] <- paste(log.lines, collapse="\n")
}

## [1] "version.string R version 4.3.2 (2023-10-31)"                                    
## [2] "nickname       Eye Holes                   "                                    
## [3] "> "                                                                             
## [4] "> "                                                                             
## [5] "0.54user 0.06system 0:00.61elapsed 99%CPU (0avgtext+0avgdata 52668maxresident)k"
## [6] "0inputs+8outputs (2major+13316minor)pagefaults 0swaps"                          
## [1] "2: Transformation introduced infinite values in continuous x-axis "              
## [2] "3: Transformation introduced infinite values in continuous x-axis "              
## [3] "> "                                                                              
## [4] "> "                                                                              
## [5] "6.72user 0.23system 0:07.03elapsed 98%CPU (0avgtext+0avgdata 126084maxresident)k"
## [6] "0inputs+56outputs (2major+30902minor)pagefaults 0swaps"

Above we see the last few lines in each time command log. The user and system timings are in seconds, and the elapsed time is shown in minutes:seconds format. Another way to view the data is via the last few characters of each log, via the code below.

substr(log.subject, nchar(log.subject)-150, nchar(log.subject))

##                                                                                                                                                     R.version 
## "           \n> \n> \n0.54user 0.06system 0:00.61elapsed 99%CPU (0avgtext+0avgdata 52668maxresident)k\n0inputs+8outputs (2major+13316minor)pagefaults 0swaps" 
##                                                                                                                          library(nc);example(capture_all_str) 
## "s x-axis \n> \n> \n6.72user 0.23system 0:07.03elapsed 98%CPU (0avgtext+0avgdata 126084maxresident)k\n0inputs+56outputs (2major+30902minor)pagefaults 0swaps"

In both cases above, we see the number of seconds is a numeric variable with a decimal point. Below we define a pattern to match a numeric variable encoded in text,

num.pattern <- list("[0-9.]+", as.numeric)

Below we combine the sub-pattern above with a suffix to get a first partial match. Note that we use capture_first_vec, which finds the first match in each element of the subject vector (here, one element per log).

user.pattern <- list(user=num.pattern, "user")
nc::capture_first_vec(log.subject, user.pattern)

##     user
##    <num>
## 1:  0.54
## 2:  6.72

The output above is a data table with one row per log file, and one column per capture group defined in the regex. Below we define a more complex pattern, to additionally capture the system time,

user.system.pattern <- list(user.pattern, " ", system=num.pattern, "system")
nc::capture_first_vec(log.subject, user.system.pattern)

##     user system
##    <num>  <num>
## 1:  0.54   0.06
## 2:  6.72   0.23

Exercise for the reader: modify the pattern to additionally capture elapsed, CPU, etc.
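
As a rough point of comparison, the sub-pattern reuse above can be sketched in base R with string concatenation. The subject below is the GNU time line shown earlier for R.version; the names gnu.line, num.regex and gnu.times are hypothetical, and this is an illustration rather than what nc does internally.

```r
# Base R sketch: reuse one numeric sub-pattern twice via paste0.
gnu.line <- "0.54user 0.06system 0:00.61elapsed 99%CPU (0avgtext+0avgdata 52668maxresident)k"
num.regex <- "([0-9.]+)"
user.system.regex <- paste0(num.regex, "user ", num.regex, "system")
# regexec returns the first match together with its capture groups.
m <- regmatches(gnu.line, regexec(user.system.regex, gnu.line))[[1]]
# m[1] is the full match; m[2] and m[3] are the two captured numbers.
gnu.times <- c(user=as.numeric(m[2]), system=as.numeric(m[3]))
print(gnu.times)
```

One caveat worth keeping in mind: a character class such as [0-9.]+ also matches text like "1.2.3", for which as.numeric returns NA with a warning. That is not a problem for these logs, but a stricter number pattern may be preferable for messier input.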

Parsing multi-line log files

Now suppose we had to parse the output of POSIX time instead of GNU time, as in the code below, which passes the -p flag to time.

posix.subject <- character()
for(cmd.i in seq_along(R.commands)){
  R.cmd <- R.commands[[cmd.i]]
  time.cmd <- sprintf("time -p R -e '%s' 2>&1", R.cmd)
  log.lines <- system(time.cmd, intern=TRUE)
  print(tail(log.lines))
  posix.subject[[R.cmd]] <- paste(log.lines, collapse="\n")
}

## [1] "nickname       Eye Holes                   " "> "                                         
## [3] "> "                                          "real 0.68"                                  
## [5] "user 0.56"                                   "sys 0.06"                                   
## [1] "3: Transformation introduced infinite values in continuous x-axis "
## [2] "> "                                                                
## [3] "> "                                                                
## [4] "real 6.98"                                                         
## [5] "user 6.75"                                                         
## [6] "sys 0.14"

Another way to view the data is via the last few characters of each log, via the code below.

substr(posix.subject, nchar(posix.subject)-50, nchar(posix.subject))

##                                                  R.version                       library(nc);example(capture_all_str) 
## "                \n> \n> \nreal 0.68\nuser 0.56\nsys 0.06" "ntinuous x-axis \n> \n> \nreal 6.98\nuser 6.75\nsys 0.14"

Again we can parse using a regex, which we begin to build in the code below.

real.pattern <- list("real ", real=num.pattern)
nc::capture_first_vec(posix.subject, real.pattern)

##     real
##    <num>
## 1:  0.68
## 2:  6.98

We build a more complex regex in the code below,

real.user.pattern <- list(real.pattern, "\nuser ", user=num.pattern)
nc::capture_first_vec(posix.subject, real.user.pattern)

##     real  user
##    <num> <num>
## 1:  0.68  0.56
## 2:  6.98  6.75

Exercise for the reader: create a more complex regex that additionally matches the sys time.

Parse all times in log

In the previous sections, we created a new pattern for each of the times. But is it possible to have a single regex that matches all of the times? Yes! We can use capture_all_str, which inputs a text string (or file) to parse with a regex, and returns one row for each match. To get that to work with multiple subjects, each of which is a multi-line log string/file, we need to use it inside of a data table by clause, as below,

data.table(R.cmd=R.commands)[, nc::capture_all_str(
  posix.subject[[R.cmd]], type="real|user|sys", " ", seconds=num.pattern
), by=R.cmd]

##                                   R.cmd   type seconds
##                                  <char> <char>   <num>
## 1:                            R.version   real    0.68
## 2:                            R.version   user    0.56
## 3:                            R.version    sys    0.06
## 4: library(nc);example(capture_all_str)   real    6.98
## 5: library(nc);example(capture_all_str)   user    6.75
## 6: library(nc);example(capture_all_str)    sys    0.14

Exercise for the reader: do something similar with log.subject instead of posix.subject.
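
For readers curious what finding all matches looks like without nc, here is a minimal base R sketch using gregexpr on the tail of the R.version log shown above. The names posix.tail, all.matches and times.df are hypothetical, and the split on a single space assumes each match has exactly the form "type seconds".

```r
# Base R sketch: find every "type seconds" match in one multi-line string.
posix.tail <- "> \n> \nreal 0.68\nuser 0.56\nsys 0.06"
all.regex <- "(real|user|sys) [0-9.]+"
# gregexpr finds all matches (not just the first); regmatches extracts them.
all.matches <- regmatches(posix.tail, gregexpr(all.regex, posix.tail))[[1]]
# Each match looks like "real 0.68", so split on the single space.
parts <- strsplit(all.matches, " ", fixed=TRUE)
times.df <- data.frame(
  type=sapply(parts, "[", 1),
  seconds=as.numeric(sapply(parts, "[", 2)),
  stringsAsFactors=FALSE)
print(times.df)
```

Because this regex is not anchored to the start of a line, it would also match text such as "user 1" elsewhere in a full log, so a stricter pattern may be needed for other inputs.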

More complex regex

For a more challenging example, let us consider data from Python doc strings, taken from torchvision.

doc.strings <- c('`The Rendered SST2 Dataset <https://github.com/openai/CLIP/blob/main/data/rendered-sst2.md>`_.\n\n    Rendered SST2 is an image classification dataset used to evaluate the models capability on optical\n    character recognition. This dataset was generated by rendering sentences in the Standford Sentiment\n    Treebank v2 dataset.\n\n    This dataset contains two classes (positive and negative) and is divided in three splits: a  train\n    split containing 6920 images (3610 positive and 3310 negative), a validation split containing 872 images\n    (444 positive and 428 negative), and a test split containing 1821 images (909 positive and 912 negative).\n\n    Args:\n        root (string): Root directory of the dataset.\n        split (string, optional): The dataset split, supports ``"train"`` (default), `"val"` and ``"test"``.\n        transform (callable, optional): A function/transform that  takes in an PIL image and returns a transformed\n            version. E.g, ``transforms.RandomCrop``.\n        target_transform (callable, optional): A function/transform that takes in the target and transforms it.\n        download (bool, optional): If True, downloads the dataset from the internet and\n            puts it in root directory. If dataset is already downloaded, it is not\n            downloaded again. Default is False.\n    ', "`WIDERFace <http://shuoyang1213.me/WIDERFACE/>`_ Dataset.\n\n    Args:\n        root (string): Root directory where images and annotations are downloaded to.\n            Expects the following folder structure if download=False:\n\n            .. code::\n\n                <root>\n                    \u2514\u2500\u2500 widerface\n                        \u251c\u2500\u2500 wider_face_split ('wider_face_split.zip' if compressed)\n                        \u251c\u2500\u2500 WIDER_train ('WIDER_train.zip' if compressed)\n                        \u251c\u2500\u2500 WIDER_val ('WIDER_val.zip' if compressed)\n                        \u2514\u2500\u2500 WIDER_test ('WIDER_test.zip' if compressed)\n        split (string): The dataset split to use. One of {``train``, ``val``, ``test``}.\n            Defaults to ``train``.\n        transform (callable, optional): A function/transform that  takes in a PIL image\n            and returns a transformed version. E.g, ``transforms.RandomCrop``\n        target_transform (callable, optional): A function/transform that takes in the\n            target and transforms it.\n        download (bool, optional): If true, downloads the dataset from the internet and\n            puts it in root directory. If dataset is already downloaded, it is not\n            downloaded again.\n\n    ", '`EMNIST <https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist>`_ Dataset.\n\n    Args:\n        root (string): Root directory of dataset where ``EMNIST/raw/train-images-idx3-ubyte``\n            and  ``EMNIST/raw/t10k-images-idx3-ubyte`` exist.\n        split (string): The dataset has 6 different splits: ``byclass``, ``bymerge``,\n            ``balanced``, ``letters``, ``digits`` and ``mnist``. This argument specifies\n            which one to use.\n        train (bool, optional): If True, creates dataset from ``training.pt``,\n            otherwise from ``test.pt``.\n        download (bool, optional): If True, downloads the dataset from the internet and\n            puts it in root directory. If dataset is already downloaded, it is not\n            downloaded again.\n        transform (callable, optional): A function/transform that  takes in an PIL image\n            and returns a transformed version. E.g, ``transforms.RandomCrop``\n        target_transform (callable, optional): A function/transform that takes in the\n            target and transforms it.\n    ')
cat(doc.strings, sep="\n\n-----\n\n")

## `The Rendered SST2 Dataset <https://github.com/openai/CLIP/blob/main/data/rendered-sst2.md>`_.
## 
##     Rendered SST2 is an image classification dataset used to evaluate the models capability on optical
##     character recognition. This dataset was generated by rendering sentences in the Standford Sentiment
##     Treebank v2 dataset.
## 
##     This dataset contains two classes (positive and negative) and is divided in three splits: a  train
##     split containing 6920 images (3610 positive and 3310 negative), a validation split containing 872 images
##     (444 positive and 428 negative), and a test split containing 1821 images (909 positive and 912 negative).
## 
##     Args:
##         root (string): Root directory of the dataset.
##         split (string, optional): The dataset split, supports ``"train"`` (default), `"val"` and ``"test"``.
##         transform (callable, optional): A function/transform that  takes in an PIL image and returns a transformed
##             version. E.g, ``transforms.RandomCrop``.
##         target_transform (callable, optional): A function/transform that takes in the target and transforms it.
##         download (bool, optional): If True, downloads the dataset from the internet and
##             puts it in root directory. If dataset is already downloaded, it is not
##             downloaded again. Default is False.
##     
## 
## -----
## 
## `WIDERFace <http://shuoyang1213.me/WIDERFACE/>`_ Dataset.
## 
##     Args:
##         root (string): Root directory where images and annotations are downloaded to.
##             Expects the following folder structure if download=False:
## 
##             .. code::
## 
##                 <root>
##                     └── widerface
##                         ├── wider_face_split ('wider_face_split.zip' if compressed)
##                         ├── WIDER_train ('WIDER_train.zip' if compressed)
##                         ├── WIDER_val ('WIDER_val.zip' if compressed)
##                         └── WIDER_test ('WIDER_test.zip' if compressed)
##         split (string): The dataset split to use. One of {``train``, ``val``, ``test``}.
##             Defaults to ``train``.
##         transform (callable, optional): A function/transform that  takes in a PIL image
##             and returns a transformed version. E.g, ``transforms.RandomCrop``
##         target_transform (callable, optional): A function/transform that takes in the
##             target and transforms it.
##         download (bool, optional): If true, downloads the dataset from the internet and
##             puts it in root directory. If dataset is already downloaded, it is not
##             downloaded again.
## 
##     
## 
## -----
## 
## `EMNIST <https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist>`_ Dataset.
## 
##     Args:
##         root (string): Root directory of dataset where ``EMNIST/raw/train-images-idx3-ubyte``
##             and  ``EMNIST/raw/t10k-images-idx3-ubyte`` exist.
##         split (string): The dataset has 6 different splits: ``byclass``, ``bymerge``,
##             ``balanced``, ``letters``, ``digits`` and ``mnist``. This argument specifies
##             which one to use.
##         train (bool, optional): If True, creates dataset from ``training.pt``,
##             otherwise from ``test.pt``.
##         download (bool, optional): If True, downloads the dataset from the internet and
##             puts it in root directory. If dataset is already downloaded, it is not
##             downloaded again.
##         transform (callable, optional): A function/transform that  takes in an PIL image
##             and returns a transformed version. E.g, ``transforms.RandomCrop``
##         target_transform (callable, optional): A function/transform that takes in the
##             target and transforms it.
## 

There is some structure in these doc strings, so it is possible to parse them using regex:

Title and URL on the first line.
Optional multi-line description below the title.
Args: section below the description.
Each argument formatted as name (type): description.

Here we focus just on parsing each argument (others are exercises for the reader). The pattern is relatively straightforward, if we want to just get one line:

before.name <- " +"
name.pattern <- "[^ ]+"
after.name <- " [(]"
name.type.pattern <- list(before.name, name=name.pattern, after.name, type=".*?", "[)]: ")
nc::capture_all_str(doc.strings, name.type.pattern, description=".*")

##                 name               type                                                                description
##               <char>             <char>                                                                     <char>
##  1:             root             string                                             Root directory of the dataset.
##  2:            split   string, optional The dataset split, supports ``"train"`` (default), `"val"` and ``"test"``.
##  3:        transform callable, optional A function/transform that  takes in an PIL image and returns a transformed
##  4: target_transform callable, optional           A function/transform that takes in the target and transforms it.
##  5:         download     bool, optional                       If True, downloads the dataset from the internet and
##  6:             root             string             Root directory where images and annotations are downloaded to.
##  7:            split             string           The dataset split to use. One of {``train``, ``val``, ``test``}.
##  8:        transform callable, optional                            A function/transform that  takes in a PIL image
##  9: target_transform callable, optional                                     A function/transform that takes in the
## 10:         download     bool, optional                       If true, downloads the dataset from the internet and
## 11:             root             string     Root directory of dataset where ``EMNIST/raw/train-images-idx3-ubyte``
## 12:            split             string              The dataset has 6 different splits: ``byclass``, ``bymerge``,
## 13:            train     bool, optional                             If True, creates dataset from ``training.pt``,
## 14:         download     bool, optional                       If True, downloads the dataset from the internet and
## 15:        transform callable, optional                           A function/transform that  takes in an PIL image
## 16: target_transform callable, optional                                     A function/transform that takes in the

But what if we want to get all of the lines of the description? We could try a multi-line greedy match, but that gives us only one row with too much in the description.

str(nc::capture_all_str(doc.strings, name.type.pattern, description="(?:.*\n)*"))

## Classes 'data.table' and 'data.frame':	1 obs. of  3 variables:
##  $ name       : chr "root"
##  $ type       : chr "string"
##  $ description: chr "Root directory of the dataset.\n        split (string, optional): The dataset split, supports ``\"train\"`` (de"| __truncated__
##  - attr(*, ".internal.selfref")=<externalptr>

The trick to getting this to work is to be more specific about what kinds of lines are allowed to match in the description. Basically, we can add a line if it is not going to match another argument. To do that we need negative lookahead.

not.arg <- list(
  "(?!",#negative lookahead, makes match fail if another argument name on this line.
  before.name, name.pattern, after.name, ")")
desc.pattern <- list(description=list(
  ".*\n",#first line
  nc::quantifier(not.arg, ".*\n", "*")))
arg.dt <- nc::capture_all_str(doc.strings, name.type.pattern, desc.pattern)
arg.dt[, .(name, type, desc=substr(description, 1, 40))]

##                 name               type                                      desc
##               <char>             <char>                                    <char>
##  1:             root             string          Root directory of the dataset.\n
##  2:            split   string, optional  The dataset split, supports ``"train"`` 
##  3:        transform callable, optional  A function/transform that  takes in an P
##  4: target_transform callable, optional  A function/transform that takes in the t
##  5:         download     bool, optional  If True, downloads the dataset from the 
##  6:             root             string  Root directory where images and annotati
##  7:            split             string  The dataset split to use. One of {``trai
##  8:        transform callable, optional  A function/transform that  takes in a PI
##  9: target_transform callable, optional A function/transform that takes in the\n 
## 10:         download     bool, optional  If true, downloads the dataset from the 
## 11:             root             string  Root directory of dataset where ``EMNIST
## 12:            split             string  The dataset has 6 different splits: ``by
## 13:            train     bool, optional  If True, creates dataset from ``training
## 14:         download     bool, optional  If True, downloads the dataset from the 
## 15:        transform callable, optional  A function/transform that  takes in an P
## 16: target_transform callable, optional A function/transform that takes in the\n

arg.dt[, cat(sprintf("%s (%s): %s", name, type, description),sep="\n")]

## root (string): Root directory of the dataset.
## 
## split (string, optional): The dataset split, supports ``"train"`` (default), `"val"` and ``"test"``.
## 
## transform (callable, optional): A function/transform that  takes in an PIL image and returns a transformed
##             version. E.g, ``transforms.RandomCrop``.
## 
## target_transform (callable, optional): A function/transform that takes in the target and transforms it.
## 
## download (bool, optional): If True, downloads the dataset from the internet and
##             puts it in root directory. If dataset is already downloaded, it is not
##             downloaded again. Default is False.
##     
## `WIDERFace <http://shuoyang1213.me/WIDERFACE/>`_ Dataset.
## 
##     Args:
## 
## root (string): Root directory where images and annotations are downloaded to.
##             Expects the following folder structure if download=False:
## 
##             .. code::
## 
##                 <root>
##                     └── widerface
##                         ├── wider_face_split ('wider_face_split.zip' if compressed)
##                         ├── WIDER_train ('WIDER_train.zip' if compressed)
##                         ├── WIDER_val ('WIDER_val.zip' if compressed)
##                         └── WIDER_test ('WIDER_test.zip' if compressed)
## 
## split (string): The dataset split to use. One of {``train``, ``val``, ``test``}.
##             Defaults to ``train``.
## 
## transform (callable, optional): A function/transform that  takes in a PIL image
##             and returns a transformed version. E.g, ``transforms.RandomCrop``
## 
## target_transform (callable, optional): A function/transform that takes in the
##             target and transforms it.
## 
## download (bool, optional): If true, downloads the dataset from the internet and
##             puts it in root directory. If dataset is already downloaded, it is not
##             downloaded again.
## 
##     
## `EMNIST <https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist>`_ Dataset.
## 
##     Args:
## 
## root (string): Root directory of dataset where ``EMNIST/raw/train-images-idx3-ubyte``
##             and  ``EMNIST/raw/t10k-images-idx3-ubyte`` exist.
## 
## split (string): The dataset has 6 different splits: ``byclass``, ``bymerge``,
##             ``balanced``, ``letters``, ``digits`` and ``mnist``. This argument specifies
##             which one to use.
## 
## train (bool, optional): If True, creates dataset from ``training.pt``,
##             otherwise from ``test.pt``.
## 
## download (bool, optional): If True, downloads the dataset from the internet and
##             puts it in root directory. If dataset is already downloaded, it is not
##             downloaded again.
## 
## transform (callable, optional): A function/transform that  takes in an PIL image
##             and returns a transformed version. E.g, ``transforms.RandomCrop``
## 
## target_transform (callable, optional): A function/transform that takes in the
##             target and transforms it.

## NULL

Note how the above output indeed captures each argument description, but some of them include more than necessary (title of next data set is included in download arg description). Exercise for the reader: fix this by putting capture_all_str inside of by and creating a new regex to parse out the different sections of the doc string (title, url, args), and then use the regex that we created above to parse the args section.
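
To see the negative lookahead in isolation, here is a minimal base R sketch using perl=TRUE, applied to four lines taken from the EMNIST doc string above. The names args.text, arg.start, arg.regex, arg.matches and arg.names are hypothetical; the regex is written by hand to mirror the idea of the nc pattern, and is not necessarily the exact regex string that nc generates.

```r
# Base R sketch: a continuation line is consumed only if it does NOT
# look like the start of another argument (negative lookahead).
args.text <- paste0(
  "        root (string): Root directory of dataset where ``EMNIST/raw/train-images-idx3-ubyte``\n",
  "            and  ``EMNIST/raw/t10k-images-idx3-ubyte`` exist.\n",
  "        train (bool, optional): If True, creates dataset from ``training.pt``,\n",
  "            otherwise from ``test.pt``.\n")
arg.start <- " +[^ ]+ [(]"           # spaces, name, then " ("
arg.regex <- paste0(
  arg.start, ".*?[)]: ",             # name (type):
  ".*\n",                            # first line of the description
  "(?:(?!", arg.start, ").*\n)*")    # more lines, unless a new arg starts
# perl=TRUE is needed because lookahead is a PCRE feature.
arg.matches <- regmatches(
  args.text, gregexpr(arg.regex, args.text, perl=TRUE))[[1]]
# (?s) lets . match newlines, so the whole match is replaced by the name.
arg.names <- sub("(?s)^ +([^ ]+) [(].*", "\\1", arg.matches, perl=TRUE)
print(arg.names)
cat(arg.matches, sep="-----\n")
```

Each of the two matches spans two lines: the continuation line is kept because it does not start with a name followed by an opening parenthesis, and the match stops just before the line that starts the next argument.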

Conclusion

We have seen various applications of regular expressions using nc in R. For even more practice, I recommend reading my regex-tutorial repo.

Session info

sessionInfo()

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Phoenix
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  utils     datasets  grDevices methods   base     
## 
## other attached packages:
## [1] future_1.33.1     ggplot2_3.4.4     data.table_1.15.0
## 
## loaded via a namespace (and not attached):
##  [1] future.apply_1.11.1    gtable_0.3.4           dplyr_1.1.4            compiler_4.3.2         crayon_1.5.2          
##  [6] tidyselect_1.2.0       parallel_4.3.2         globals_0.16.2         scales_1.3.0           uuid_1.2-0            
## [11] RhpcBLASctl_0.23-42    R6_2.5.1               mlr3tuning_0.19.2      labeling_0.4.3         generics_0.1.3        
## [16] knitr_1.45             palmerpenguins_0.1.1   backports_1.4.1        checkmate_2.3.1        tibble_3.2.1          
## [21] munsell_0.5.0          paradox_0.11.1         pillar_1.9.0           mlr3tuningspaces_0.4.0 mlr3measures_0.5.0    
## [26] rlang_1.1.3            utf8_1.2.4             xfun_0.41              lgr_0.4.4              mlr3_0.17.2           
## [31] mlr3misc_0.13.0        cli_3.6.2              withr_3.0.0            magrittr_2.0.3         digest_0.6.34         
## [36] grid_4.3.2             mlr3learners_0.5.8     bbotk_0.7.3            nc_2024.2.6            lifecycle_1.0.4       
## [41] vctrs_0.6.5            evaluate_0.23          glue_1.7.0             farver_2.1.1           listenv_0.9.1         
## [46] codetools_0.2-19       parallelly_1.36.0      fansi_1.0.6            colorspace_2.1-0       tools_4.3.2           
## [51] pkgconfig_2.0.3