My paper about regular expressions for data reshaping was recently accepted into R journal. It describes the novelty of my nc R package, which solves the problem of “separation of related concepts” which happens with other data reshaping packages. I wrote “the nc syntax should be preferred for more complex patterns (with more groups) in order to keep the group names and sub-patterns closer and easier to maintain/read in the code.” This is related to the idea that a data table is often easier to read/edit when defined as a collection of rows (versus columns).

medRxiv paper code example

In our recent paper about spatially explicit stochastic disease models (currently in peer review), there is some R code that defines time windows. It was my suggestion to present the data/code as follows, with each time window on a line:

one.window <- function(start, end, r0)data.frame(start, end, r0)
library(lubridate)
(time.window.args <- rbind(# Specify the components of 5 time windows
  one.window(mdy("1-1-20"),mdy("1-31-20"),3.0),
  one.window(mdy("2-1-20"),mdy("2-15-20"),0.8),
  one.window(mdy("2-16-20"),mdy("3-10-20"),0.8),
  one.window(mdy("3-11-20"),mdy("3-21-20"),1.4),
  one.window(mdy("3-22-20"),mdy("5-1-20"),1.4)))
##        start        end  r0
## 1 2020-01-01 2020-01-31 3.0
## 2 2020-02-01 2020-02-15 0.8
## 3 2020-02-16 2020-03-10 0.8
## 4 2020-03-11 2020-03-21 1.4
## 5 2020-03-22 2020-05-01 1.4

The code above is easy to read because all of the information related to a given time window is defined on the same line. It is easy to add/remove/change windows for the same reason. In contrast, another way to define these data would be with a column on each line,

data.frame(
  start=c("1-1-20","2-1-20","2-16-20","3-11-20","3-22-20"),
  end=c("1-31-20","2-15-20","3-10-20","3-21-20","5-1-20"),
  r0=c(3,0.8,0.8,1.4,1.4))
##     start     end  r0
## 1  1-1-20 1-31-20 3.0
## 2  2-1-20 2-15-20 0.8
## 3 2-16-20 3-10-20 0.8
## 4 3-11-20 3-21-20 1.4
## 5 3-22-20  5-1-20 1.4

This format above is much more difficult to read and understand – what is the r0 value for Mar 17? You have to first figure out that day falls in the fourth start/end, then you have to find the fourth r0 value. Reading is a bit easier if you add a bunch of spaces, as below:

data.frame(
  start =c( "1-1-20", "2-1-20", "2-16-20", "3-11-20", "3-22-20"),
  end   =c("1-31-20","2-15-20", "3-10-20", "3-21-20",  "5-1-20"),
  r0    =c(        3,      0.8,       0.8,       1.4,       1.4))
##     start     end  r0
## 1  1-1-20 1-31-20 3.0
## 2  2-1-20 2-15-20 0.8
## 3 2-16-20 3-10-20 0.8
## 4 3-11-20 3-21-20 1.4
## 5 3-22-20  5-1-20 1.4

Column-wise storage also makes it more difficult to add/remove/edit entries. This observation helps to explain why CSV files are such a popular data storage format.

Analog with regular expressions for data reshaping

In wide-to-long data reshaping with regular column names we can use a regex pattern to (1) identify the input columns to reshape, and (2) extract data from those columns for storage in the output. For example consider the following data,

# similar to data(who, package="tidyr")
who <- data.frame(country="usa", new_sp_m5564=1, newrel_f65=2, new_ep_f014=3)

Each column after the first has a regular name:

  • new,
  • an optional underscore,
  • a diagnosis code (sp, rel, ep),
  • sex (m or f),
  • a min age in years,
  • an optional max age.

Say we want to reshape these “wide” data into a “longer” table with columns diagnosis, sex, min_years, max_years. The most basic method is by using a capturing regular expression,

pattern <- "new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})"
proto <- data.frame(
  diagnosis=character(),
  sex=character(),
  min_years=numeric(),
  max_years=numeric())
transform(
  strcapture(pattern, names(who), proto)
, max_years=ifelse(is.na(max_years) & is.finite(min_years), Inf, max_years))
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>        NA        NA
## 2        sp    m        55        64
## 3       rel    f        65       Inf
## 4        ep    f         0        14

Looking at the above code, we see a relatively complex regular expression pattern with four capture groups. To use it with strcapture we need to define proto, a data frame which specifies the type of each output column. There are two drawbacks to notice with this approach:

  • types are defined in proto but converting NA to Inf must be done after strcapture, here using transform.
  • there is separation between the definitions of the related concepts for each capture group (sub-pattern is defined in pattern whereas names and types defined in proto).

An attempt to remove this separation:

p <- function(regex, name=NA, type=NA, fun=NA)data.frame(regex, name, type, fun)
NA_to_Inf <- function(s){
  x <- as.numeric(s)
  ifelse(is.na(x), Inf, x)
}
(pattern.df <- rbind(
  p("new_?"),
  p("(.*)", "diagnosis", "character"),
  p("_"),
  p("(.)", "sex", "character"),
  p("(0|[1-9]{2})", "min_years", "numeric", "as.numeric"),
  p("([0-9]{0,2})", "max_years", "numeric", "NA_to_Inf")))
##          regex      name      type        fun
## 1        new_?      <NA>      <NA>       <NA>
## 2         (.*) diagnosis character       <NA>
## 3            _      <NA>      <NA>       <NA>
## 4          (.)       sex character       <NA>
## 5 (0|[1-9]{2}) min_years   numeric as.numeric
## 6 ([0-9]{0,2}) max_years   numeric  NA_to_Inf

In the definition above, all of the information related to each capture group is defined on the same line, so it is much easier to read/edit! We can use it for doing the same computation via

(my.pattern <- paste(pattern.df[["regex"]], collapse=""))
## [1] "new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})"
(my.proto <- with(pattern.df, {
  is.group <- !is.na(name)
  structure(
    lapply(type[is.group], function(tname)get(tname)()),
    names = name[is.group])
}))
## $diagnosis
## character(0)
## 
## $sex
## character(0)
## 
## $min_years
## numeric(0)
## 
## $max_years
## numeric(0)
(strcapture.out <- strcapture(my.pattern, names(who), my.proto))
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>        NA        NA
## 2        sp    m        55        64
## 3       rel    f        65        NA
## 4        ep    f         0        14

We can furthermore use the fun column to automatically apply the conversion to Inf:

(t.fun.df <- subset(pattern.df, !is.na(fun)))
##          regex      name    type        fun
## 5 (0|[1-9]{2}) min_years numeric as.numeric
## 6 ([0-9]{0,2}) max_years numeric  NA_to_Inf
for(t.fun.i in 1:nrow(t.fun.df)){
  t.fun.row <- t.fun.df[t.fun.i,]
  fun <- get(t.fun.row[["fun"]])
  name <- t.fun.row[["name"]]
  value <- strcapture.out[[name]]
  strcapture.out[[name]] <- fun(value)
}
strcapture.out
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>        NA       Inf
## 2        sp    m        55        64
## 3       rel    f        65       Inf
## 4        ep    f         0        14

So the code above demonstrates that it is techinically POSSIBLE to use strcapture with a complex regex defined in a way that has all info for a particular capture group on a single line. Furthermore we can use the same info to do the reshape, using the new data.table::measurev() function which I recently implemented.

library(data.table)
fun.list <- with(subset(pattern.df, !is.na(name)), {
  structure(lapply(fun, function(f)if(is.na(f))NULL else get(f)), names=name)
})
(who.dt <- data.table(who))
##    country new_sp_m5564 newrel_f65 new_ep_f014
## 1:     usa            1          2           3
who.long <- data.table::melt(
  who.dt, measure.vars=measurev(fun.list, pattern=my.pattern))
print(who.long, class=TRUE)
##    country diagnosis    sex min_years max_years value
##     <char>    <char> <char>     <num>     <num> <num>
## 1:     usa        sp      m        55        64     1
## 2:     usa       rel      f        65       Inf     2
## 3:     usa        ep      f         0        14     3

Simpler version using nc package

The whole point of the nc package is that you can define your regex in a similar way, so that there is one line which has all the related info for each group (name, sub-pattern, conversion function). The nc code which is analogous to the pattern.df code above would be

nc.pattern <- list(
  "new_?",
  diagnosis=".*", 
  "_",
  sex=".", 
  min_years="0|[1-9]{2}", as.numeric,
  max_years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x)))

Having defined the pattern in this way, nc provides a function to do the matching and capturing,

capture.result <- nc::capture_first_vec(
  names(who), nc.pattern, nomatch.error = FALSE)
print(capture.result, class=TRUE)
##    diagnosis    sex min_years max_years
##       <char> <char>     <num>     <num>
## 1:      <NA>   <NA>        NA        NA
## 2:        sp      m        55        64
## 3:       rel      f        65       Inf
## 4:        ep      f         0        14

and another function to do the reshaping,

reshape.result <- nc::capture_melt_single(who, nc.pattern)
print(reshape.result, class=TRUE)
##    country diagnosis    sex min_years max_years value
##     <char>    <char> <char>     <num>     <num> <num>
## 1:     usa        sp      m        55        64     1
## 2:     usa       rel      f        65       Inf     2
## 3:     usa        ep      f         0        14     3

Doing the same thing in python

Can we do the same in python? YES.

import pandas as pd
import math
who = pd.DataFrame({
    "country":["usa"], "new_sp_m5564":[1], "newrel_f65":[2], "new_ep_f014":[3]})
pattern_list = [
    "new_?",
    ("(.*)", "diagnosis"),
    "_",
    ("(.)", "sex"),
    ("(0|[1-9]{2})", "min_years", float),
    ("([0-9]{0,2})", "max_years", lambda y: y.where(y!="", math.inf).astype(float)),
    ]
pattern = ""
name_list = []
fun_dict = {}
for pat_or_tup in pattern_list:
    if type(pat_or_tup) == type(""):
        pat = pat_or_tup
        name = None
    elif len(pat_or_tup) == 2:
        pat, name = pat_or_tup
    else:
        pat, name, fun = pat_or_tup
        fun_dict[name] = fun
    if name:
        name_list.append(name)
    pattern += pat
pattern
## 'new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})'
name_list
## ['diagnosis', 'sex', 'min_years', 'max_years']
fun_dict
## {'min_years': <class 'float'>, 'max_years': <function <lambda> at 0x7f81737cdd90>}
match_df = who.columns.to_series().str.extract(pattern)
match_df.columns = name_list
match_df
##              diagnosis  sex min_years max_years
## country            NaN  NaN       NaN       NaN
## new_sp_m5564        sp    m        55        64
## newrel_f65         rel    f        65          
## new_ep_f014         ep    f         0        14
for col_name, fun in fun_dict.items():
    match_df[col_name] = match_df[col_name].transform(fun)
match_df
##              diagnosis  sex  min_years  max_years
## country            NaN  NaN        NaN        NaN
## new_sp_m5564        sp    m       55.0       64.0
## newrel_f65         rel    f       65.0        inf
## new_ep_f014         ep    f        0.0       14.0

Named capture groups in regex string literals

One other approach worth mentioning here is using regular expression string literals with named capture groups. That would result in something like

n <- function(regex, type=NA, fun=NA)data.frame(regex, type, fun)
(named.pattern.df <- rbind(
  n("new_?"),
  n("(?P<diagnosis>.*)", "character"),
  n("_"),
  n("(?P<sex>.)", "character"),
  n("(?P<min_years>0|[1-9]{2})", "numeric", "as.numeric"),
  n("(?P<max_years>[0-9]{0,2})", "numeric", "NA_to_Inf")))
##                       regex      type        fun
## 1                     new_?      <NA>       <NA>
## 2         (?P<diagnosis>.*) character       <NA>
## 3                         _      <NA>       <NA>
## 4                (?P<sex>.) character       <NA>
## 5 (?P<min_years>0|[1-9]{2})   numeric as.numeric
## 6 (?P<max_years>[0-9]{0,2})   numeric  NA_to_Inf
(named.pattern <- paste(named.pattern.df[["regex"]], collapse=""))
## [1] "new_?(?P<diagnosis>.*)_(?P<sex>.)(?P<min_years>0|[1-9]{2})(?P<max_years>[0-9]{0,2})"

The names defined in those string literals can be parsed and output to R with several functions,

(exec.result <- regexec(named.pattern, names(who), perl=TRUE))
## [[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[2]]
##           diagnosis       sex min_years max_years 
##         1         5         8         9        11 
## attr(,"match.length")
## [1] 12  2  1  2  2
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[3]]
##           diagnosis       sex min_years max_years 
##         1         4         8         9        11 
## attr(,"match.length")
## [1] 10  3  1  2  0
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[4]]
##           diagnosis       sex min_years max_years 
##         1         5         8         9        10 
## attr(,"match.length")
## [1] 11  2  1  1  2
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
(match.result <- regmatches(names(who), exec.result))
## [[1]]
## character(0)
## 
## [[2]]
##                     diagnosis            sex      min_years      max_years 
## "new_sp_m5564"           "sp"            "m"           "55"           "64" 
## 
## [[3]]
##                 diagnosis          sex    min_years    max_years 
## "newrel_f65"        "rel"          "f"         "65"           "" 
## 
## [[4]]
##                   diagnosis           sex     min_years     max_years 
## "new_ep_f014"          "ep"           "f"           "0"          "14"
match.result[sapply(match.result, length)==0] <- NA
do.call(rbind, match.result)
##                     diagnosis sex min_years max_years
## [1,] NA             NA        NA  NA        NA       
## [2,] "new_sp_m5564" "sp"      "m" "55"      "64"     
## [3,] "newrel_f65"   "rel"     "f" "65"      ""       
## [4,] "new_ep_f014"  "ep"      "f" "0"       "14"
rex::matches(names(who), named.pattern)
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>      <NA>      <NA>
## 2        sp    m        55        64
## 3       rel    f        65          
## 4        ep    f         0        14
rematch2::re_match(names(who), named.pattern)
## # A tibble: 4 x 6
##   diagnosis sex   min_years max_years .text        .match      
##   <chr>     <chr> <chr>     <chr>     <chr>        <chr>       
## 1 <NA>      <NA>  <NA>       <NA>     country      <NA>        
## 2 sp        m     55        "64"      new_sp_m5564 new_sp_m5564
## 3 rel       f     65        ""        newrel_f65   newrel_f65  
## 4 ep        f     0         "14"      new_ep_f014  new_ep_f014
re2r::re2_match(names(who), named.pattern)
##      .match         diagnosis sex min_years max_years
## [1,] NA             NA        NA  NA        NA       
## [2,] "new_sp_m5564" "sp"      "m" "55"      "64"     
## [3,] "newrel_f65"   "rel"     "f" "65"      ""       
## [4,] "new_ep_f014"  "ep"      "f" "0"       "14"

The results above are all of type character, and can be converted to numeric using the other information defined in named.pattern.df as above. This approach using named capture groups in the regex string literals does not have any substantial advantages to the previous approach with un-named capture groups (storing the name in a separate column of pattern.df).

Conclusion

We have seen that defining a data table by row is much easier to read/edit than definition by column. Likewise, it is possible to define a complex regex in terms of sub-patterns together with related info. We created a data table with one row per sub-pattern, and columns for optional name, type, and conversion function. After defining the regex in this manner, it is possible to derive all of the information you need as input to various R functions (complete pattern, group names, list of type conversion functions, etc). In contrast to direct definition of these various inputs, the proposed approach is much easier to read/edit, and is the main idea/novelty/useful feature of the nc package.