Defining data by row and regex by sub-pattern
My paper about regular expressions for data reshaping was recently accepted into the R Journal. It describes the novelty of my nc R package, which solves the problem of “separation of related concepts” that occurs with other data reshaping packages. I wrote “the nc syntax should be preferred for more complex patterns (with more groups) in order to keep the group names and sub-patterns closer and easier to maintain/read in the code.” This is related to the idea that a data table is often easier to read/edit when defined as a collection of rows (versus columns).
medRxiv paper code example
In our recent paper about spatially explicit stochastic disease models (currently in peer review), there is some R code that defines time windows. It was my suggestion to present the data/code as follows, with each time window on a line:
one.window <- function(start, end, r0)data.frame(start, end, r0)
library(lubridate)
(time.window.args <- rbind(# Specify the components of 5 time windows
  one.window(mdy("1-1-20"),mdy("1-31-20"),3.0),
  one.window(mdy("2-1-20"),mdy("2-15-20"),0.8),
  one.window(mdy("2-16-20"),mdy("3-10-20"),0.8),
  one.window(mdy("3-11-20"),mdy("3-21-20"),1.4),
  one.window(mdy("3-22-20"),mdy("5-1-20"),1.4)))
##        start        end  r0
## 1 2020-01-01 2020-01-31 3.0
## 2 2020-02-01 2020-02-15 0.8
## 3 2020-02-16 2020-03-10 0.8
## 4 2020-03-11 2020-03-21 1.4
## 5 2020-03-22 2020-05-01 1.4
The code above is easy to read because all of the information related to a given time window is defined on the same line. It is easy to add/remove/change windows for the same reason. In contrast, another way to define these data would be with a column on each line,
data.frame(
  start=c("1-1-20","2-1-20","2-16-20","3-11-20","3-22-20"),
  end=c("1-31-20","2-15-20","3-10-20","3-21-20","5-1-20"),
  r0=c(3,0.8,0.8,1.4,1.4))
##     start     end  r0
## 1  1-1-20 1-31-20 3.0
## 2  2-1-20 2-15-20 0.8
## 3 2-16-20 3-10-20 0.8
## 4 3-11-20 3-21-20 1.4
## 5 3-22-20  5-1-20 1.4
The format above is much more difficult to read and understand: what is the r0 value for Mar 17? You first have to figure out that the day falls in the fourth start/end pair, then you have to find the fourth r0 value. Reading is a bit easier if you pad the entries with spaces so that the columns line up, as below:
data.frame(
  start =c( "1-1-20", "2-1-20", "2-16-20", "3-11-20", "3-22-20"),
  end   =c("1-31-20","2-15-20", "3-10-20", "3-21-20",  "5-1-20"),
  r0    =c(        3,      0.8,       0.8,       1.4,       1.4))
##     start     end  r0
## 1  1-1-20 1-31-20 3.0
## 2  2-1-20 2-15-20 0.8
## 3 2-16-20 3-10-20 0.8
## 4 3-11-20 3-21-20 1.4
## 5 3-22-20  5-1-20 1.4
Column-wise storage also makes it more difficult to add/remove/edit entries. This observation helps to explain why CSV files are such a popular data storage format.
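The row-wise idea is not specific to R. As a minimal sketch in Python (the language used again later in this post), each time window can be one tuple per line, and the question about Mar 17 becomes a simple scan over rows. The names `time_windows` and `r0_on` are hypothetical, chosen only for this illustration; the values are copied from the R code above.

```python
from datetime import date

# One time window per line: (start, end, r0), same values as the R code above.
time_windows = [
    (date(2020, 1, 1), date(2020, 1, 31), 3.0),
    (date(2020, 2, 1), date(2020, 2, 15), 0.8),
    (date(2020, 2, 16), date(2020, 3, 10), 0.8),
    (date(2020, 3, 11), date(2020, 3, 21), 1.4),
    (date(2020, 3, 22), date(2020, 5, 1), 1.4),
]

def r0_on(day):
    """Return r0 of the window containing day, or None if no window does."""
    for start, end, r0 in time_windows:
        if start <= day <= end:
            return r0
    return None

print(r0_on(date(2020, 3, 17)))
```

Adding, removing, or editing a window touches exactly one line, which is the same property that makes the rbind version easy to maintain.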
Analogy with regular expressions for data reshaping
In wide-to-long data reshaping with regular column names, we can use a regex pattern to (1) identify the input columns to reshape, and (2) extract data from those column names for storage in the output. For example, consider the following data,
# similar to data(who, package="tidyr")
who <- data.frame(country="usa", new_sp_m5564=1, newrel_f65=2, new_ep_f014=3)
Each column after the first has a regular name:
- new,
- an optional underscore,
- a diagnosis code (sp, rel, ep),
- an underscore,
- sex (m or f),
- a min age in years,
- an optional max age.
 
Say we want to reshape these “wide” data into a “longer” table with
columns diagnosis, sex, min_years, max_years. The most basic
method uses a capturing regular expression with strcapture,
pattern <- "new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})"
proto <- data.frame(
  diagnosis=character(),
  sex=character(),
  min_years=numeric(),
  max_years=numeric())
transform(
  strcapture(pattern, names(who), proto)
, max_years=ifelse(is.na(max_years) & is.finite(min_years), Inf, max_years))
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>        NA        NA
## 2        sp    m        55        64
## 3       rel    f        65       Inf
## 4        ep    f         0        14
Looking at the above code, we see a relatively complex regular
expression pattern with four capture groups. To use it with
strcapture we need to define proto, a data frame which specifies
the type of each output column. There are two drawbacks to notice with
this approach:
- types are defined in proto, but converting NA to Inf must be done
  after strcapture, here using transform.
- there is separation between the definitions of the related concepts
  for each capture group (the sub-pattern is defined in pattern,
  whereas the names and types are defined in proto).
An attempt to remove this separation:
p <- function(regex, name=NA, type=NA, fun=NA)data.frame(regex, name, type, fun)
NA_to_Inf <- function(s){
  x <- as.numeric(s)
  ifelse(is.na(x), Inf, x)
}
(pattern.df <- rbind(
  p("new_?"),
  p("(.*)", "diagnosis", "character"),
  p("_"),
  p("(.)", "sex", "character"),
  p("(0|[1-9]{2})", "min_years", "numeric", "as.numeric"),
  p("([0-9]{0,2})", "max_years", "numeric", "NA_to_Inf")))
##          regex      name      type        fun
## 1        new_?      <NA>      <NA>       <NA>
## 2         (.*) diagnosis character       <NA>
## 3            _      <NA>      <NA>       <NA>
## 4          (.)       sex character       <NA>
## 5 (0|[1-9]{2}) min_years   numeric as.numeric
## 6 ([0-9]{0,2}) max_years   numeric  NA_to_Inf
In the definition above, all of the information related to each capture group is defined on the same line, so it is much easier to read/edit! We can use it for doing the same computation via
(my.pattern <- paste(pattern.df[["regex"]], collapse=""))
## [1] "new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})"
(my.proto <- with(pattern.df, {
  is.group <- !is.na(name)
  structure(
    lapply(type[is.group], function(tname)get(tname)()),
    names = name[is.group])
}))
## $diagnosis
## character(0)
## 
## $sex
## character(0)
## 
## $min_years
## numeric(0)
## 
## $max_years
## numeric(0)
(strcapture.out <- strcapture(my.pattern, names(who), my.proto))
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>        NA        NA
## 2        sp    m        55        64
## 3       rel    f        65        NA
## 4        ep    f         0        14
We can furthermore use the fun column to automatically apply the
conversion to Inf:
(t.fun.df <- subset(pattern.df, !is.na(fun)))
##          regex      name    type        fun
## 5 (0|[1-9]{2}) min_years numeric as.numeric
## 6 ([0-9]{0,2}) max_years numeric  NA_to_Inf
for(t.fun.i in 1:nrow(t.fun.df)){
  t.fun.row <- t.fun.df[t.fun.i,]
  fun <- get(t.fun.row[["fun"]])
  name <- t.fun.row[["name"]]
  value <- strcapture.out[[name]]
  strcapture.out[[name]] <- fun(value)
}
strcapture.out
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>        NA       Inf
## 2        sp    m        55        64
## 3       rel    f        65       Inf
## 4        ep    f         0        14
So the code above demonstrates that it is technically POSSIBLE to use
strcapture with a complex regex defined in a way that has all info
for a particular capture group on a single line. Furthermore, we can
use the same info to do the reshape, using the new
data.table::measurev() function which I recently
implemented.
library(data.table)
fun.list <- with(subset(pattern.df, !is.na(name)), {
  structure(lapply(fun, function(f)if(is.na(f))NULL else get(f)), names=name)
})
(who.dt <- data.table(who))
##    country new_sp_m5564 newrel_f65 new_ep_f014
## 1:     usa            1          2           3
who.long <- data.table::melt(
  who.dt, measure.vars=measurev(fun.list, pattern=my.pattern))
print(who.long, class=TRUE)
##    country diagnosis    sex min_years max_years value
##     <char>    <char> <char>     <num>     <num> <num>
## 1:     usa        sp      m        55        64     1
## 2:     usa       rel      f        65       Inf     2
## 3:     usa        ep      f         0        14     3
Simpler version using nc package
The whole point of the nc package is that you can define your regex
in a similar way, so that there is one line which has all the related
info for each group (name, sub-pattern, conversion function). The nc
code which is analogous to the pattern.df code above would be
nc.pattern <- list(
  "new_?",
  diagnosis=".*", 
  "_",
  sex=".", 
  min_years="0|[1-9]{2}", as.numeric,
  max_years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x)))
Having defined the pattern in this way, nc provides a function to do
the matching and capturing,
capture.result <- nc::capture_first_vec(
  names(who), nc.pattern, nomatch.error = FALSE)
print(capture.result, class=TRUE)
##    diagnosis    sex min_years max_years
##       <char> <char>     <num>     <num>
## 1:      <NA>   <NA>        NA        NA
## 2:        sp      m        55        64
## 3:       rel      f        65       Inf
## 4:        ep      f         0        14
and another function to do the reshaping,
reshape.result <- nc::capture_melt_single(who, nc.pattern)
print(reshape.result, class=TRUE)
##    country diagnosis    sex min_years max_years value
##     <char>    <char> <char>     <num>     <num> <num>
## 1:     usa        sp      m        55        64     1
## 2:     usa       rel      f        65       Inf     2
## 3:     usa        ep      f         0        14     3
Doing the same thing in Python
Can we do the same in Python? YES.
import pandas as pd
import math
who = pd.DataFrame({
    "country":["usa"], "new_sp_m5564":[1], "newrel_f65":[2], "new_ep_f014":[3]})
pattern_list = [
    "new_?",
    ("(.*)", "diagnosis"),
    "_",
    ("(.)", "sex"),
    ("(0|[1-9]{2})", "min_years", float),
    ("([0-9]{0,2})", "max_years", lambda y: y.where(y!="", math.inf).astype(float)),
    ]
pattern = ""
name_list = []
fun_dict = {}
for pat_or_tup in pattern_list:
    if isinstance(pat_or_tup, str):
        pat = pat_or_tup
        name = None
    elif len(pat_or_tup) == 2:
        pat, name = pat_or_tup
    else:
        pat, name, fun = pat_or_tup
        fun_dict[name] = fun
    if name:
        name_list.append(name)
    pattern += pat
pattern
## 'new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})'
name_list
## ['diagnosis', 'sex', 'min_years', 'max_years']
fun_dict
## {'min_years': <class 'float'>, 'max_years': <function <lambda> at 0x7f81737cdd90>}
match_df = who.columns.to_series().str.extract(pattern)
match_df.columns = name_list
match_df
##              diagnosis  sex min_years max_years
## country            NaN  NaN       NaN       NaN
## new_sp_m5564        sp    m        55        64
## newrel_f65         rel    f        65          
## new_ep_f014         ep    f         0        14
for col_name, fun in fun_dict.items():
    match_df[col_name] = match_df[col_name].transform(fun)
match_df
##              diagnosis  sex  min_years  max_years
## country            NaN  NaN        NaN        NaN
## new_sp_m5564        sp    m       55.0       64.0
## newrel_f65         rel    f       65.0        inf
## new_ep_f014         ep    f        0.0       14.0
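The Python code above stops at capturing, whereas the R sections also reshaped the data. As a rough sketch of that last step using only the standard library (`re`), the same one-line-per-sub-pattern list can drive a wide-to-long reshape of a single row stored as a dict. The helper names (`build`, `melt_row`, `to_max_years`) are hypothetical, and the conversion functions here act on one string at a time rather than on a pandas Series.

```python
import math
import re

def to_max_years(s):
    # An empty capture means there is no upper age limit.
    return math.inf if s == "" else float(s)

# One line per sub-pattern: a plain string, or (regex, name[, conversion]).
pattern_list = [
    "new_?",
    ("(.*)", "diagnosis"),
    "_",
    ("(.)", "sex"),
    ("(0|[1-9]{2})", "min_years", float),
    ("([0-9]{0,2})", "max_years", to_max_years),
]

def build(pattern_list):
    """Derive the full regex, group names, and conversions from the rows."""
    rows = [(item,) if isinstance(item, str) else item for item in pattern_list]
    regex = "".join(row[0] for row in rows)
    names = [row[1] for row in rows if len(row) > 1]
    funs = {row[1]: row[2] for row in rows if len(row) > 2}
    return re.compile(regex), names, funs

def melt_row(row, pattern_list):
    """Turn one wide row (dict) into a list of long rows (dicts)."""
    compiled, names, funs = build(pattern_list)
    # Columns that do not match the pattern are kept as id variables.
    id_vars = {k: v for k, v in row.items() if not compiled.fullmatch(k)}
    long_rows = []
    for col, value in row.items():
        m = compiled.fullmatch(col)
        if m is None:
            continue  # not a measure column, e.g. "country"
        captured = {n: funs.get(n, str)(g) for n, g in zip(names, m.groups())}
        long_rows.append({**id_vars, **captured, "value": value})
    return long_rows

who_row = {"country": "usa", "new_sp_m5564": 1, "newrel_f65": 2, "new_ep_f014": 3}
for long_row in melt_row(who_row, pattern_list):
    print(long_row)
```

Each printed dict corresponds to one row of the long table: the id column, the four captured fields, and the value.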
Named capture groups in regex string literals
One other approach worth mentioning here is using regular expression string literals with named capture groups. That would result in something like
n <- function(regex, type=NA, fun=NA)data.frame(regex, type, fun)
(named.pattern.df <- rbind(
  n("new_?"),
  n("(?P<diagnosis>.*)", "character"),
  n("_"),
  n("(?P<sex>.)", "character"),
  n("(?P<min_years>0|[1-9]{2})", "numeric", "as.numeric"),
  n("(?P<max_years>[0-9]{0,2})", "numeric", "NA_to_Inf")))
##                       regex      type        fun
## 1                     new_?      <NA>       <NA>
## 2         (?P<diagnosis>.*) character       <NA>
## 3                         _      <NA>       <NA>
## 4                (?P<sex>.) character       <NA>
## 5 (?P<min_years>0|[1-9]{2})   numeric as.numeric
## 6 (?P<max_years>[0-9]{0,2})   numeric  NA_to_Inf
(named.pattern <- paste(named.pattern.df[["regex"]], collapse=""))
## [1] "new_?(?P<diagnosis>.*)_(?P<sex>.)(?P<min_years>0|[1-9]{2})(?P<max_years>[0-9]{0,2})"
The names defined in those string literals can be parsed and output to R with several functions,
(exec.result <- regexec(named.pattern, names(who), perl=TRUE))
## [[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[2]]
##           diagnosis       sex min_years max_years 
##         1         5         8         9        11 
## attr(,"match.length")
## [1] 12  2  1  2  2
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[3]]
##           diagnosis       sex min_years max_years 
##         1         4         8         9        11 
## attr(,"match.length")
## [1] 10  3  1  2  0
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[4]]
##           diagnosis       sex min_years max_years 
##         1         5         8         9        10 
## attr(,"match.length")
## [1] 11  2  1  1  2
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
(match.result <- regmatches(names(who), exec.result))
## [[1]]
## character(0)
## 
## [[2]]
##                     diagnosis            sex      min_years      max_years 
## "new_sp_m5564"           "sp"            "m"           "55"           "64" 
## 
## [[3]]
##                 diagnosis          sex    min_years    max_years 
## "newrel_f65"        "rel"          "f"         "65"           "" 
## 
## [[4]]
##                   diagnosis           sex     min_years     max_years 
## "new_ep_f014"          "ep"           "f"           "0"          "14"
match.result[sapply(match.result, length)==0] <- NA
do.call(rbind, match.result)
##                     diagnosis sex min_years max_years
## [1,] NA             NA        NA  NA        NA       
## [2,] "new_sp_m5564" "sp"      "m" "55"      "64"     
## [3,] "newrel_f65"   "rel"     "f" "65"      ""       
## [4,] "new_ep_f014"  "ep"      "f" "0"       "14"
rex::matches(names(who), named.pattern)
##   diagnosis  sex min_years max_years
## 1      <NA> <NA>      <NA>      <NA>
## 2        sp    m        55        64
## 3       rel    f        65          
## 4        ep    f         0        14
rematch2::re_match(names(who), named.pattern)
## # A tibble: 4 x 6
##   diagnosis sex   min_years max_years .text        .match      
##   <chr>     <chr> <chr>     <chr>     <chr>        <chr>       
## 1 <NA>      <NA>  <NA>       <NA>     country      <NA>        
## 2 sp        m     55        "64"      new_sp_m5564 new_sp_m5564
## 3 rel       f     65        ""        newrel_f65   newrel_f65  
## 4 ep        f     0         "14"      new_ep_f014  new_ep_f014
re2r::re2_match(names(who), named.pattern)
##      .match         diagnosis sex min_years max_years
## [1,] NA             NA        NA  NA        NA       
## [2,] "new_sp_m5564" "sp"      "m" "55"      "64"     
## [3,] "newrel_f65"   "rel"     "f" "65"      ""       
## [4,] "new_ep_f014"  "ep"      "f" "0"       "14"
The results above are all of type character, and can be converted to
numeric using the other information defined in named.pattern.df as
above. This approach using named capture groups in the regex string
literals does not have any substantial advantages over the previous
approach with un-named capture groups (storing the name in a
separate column of pattern.df).
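Python's `re` module understands the same `(?P<name>...)` syntax, so the named-group variant can be sketched there as well. In this minimal sketch, `named_rows`, `conversions`, and `capture` are hypothetical names; each line pairs a sub-pattern with an optional conversion function, and the group names are read back from the compiled sub-patterns and from `groupdict()`.

```python
import math
import re

# One line per sub-pattern: (regex with optional named group, conversion).
named_rows = [
    ("new_?", None),
    ("(?P<diagnosis>.*)", None),
    ("_", None),
    ("(?P<sex>.)", None),
    ("(?P<min_years>0|[1-9]{2})", float),
    ("(?P<max_years>[0-9]{0,2})", lambda s: math.inf if s == "" else float(s)),
]
named_pattern = re.compile("".join(regex for regex, _ in named_rows))

# Map group name -> conversion by compiling each sub-pattern on its own
# and reading the names it defines from groupindex.
conversions = {}
for regex, fun in named_rows:
    for group_name in re.compile(regex).groupindex:
        if fun is not None:
            conversions[group_name] = fun

def capture(subject):
    """Return a dict of converted groups, or None when there is no match."""
    m = named_pattern.match(subject)
    if m is None:
        return None
    return {k: conversions.get(k, str)(v) for k, v in m.groupdict().items()}

for col in ["country", "new_sp_m5564", "newrel_f65", "new_ep_f014"]:
    print(col, capture(col))
```

As in R, the names now live inside the regex strings, so they must be parsed back out before they can be paired with conversion functions, which is extra work compared to storing the name next to the sub-pattern.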
Conclusion
We have seen that defining a data table by row is much easier to
read/edit than definition by column. Likewise, it is possible to
define a complex regex in terms of sub-patterns together with related
info. We created a data table with one row per sub-pattern, and
columns for optional name, type, and conversion function. After
defining the regex in this manner, it is possible to derive all of the
information you need as input to various R functions (complete
pattern, group names, list of type conversion functions, etc). In
contrast to direct definition of these various inputs, the proposed
approach is much easier to read/edit, and is the main
idea/novelty/useful feature of the nc package.