Defining data by row and regex by sub-pattern
My paper about regular expressions for data reshaping was recently accepted into R journal. It describes the novelty of my nc R package, which solves the problem of “separation of related concepts” which happens with other data reshaping packages. I wrote “the nc syntax should be preferred for more complex patterns (with more groups) in order to keep the group names and sub-patterns closer and easier to maintain/read in the code.” This is related to the idea that a data table is often easier to read/edit when defined as a collection of rows (versus columns).
medRxiv paper code example
In our recent paper about spatially explicit stochastic disease models (currently in peer review), there is some R code that defines time windows. It was my suggestion to present the data/code as follows, with each time window on a line:
one.window <- function(start, end, r0)data.frame(start, end, r0)
library(lubridate)
(time.window.args <- rbind(# Specify the components of 5 time windows
one.window(mdy("1-1-20"),mdy("1-31-20"),3.0),
one.window(mdy("2-1-20"),mdy("2-15-20"),0.8),
one.window(mdy("2-16-20"),mdy("3-10-20"),0.8),
one.window(mdy("3-11-20"),mdy("3-21-20"),1.4),
one.window(mdy("3-22-20"),mdy("5-1-20"),1.4)))
## start end r0
## 1 2020-01-01 2020-01-31 3.0
## 2 2020-02-01 2020-02-15 0.8
## 3 2020-02-16 2020-03-10 0.8
## 4 2020-03-11 2020-03-21 1.4
## 5 2020-03-22 2020-05-01 1.4
The code above is easy to read because all of the information related to a given time window is defined on the same line. It is easy to add/remove/change windows for the same reason. In contrast, another way to define these data would be with a column on each line,
data.frame(
start=c("1-1-20","2-1-20","2-16-20","3-11-20","3-22-20"),
end=c("1-31-20","2-15-20","3-10-20","3-21-20","5-1-20"),
r0=c(3,0.8,0.8,1.4,1.4))
## start end r0
## 1 1-1-20 1-31-20 3.0
## 2 2-1-20 2-15-20 0.8
## 3 2-16-20 3-10-20 0.8
## 4 3-11-20 3-21-20 1.4
## 5 3-22-20 5-1-20 1.4
This format above is much more difficult to read and understand – what is the r0 value for Mar 17? You have to first figure out that day falls in the fourth start/end, then you have to find the fourth r0 value. Reading is a bit easier if you add a bunch of spaces, as below:
data.frame(
start =c( "1-1-20", "2-1-20", "2-16-20", "3-11-20", "3-22-20"),
end =c("1-31-20","2-15-20", "3-10-20", "3-21-20", "5-1-20"),
r0 =c( 3, 0.8, 0.8, 1.4, 1.4))
## start end r0
## 1 1-1-20 1-31-20 3.0
## 2 2-1-20 2-15-20 0.8
## 3 2-16-20 3-10-20 0.8
## 4 3-11-20 3-21-20 1.4
## 5 3-22-20 5-1-20 1.4
Column-wise storage also makes it more difficult to add/remove/edit entries. This observation helps to explain why CSV files are such a popular data storage format.
Analog with regular expressions for data reshaping
In wide-to-long data reshaping with regular column names we can use a regex pattern to (1) identify the input columns to reshape, and (2) extract data from those columns for storage in the output. For example consider the following data,
# similar to data(who, package="tidyr")
who <- data.frame(country="usa", new_sp_m5564=1, newrel_f65=2, new_ep_f014=3)
Each column after the first has a regular name:
- new,
- an optional underscore,
- a diagnosis code (sp, rel, ep),
- sex (m or f),
- a min age in years,
- an optional max age.
Say we want to reshape these “wide” data into a “longer” table with
columns diagnosis
, sex
, min_years
, max_years
. The most basic
method is by using a capturing regular expression,
pattern <- "new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})"
proto <- data.frame(
diagnosis=character(),
sex=character(),
min_years=numeric(),
max_years=numeric())
transform(
strcapture(pattern, names(who), proto)
, max_years=ifelse(is.na(max_years) & is.finite(min_years), Inf, max_years))
## diagnosis sex min_years max_years
## 1 <NA> <NA> NA NA
## 2 sp m 55 64
## 3 rel f 65 Inf
## 4 ep f 0 14
Looking at the above code, we see a relatively complex regular
expression pattern
with four capture groups. To use it with
strcapture
we need to define proto
, a data frame which specifies
the type of each output column. There are two drawbacks to notice with
this approach:
- types are defined in
proto
but converting NA to Inf must be done afterstrcapture
, here usingtransform
. - there is separation between the definitions of the related concepts
for each capture group (sub-pattern is defined in
pattern
whereas names and types defined inproto
).
An attempt to remove this separation:
p <- function(regex, name=NA, type=NA, fun=NA)data.frame(regex, name, type, fun)
NA_to_Inf <- function(s){
x <- as.numeric(s)
ifelse(is.na(x), Inf, x)
}
(pattern.df <- rbind(
p("new_?"),
p("(.*)", "diagnosis", "character"),
p("_"),
p("(.)", "sex", "character"),
p("(0|[1-9]{2})", "min_years", "numeric", "as.numeric"),
p("([0-9]{0,2})", "max_years", "numeric", "NA_to_Inf")))
## regex name type fun
## 1 new_? <NA> <NA> <NA>
## 2 (.*) diagnosis character <NA>
## 3 _ <NA> <NA> <NA>
## 4 (.) sex character <NA>
## 5 (0|[1-9]{2}) min_years numeric as.numeric
## 6 ([0-9]{0,2}) max_years numeric NA_to_Inf
In the definition above, all of the information related to each capture group is defined on the same line, so it is much easier to read/edit! We can use it for doing the same computation via
(my.pattern <- paste(pattern.df[["regex"]], collapse=""))
## [1] "new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})"
(my.proto <- with(pattern.df, {
is.group <- !is.na(name)
structure(
lapply(type[is.group], function(tname)get(tname)()),
names = name[is.group])
}))
## $diagnosis
## character(0)
##
## $sex
## character(0)
##
## $min_years
## numeric(0)
##
## $max_years
## numeric(0)
(strcapture.out <- strcapture(my.pattern, names(who), my.proto))
## diagnosis sex min_years max_years
## 1 <NA> <NA> NA NA
## 2 sp m 55 64
## 3 rel f 65 NA
## 4 ep f 0 14
We can furthermore use the fun
column to automatically apply the
conversion to Inf:
(t.fun.df <- subset(pattern.df, !is.na(fun)))
## regex name type fun
## 5 (0|[1-9]{2}) min_years numeric as.numeric
## 6 ([0-9]{0,2}) max_years numeric NA_to_Inf
for(t.fun.i in 1:nrow(t.fun.df)){
t.fun.row <- t.fun.df[t.fun.i,]
fun <- get(t.fun.row[["fun"]])
name <- t.fun.row[["name"]]
value <- strcapture.out[[name]]
strcapture.out[[name]] <- fun(value)
}
strcapture.out
## diagnosis sex min_years max_years
## 1 <NA> <NA> NA Inf
## 2 sp m 55 64
## 3 rel f 65 Inf
## 4 ep f 0 14
So the code above demonstrates that it is techinically POSSIBLE to use
strcapture
with a complex regex defined in a way that has all info
for a particular capture group on a single line. Furthermore we can
use the same info to do the reshape, using the new
data.table::measurev()
function which I recently
implemented.
library(data.table)
fun.list <- with(subset(pattern.df, !is.na(name)), {
structure(lapply(fun, function(f)if(is.na(f))NULL else get(f)), names=name)
})
(who.dt <- data.table(who))
## country new_sp_m5564 newrel_f65 new_ep_f014
## 1: usa 1 2 3
who.long <- data.table::melt(
who.dt, measure.vars=measurev(fun.list, pattern=my.pattern))
print(who.long, class=TRUE)
## country diagnosis sex min_years max_years value
## <char> <char> <char> <num> <num> <num>
## 1: usa sp m 55 64 1
## 2: usa rel f 65 Inf 2
## 3: usa ep f 0 14 3
Simpler version using nc package
The whole point of the nc
package is that you can define your regex
in a similar way, so that there is one line which has all the related
info for each group (name, sub-pattern, conversion function). The nc
code which is analogous to the pattern.df
code above would be
nc.pattern <- list(
"new_?",
diagnosis=".*",
"_",
sex=".",
min_years="0|[1-9]{2}", as.numeric,
max_years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x)))
Having defined the pattern in this way, nc
provides a function to do
the matching and capturing,
capture.result <- nc::capture_first_vec(
names(who), nc.pattern, nomatch.error = FALSE)
print(capture.result, class=TRUE)
## diagnosis sex min_years max_years
## <char> <char> <num> <num>
## 1: <NA> <NA> NA NA
## 2: sp m 55 64
## 3: rel f 65 Inf
## 4: ep f 0 14
and another function to do the reshaping,
reshape.result <- nc::capture_melt_single(who, nc.pattern)
print(reshape.result, class=TRUE)
## country diagnosis sex min_years max_years value
## <char> <char> <char> <num> <num> <num>
## 1: usa sp m 55 64 1
## 2: usa rel f 65 Inf 2
## 3: usa ep f 0 14 3
Doing the same thing in python
Can we do the same in python? YES.
import pandas as pd
import math
who = pd.DataFrame({
"country":["usa"], "new_sp_m5564":[1], "newrel_f65":[2], "new_ep_f014":[3]})
pattern_list = [
"new_?",
("(.*)", "diagnosis"),
"_",
("(.)", "sex"),
("(0|[1-9]{2})", "min_years", float),
("([0-9]{0,2})", "max_years", lambda y: y.where(y!="", math.inf).astype(float)),
]
pattern = ""
name_list = []
fun_dict = {}
for pat_or_tup in pattern_list:
if type(pat_or_tup) == type(""):
pat = pat_or_tup
name = None
elif len(pat_or_tup) == 2:
pat, name = pat_or_tup
else:
pat, name, fun = pat_or_tup
fun_dict[name] = fun
if name:
name_list.append(name)
pattern += pat
pattern
## 'new_?(.*)_(.)(0|[1-9]{2})([0-9]{0,2})'
name_list
## ['diagnosis', 'sex', 'min_years', 'max_years']
fun_dict
## {'min_years': <class 'float'>, 'max_years': <function <lambda> at 0x7f81737cdd90>}
match_df = who.columns.to_series().str.extract(pattern)
match_df.columns = name_list
match_df
## diagnosis sex min_years max_years
## country NaN NaN NaN NaN
## new_sp_m5564 sp m 55 64
## newrel_f65 rel f 65
## new_ep_f014 ep f 0 14
for col_name, fun in fun_dict.items():
match_df[col_name] = match_df[col_name].transform(fun)
match_df
## diagnosis sex min_years max_years
## country NaN NaN NaN NaN
## new_sp_m5564 sp m 55.0 64.0
## newrel_f65 rel f 65.0 inf
## new_ep_f014 ep f 0.0 14.0
Named capture groups in regex string literals
One other approach worth mentioning here is using regular expression string literals with named capture groups. That would result in something like
n <- function(regex, type=NA, fun=NA)data.frame(regex, type, fun)
(named.pattern.df <- rbind(
n("new_?"),
n("(?P<diagnosis>.*)", "character"),
n("_"),
n("(?P<sex>.)", "character"),
n("(?P<min_years>0|[1-9]{2})", "numeric", "as.numeric"),
n("(?P<max_years>[0-9]{0,2})", "numeric", "NA_to_Inf")))
## regex type fun
## 1 new_? <NA> <NA>
## 2 (?P<diagnosis>.*) character <NA>
## 3 _ <NA> <NA>
## 4 (?P<sex>.) character <NA>
## 5 (?P<min_years>0|[1-9]{2}) numeric as.numeric
## 6 (?P<max_years>[0-9]{0,2}) numeric NA_to_Inf
(named.pattern <- paste(named.pattern.df[["regex"]], collapse=""))
## [1] "new_?(?P<diagnosis>.*)_(?P<sex>.)(?P<min_years>0|[1-9]{2})(?P<max_years>[0-9]{0,2})"
The names defined in those string literals can be parsed and output to R with several functions,
(exec.result <- regexec(named.pattern, names(who), perl=TRUE))
## [[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[2]]
## diagnosis sex min_years max_years
## 1 5 8 9 11
## attr(,"match.length")
## [1] 12 2 1 2 2
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[3]]
## diagnosis sex min_years max_years
## 1 4 8 9 11
## attr(,"match.length")
## [1] 10 3 1 2 0
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[4]]
## diagnosis sex min_years max_years
## 1 5 8 9 10
## attr(,"match.length")
## [1] 11 2 1 1 2
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
(match.result <- regmatches(names(who), exec.result))
## [[1]]
## character(0)
##
## [[2]]
## diagnosis sex min_years max_years
## "new_sp_m5564" "sp" "m" "55" "64"
##
## [[3]]
## diagnosis sex min_years max_years
## "newrel_f65" "rel" "f" "65" ""
##
## [[4]]
## diagnosis sex min_years max_years
## "new_ep_f014" "ep" "f" "0" "14"
match.result[sapply(match.result, length)==0] <- NA
do.call(rbind, match.result)
## diagnosis sex min_years max_years
## [1,] NA NA NA NA NA
## [2,] "new_sp_m5564" "sp" "m" "55" "64"
## [3,] "newrel_f65" "rel" "f" "65" ""
## [4,] "new_ep_f014" "ep" "f" "0" "14"
rex::matches(names(who), named.pattern)
## diagnosis sex min_years max_years
## 1 <NA> <NA> <NA> <NA>
## 2 sp m 55 64
## 3 rel f 65
## 4 ep f 0 14
rematch2::re_match(names(who), named.pattern)
## # A tibble: 4 x 6
## diagnosis sex min_years max_years .text .match
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> country <NA>
## 2 sp m 55 "64" new_sp_m5564 new_sp_m5564
## 3 rel f 65 "" newrel_f65 newrel_f65
## 4 ep f 0 "14" new_ep_f014 new_ep_f014
re2r::re2_match(names(who), named.pattern)
## .match diagnosis sex min_years max_years
## [1,] NA NA NA NA NA
## [2,] "new_sp_m5564" "sp" "m" "55" "64"
## [3,] "newrel_f65" "rel" "f" "65" ""
## [4,] "new_ep_f014" "ep" "f" "0" "14"
The results above are all of type character, and can be converted to
numeric using the other information defined in named.pattern.df
as
above. This approach using named capture groups in the regex string
literals does not have any substantial advantages to the previous
approach with un-named capture groups (storing the name
in a
separate column of pattern.df
).
Conclusion
We have seen that defining a data table by row is much easier to
read/edit than definition by column. Likewise, it is possible to
define a complex regex in terms of sub-patterns together with related
info. We created a data table with one row per sub-pattern, and
columns for optional name, type, and conversion function. After
defining the regex in this manner, it is possible to derive all of the
information you need as input to various R functions (complete
pattern, group names, list of type conversion functions, etc). In
contrast to direct definition of these various inputs, the proposed
approach is much easier to read/edit, and is the main
idea/novelty/useful feature of the nc
package.