Stress testing reshape operations on list columns
My paper about regular expressions for data
reshaping was recently accepted
into R journal. It proposes a new syntax for wide-to-tall data
reshaping, and uses data.table::melt internally. I recently
submitted a few PRs to data.table in order to improve its
functionality. Here are two examples related to consistency of reshape
operations on list columns. Code was run with R-4.1.0 and
data.table-1.14.1 (from GitHub).
What is a missing list element?
Let us begin with a simple exploration about the nature of missing values in R.
is.na(L <- list(NULL, NA, c(NA,NA)))
## [1] FALSE  TRUE FALSE
The above shows that base R considers a list element missing if it
contains a scalar NA.
str(na.omit(L))
## List of 3
##  $ : NULL
##  $ : logi NA
##  $ : logi [1:2] NA NA
str(L[!is.na(L)])
## List of 2
##  $ : NULL
##  $ : logi [1:2] NA NA
I assumed the two results above should be the same, so I think this is
a bug in na.omit. To simplify from now on let us assume that
is.na
gives us the one true definition of what is a missing value in R.
Reshape with missing list element
Now let’s put that list into a column of a data table,
library(data.table)
## data.table 1.14.3 IN DEVELOPMENT built 2022-07-20 16:55:52 UTC; th798 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com
(DT.wide <- data.table(L))
##         L
##    <list>
## 1:       
## 2:     NA
## 3:  NA,NA
Reshaping that data table gives
(DT.melt <- melt(DT.wide, measure="L"))
##    variable  value
##      <fctr> <list>
## 1:        L       
## 2:        L     NA
## 3:        L  NA,NA
(DT.pivot <- tidyr::pivot_longer(DT.wide, cols="L"))
## # A tibble: 3 × 2
##   name  value    
##   <chr> <list>   
## 1 L     <NULL>   
## 2 L     <lgl [1]>
## 3 L     <lgl [2]>
Let’s try to remove missing values with melt,
na.omit(DT.melt)
##    variable  value
##      <fctr> <list>
## 1:        L       
## 2:        L  NA,NA
melt(DT.wide, measure="L", na.rm=TRUE)
##    variable  value
##      <fctr> <list>
## 1:        L       
## 2:        L  NA,NA
Both results above seem to be correct. Now something strange happens when we try to remove missing values with tidyr,
na.omit(DT.pivot)
## # A tibble: 3 × 2
##   name  value    
##   <chr> <list>   
## 1 L     <NULL>   
## 2 L     <lgl [1]>
## 3 L     <lgl [2]>
tidyr::pivot_longer(DT.wide, cols="L", values_drop_na=TRUE)
## # A tibble: 2 × 2
##   name  value    
##   <chr> <list>   
## 1 L     <lgl [1]>
## 2 L     <lgl [2]>
The two results above are different, and both seem to be
incorrect. Both incorrectly contain NA, and the second does not have
NULL (which is not considered missing by is.na).
Reshape missing list column
What if there are several list columns to reshape?
dt.wide <- data.table(num_1=1, num_2=2, list_1=list(1), list_3=list(3))
print(dt.wide, class=TRUE)
##    num_1 num_2 list_1 list_3
##    <num> <num> <list> <list>
## 1:     1     2      1      3
Above is some data with a “missing” list_2 column. We reshape below:
(melt.tall <- melt(dt.wide, measure=measure(value.name, int=as.integer)))
##      int   num   list
##    <int> <num> <list>
## 1:     1     1      1
## 2:     2     2     NA
## 3:     3    NA      3
Note that in melt.tall the missing list_2 column is represented
by NA, which is recognized as missing by is.na. If we exclude
missing values the result is consistent,
na.omit(melt.tall)
##      int   num   list
##    <int> <num> <list>
## 1:     1     1      1
melt(
  dt.wide,
  measure=measure(value.name, int=as.integer),
  na.rm=TRUE)
##      int   num   list
##    <int> <num> <list>
## 1:     1     1      1
Now we do the same operation using tidyr,
names_pattern <- "(.*)_(.*)"
(pivot.tall <- tidyr::pivot_longer(
  dt.wide,
  matches(names_pattern),
  names_pattern=names_pattern,
  names_to=c(".value", "int")))
## # A tibble: 3 × 3
##   int     num list     
##   <chr> <dbl> <list>   
## 1 1         1 <dbl [1]>
## 2 2         2 <NULL>   
## 3 3        NA <dbl [1]>
Note that in pivot.tall the missing list_2 column is represented
by NULL.  However as we have seen above, R is.na recognizes scalar
NA (not NULL) as a missing list element. This results in an
inconsistency if we remove rows with any NA:
na.omit(pivot.tall)
## # A tibble: 2 × 3
##   int     num list     
##   <chr> <dbl> <list>   
## 1 1         1 <dbl [1]>
## 2 2         2 <NULL>
tidyr::pivot_longer(
  dt.wide,
  matches(names_pattern),
  names_pattern=names_pattern,
  names_to=c(".value", "int"),
  values_drop_na=TRUE)
## # A tibble: 3 × 3
##   int     num list     
##   <chr> <dbl> <list>   
## 1 1         1 <dbl [1]>
## 2 2         2 <NULL>   
## 3 3        NA <dbl [1]>
The na.omit result incorrectly includes row 2, and the
values_drop_na=TRUE result incorrectly includes rows 2 and 3.
Conclusion
When there are list columns, the data.table functions are more
correct and consistent in terms of reshape operations and treatment of
missing values.
Follow-up
I posted a thread on R-devel about na.omit on lists and data frames
with list columns.