Toby Dylan Hocking | Stress testing reshape operations on list columns

My paper about regular expressions for data reshaping was recently accepted into R journal. It proposes a new syntax for wide-to-tall data reshaping, and uses data.table::melt internally. I recently submitted a few PRs to data.table in order to improve its functionality. Here are two examples related to consistency of reshape operations on list columns. Code was run with R-4.1.0 and data.table-1.14.1 (from GitHub).

What is a missing list element?

Let us begin with a simple exploration about the nature of missing values in R.

is.na(L <- list(NULL, NA, c(NA,NA)))

## [1] FALSE  TRUE FALSE

The above shows that base R considers a list element missing if it contains a scalar NA.

str(na.omit(L))

## List of 3
##  $ : NULL
##  $ : logi NA
##  $ : logi [1:2] NA NA

str(L[!is.na(L)])

## List of 2
##  $ : NULL
##  $ : logi [1:2] NA NA

I assumed the two results above should be the same, so I think this is a bug in na.omit. To simplify from now on let us assume that is.na gives us the one true definition of what is a missing value in R.

Reshape with missing list element

Now let’s put that list into a column of a data table,

library(data.table)

## data.table 1.14.3 IN DEVELOPMENT built 2022-07-20 16:55:52 UTC; th798 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com

(DT.wide <- data.table(L))

##         L
##    <list>
## 1:       
## 2:     NA
## 3:  NA,NA

Reshaping that data table gives

(DT.melt <- melt(DT.wide, measure="L"))

##    variable  value
##      <fctr> <list>
## 1:        L       
## 2:        L     NA
## 3:        L  NA,NA

(DT.pivot <- tidyr::pivot_longer(DT.wide, cols="L"))

## # A tibble: 3 × 2
##   name  value    
##   <chr> <list>   
## 1 L     <NULL>   
## 2 L     <lgl [1]>
## 3 L     <lgl [2]>

Let’s try to remove missing values with melt,

na.omit(DT.melt)

##    variable  value
##      <fctr> <list>
## 1:        L       
## 2:        L  NA,NA

melt(DT.wide, measure="L", na.rm=TRUE)

##    variable  value
##      <fctr> <list>
## 1:        L       
## 2:        L  NA,NA

Both results above seem to be correct. Now something strange happens when we try to remove missing values with tidyr,

na.omit(DT.pivot)

## # A tibble: 3 × 2
##   name  value    
##   <chr> <list>   
## 1 L     <NULL>   
## 2 L     <lgl [1]>
## 3 L     <lgl [2]>

tidyr::pivot_longer(DT.wide, cols="L", values_drop_na=TRUE)

## # A tibble: 2 × 2
##   name  value    
##   <chr> <list>   
## 1 L     <lgl [1]>
## 2 L     <lgl [2]>

The two results above are different, and both seem to be incorrect. Both incorrectly contain NA, and the second does not have NULL (which is not considered missing by is.na).

Reshape missing list column

What if there are several list columns to reshape?

dt.wide <- data.table(num_1=1, num_2=2, list_1=list(1), list_3=list(3))
print(dt.wide, class=TRUE)

##    num_1 num_2 list_1 list_3
##    <num> <num> <list> <list>
## 1:     1     2      1      3

Above is some data with a “missing” list_2 column. We reshape below:

(melt.tall <- melt(dt.wide, measure=measure(value.name, int=as.integer)))

##      int   num   list
##    <int> <num> <list>
## 1:     1     1      1
## 2:     2     2     NA
## 3:     3    NA      3

Note that in melt.tall the missing list_2 column is represented by NA, which is recognized as missing by is.na. If we exclude missing values the result is consistent,

na.omit(melt.tall)

##      int   num   list
##    <int> <num> <list>
## 1:     1     1      1

melt(
  dt.wide,
  measure=measure(value.name, int=as.integer),
  na.rm=TRUE)

##      int   num   list
##    <int> <num> <list>
## 1:     1     1      1

Now we do the same operation using tidyr,

names_pattern <- "(.*)_(.*)"
(pivot.tall <- tidyr::pivot_longer(
  dt.wide,
  matches(names_pattern),
  names_pattern=names_pattern,
  names_to=c(".value", "int")))

## # A tibble: 3 × 3
##   int     num list     
##   <chr> <dbl> <list>   
## 1 1         1 <dbl [1]>
## 2 2         2 <NULL>   
## 3 3        NA <dbl [1]>

Note that in pivot.tall the missing list_2 column is represented by NULL. However as we have seen above, R is.na recognizes scalar NA (not NULL) as a missing list element. This results in an inconsistency if we remove rows with any NA:

na.omit(pivot.tall)

## # A tibble: 2 × 3
##   int     num list     
##   <chr> <dbl> <list>   
## 1 1         1 <dbl [1]>
## 2 2         2 <NULL>

tidyr::pivot_longer(
  dt.wide,
  matches(names_pattern),
  names_pattern=names_pattern,
  names_to=c(".value", "int"),
  values_drop_na=TRUE)

## # A tibble: 3 × 3
##   int     num list     
##   <chr> <dbl> <list>   
## 1 1         1 <dbl [1]>
## 2 2         2 <NULL>   
## 3 3        NA <dbl [1]>

The na.omit result incorrectly includes row 2, and the values_drop_na=TRUE result incorrectly includes rows 2 and 3.

Conclusion

When there are list columns, the data.table functions are more correct and consistent in terms of reshape operations and treatment of missing values.

Follow-up

I posted a thread on R-devel about na.omit on lists and data frames with list columns.