Toby Dylan Hocking | Combining data tables in R

Say you have a list of data tables, and you want to combine them. Of course, to combine data tables in R, we use rbind. But how?

I had two different students this week try the following, with rbind inside the for loop:

> system.time({
+   inside.dt <- NULL
+   for(i in 1:N){
+     inside.dt <- rbind(inside.dt, data.table(i, list.of.data[[i]]))
+   }
+ })
   user  system elapsed 
  6.936   2.804   9.767 
> 

The code above is relatively slow because it is quadratic O(N^2) in the list size N. For each i from 1 to N, we must allocate a new data table of size O(i), and this is done N times. The quadratic time complexity comes from the fact that the size of the data table allocation gets bigger with each iteration of the for loop.

In contrast, consider the code below, which allocates a data table of the same (constant) size in each iteration of the for loop.

> system.time({
+   outside.dt.list <- list()
+   for(i in 1:N){
+     outside.dt.list[[i]] <- data.table(i, list.of.data[[i]])
+   }
+   outside.dt <- do.call(rbind, outside.dt.list)
+ })
   user  system elapsed 
   0.38    0.06    0.44 
> 

I call this the “outside” version because the rbind is outside (after) the for loop. This timing is much faster, because it is linear time complexity. The combination step at the end do.call(rbind, outside.dt.list) is a linear time operation because it only needs to allocate one data table of the size of the result.

Both get the same result, as shown below.

> identical(inside.dt, outside.dt)
[1] TRUE
> 

I used this R script to compute timings of these two algorithms for combining lists of various sizes (10 data.tables, 20, …, 100). The timings are plotted in the figure below, which clearly shows the quadratic time complexity of the inside version:

inside is quadratic

Conclusion: use the version with rbind outside the for loop: do.call(rbind, list.of.data.tables) is fast because it is a linear time operation!