Yesterday I was reading the source code for the scikit-learn function that implements Hierachical clustering with Ward linkage, _agglomerative.py. The python function ward_tree() uses functions heapify(), heappop(), heappush(). The heapify docs say that it converts a python list (~ C++ STL vector) to a min-heap. Is there an equivalent in R?

In theory

The basic data structures in R are “vectors” which are in fact more like C++ STL arrays (fixed size). So in R we could have a large heap that reduces in size (with some unused elements). If we had a small heap, and we wanted to increase its size, we could use the same trick that python uses to enable its “lists” to support the append operation (allocate a new array double the size and have some unused elements).

Software

In C++ STL there is make_heap() which works on any random access iterator (vector or array). The heap is also used in the priority_queue data structure which I have used to implement efficient binary segmentation in binsegRcpp/src/binseg.cpp.

Of course we can call C++ from R or Python. But can we call an R function like make_heap() in C++ or heapify() in python? There was a datastructures CRAN package, now only on github:

remotes::install_github("dirmeier/datastructures")
## Using GitHub PAT from the git credential store.
## Skipping install of 'datastructures' from a github remote, the SHA1 (c69faa25) has not changed since last install.
##   Use `force = TRUE` to force installation

For the application to hierarchical clustering, we want efficient minimization of numeric distances, so we create a heap with numeric keys below.

num_heap <- datastructures::fibonacci_heap("numeric")
datastructures::size(num_heap)
## [1] 0

Above we see that it starts empty (size 0). Below we insert five simulated distances as keys.

set.seed(1)
(dist.vec <- rnorm(5))
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
datastructures::insert(num_heap, dist.vec, 1:5)
## An object of class fibonacci_heap<numeric, SEXP>
## 
## Peek: -0.835629 -> integer, ...
datastructures::size(num_heap)
## [1] 5

Above we see the size has increased to 5. Below we pop one item from the heap.

datastructures::pop(num_heap)
## $`-0.835629`
## [1] 3
datastructures::size(num_heap)
## [1] 4

Above we see that it popped the smallest distance, and the new size is 4. Below we adapt these methods to an atime performance comparison.

library(data.table)
a_result <- atime::atime(
  setup={
    set.seed(1)
    keys <- rnorm(N)
    values <- 1:N
  },
  data.frame={
    df <- data.frame(keys, values)
    out <- integer(N)
    for(i in 1:N){
      min.row <- which.min(df$keys)
      df$keys[min.row] <- NA
      out[[i]] <- df$values[min.row]
    }
    out
  },
  seconds=1,
  result=TRUE,
  heap={
    num_heap <- datastructures::fibonacci_heap("numeric")
    datastructures::insert(num_heap, keys, values)
    out <- integer(N)
    for(i in 1:N){
      out[[i]] <- datastructures::pop(num_heap)[[1]]
    }
    out
  })
plot(a_result)
## Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.

plot of chunk atime-result

Above we see that

  • for small data sizes, the linear search with a data frame is faster than the heap pop.
  • for large data sizes, the heap pop is faster than the linear search with the data frame.
  • the asymptotic time and memory usage (slope on the right) is smaller for heap, which indicates a smaller big-O complexity class.

Below we plot reference lines to estimate the big-O classes.

a_refs <- atime::references_best(a_result)
plot(a_refs)
## Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced infinite values.

plot of chunk atime-refs

The plot above shows that

  • data.frame is quadratic O(N^2) time and memory.
  • heap is log-linear O(N log N) time and linear O(N) memory.

These are the expected results from using a heap versus a vector.

Conclusions

We have shown that there is an R package datastructures which implements the heap, and has expected asymptotic time/space complexity.

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    
## 
## time zone: America/Toronto
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  utils     datasets  grDevices methods   base     
## 
## other attached packages:
## [1] data.table_1.17.8
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6           dplyr_1.1.4            compiler_4.4.1         crayon_1.5.3           tidyselect_1.2.1       Rcpp_1.1.0            
##  [7] datastructures_0.2.9   scales_1.4.0           directlabels_2025.6.24 lattice_0.22-7         ggplot2_4.0.0          R6_2.6.1              
## [13] generics_0.1.4         curl_7.0.0             knitr_1.50             tibble_3.3.0           atime_2025.5.24        pillar_1.11.1         
## [19] RColorBrewer_1.1-3     rlang_1.1.6            xfun_0.53              quadprog_1.5-8         S7_0.2.0               cli_3.6.5             
## [25] withr_3.0.2            magrittr_2.0.4         grid_4.4.1             remotes_2.5.0          lifecycle_1.0.4        vctrs_0.6.5           
## [31] bench_1.1.4            evaluate_1.0.5         glue_1.8.0             farver_2.1.2           codetools_0.2-20       profmem_0.7.0         
## [37] purrr_1.1.0            tools_4.4.1            pkgconfig_2.0.3