Software
Free/open-source software for statistical machine learning and data visualization
My main contributions to free/open-source software are R packages that provide implementations of the methods described in my research papers (see below).
R and statistical software research community
- Program Committee, international useR 2024 conference, Salzburg, Austria, July 8-11.
- Since 2023, I have served as Associate Editor for the journal Stat.
- Since 2023, I have served as co-author for the Omics CRAN Task View.
- Since 2021, I have served as editor for rOpenSci Statistical Software.
- Since 2018, I have served as editor for Journal of Statistical Software.
- Since 2012, I have served as co-administrator and mentor for the R project in Google Summer of Code, helping teach college students all over the world how to write R packages. Because of this work, the R Foundation gave me the toby.hocking@r-project.org email address.
- I was president of the organizing committee for “R in Montreal 2018,” a local conference for useRs and developeRs.
xgboost: Extreme Gradient Boosting
To support our paper about Survival Regression with Accelerated Failure Time Model in XGBoost (Journal of Computational and Graphical Statistics, 2022), we created the AFT objectives in xgboost (Documentation, Video).
mlr3resampling: cross-validation for mlr3 framework in R
mlr3 is a framework for machine learning in R. To support an upcoming research paper about applications of cross-validation, we created the mlr3resampling R package, which provides reference implementations of useful non-standard cross-validation algorithms.
aum: Area Under Min(FP,FN)
To support our JMLR 2023 paper about optimizing the Area Under Min(FP,FN) functions (a differentiable surrogate for ROC-AUC), we created the aum R package, which has a C++ implementation of directional derivatives, and a Python torch function which can be used with automatic differentiation.
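As a rough illustration of the quantity being optimized, the sketch below computes an area under min(FP, FN) in plain Python. It assumes a simplified definition that is not spelled out on this page: a constant c is added to every predicted score, an example is predicted positive when the shifted score is greater than zero, and FP(c) and FN(c) count the resulting false positives and false negatives. The function name aum_sketch is hypothetical, and this is not the C++ or torch code from the aum package.

```python
def aum_sketch(scores, labels):
    """Area under min(FP, FN), viewed as a function of a constant c added to all scores.

    Assumption: example i is predicted positive when scores[i] + c > 0,
    and labels are 0 (negative) or 1 (positive).
    """
    # The prediction for example i flips at c = -scores[i].
    breakpoints = sorted(set(-s for s in scores))
    total = 0.0
    # FP and FN are constant between consecutive breakpoints,
    # so evaluate them at the midpoint of each interval.
    for left, right in zip(breakpoints, breakpoints[1:]):
        mid = (left + right) / 2.0
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s + mid > 0)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s + mid <= 0)
        total += min(fp, fn) * (right - left)
    # Outside the breakpoints either FP or FN is zero, so nothing is added there.
    return total
```

With scores that rank the positive example above the negative one, for instance aum_sketch([-1.0, 1.0], [0, 1]), the area is zero; swapping the two scores gives a positive area. This quadratic-time loop is only meant to show the definition, not an efficient way to compute it.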
SPARSEMODr: SPAtial Resolution-SEnsitive Models of Outbreak Dynamics
To support our paper about infectious disease modeling, we created the SPARSEMODr R package. Preprint medRxiv, Biology Methods & Protocols (2022).
RcppDeepState: fuzz testing compiled code in R packages
To support our R Consortium-funded project about fuzz testing C++ functions in R packages that use Rcpp, we created the RcppDeepState R package and GitHub Action.
LOPART: Labeled Optimal Partitioning
To support our Computational Statistics (2022) paper about Labeled Optimal Partitioning, we created the LOPART R package.
gfpop: Graph-constrained Functional Pruning Optimal Partitioning
To support our paper about graph-constrained optimal changepoint detection, we created the gfpop and gfpopgui R packages. arXiv:2002.03646
PeakSeg: up-down constrained changepoint detection
The PeakSeg R packages contain algorithms for inferring optimal segmentation models subject to the constraint that up changes must be followed by down changes, and vice versa. This ensures that the model can be interpreted in terms of peaks (after up changes) and background (after down changes).
- PeakSegDP provides a heuristic quadratic time algorithm for computing models from 1 to S segments for a single sample. This was the original algorithm described in our ICML’15 paper, but it is neither fast nor optimal, so in practice we recommend using our newer packages below instead.
- PeakSegOptimal provides log-linear time algorithms for computing optimal models with multiple peaks for a single sample, to support our JMLR’20 paper.
- PeakSegDisk provides an on-disk implementation of optimal log-linear algorithms for computing multiple peaks in a single sample (same as PeakSegOptimal but works for much larger data sets because disk is used for storage instead of memory). Journal of Statistical Software Vol. 101, Issue 10.
- PeakSegJoint provides a fast heuristic algorithm for computing models with a single common peak in 0,…,S samples. arXiv:1506.01286.
- PeakSegPipeline provides a supervised machine learning pipeline for genome-wide peak calling in multiple samples and cell types, as described in our PSB’20 paper.
- FLOPART, Functional Labeled Optimal Partitioning, provides a supervised peak detection algorithm with label constraints, as described in our JCGS’24 paper.
- CROCS supports our BMC Bioinformatics (2021) paper, and provides an interface to various peak detection models as well as an implementation of our proposed algorithm, Changepoints for a Range Of ComplexitieS.
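The up-down constraint shared by these packages can be sketched as a check on a sequence of segment means. This is only an illustration in Python, not code from any of the PeakSeg packages. The function names are hypothetical, and the sketch assumes that the first change must go up, so that the model starts in background.

```python
def is_up_down(segment_means):
    """Return True if changes between consecutive segment means alternate up, down, up, ..."""
    expect_up = True  # assumption: the first change is an up change
    for prev, cur in zip(segment_means, segment_means[1:]):
        if expect_up and not cur > prev:
            return False
        if not expect_up and not cur < prev:
            return False
        expect_up = not expect_up
    return True


def peak_segment_indices(segment_means):
    """Indices of segments that follow an up change (peaks), if the constraint holds."""
    if not is_up_down(segment_means):
        raise ValueError("segment means violate the up-down constraint")
    # Under the assumption above, segments 1, 3, 5, ... are peaks.
    return list(range(1, len(segment_means), 2))
```

For example, the means [1, 5, 2, 7, 1] satisfy the constraint with peaks at segments 1 and 3, whereas [1, 5, 6] does not, because an up change is followed by another up change.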
PeakError: label error computation for peak models
To support our Bioinformatics (2017) paper about a labeling method for supervised peak detection, we created the R package PeakError which computes the number of incorrect labels for a given set of predicted peaks.
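A simplified version of this kind of label error count can be sketched in Python as follows. The sketch assumes only two hypothetical label types, "peaks" (at least one predicted peak should overlap the labeled region) and "noPeaks" (no predicted peak should overlap it). The function names and the tuple layout are made up for illustration, and the PeakError package itself may handle more label types and edge cases.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def label_errors(peaks, labels):
    """Count labels that disagree with the predicted peaks.

    peaks: list of (start, end) tuples.
    labels: list of (start, end, annotation) tuples,
            with annotation "peaks" or "noPeaks" (assumed label types).
    """
    errors = 0
    for start, end, annotation in labels:
        n_overlap = sum(
            1 for p_start, p_end in peaks if overlaps(p_start, p_end, start, end)
        )
        if annotation == "noPeaks" and n_overlap > 0:
            errors += 1  # false positive: a peak was predicted where none is labeled
        elif annotation == "peaks" and n_overlap == 0:
            errors += 1  # false negative: no peak was predicted in a labeled peak region
    return errors
```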
clusterpath: convex clustering
To support our ICML’11 paper about the “clusterpath,” a convex formulation of hierarchical clustering, we created the clusterpath R package, available on R-Forge.
rankSVMcompare: support vector machines for ranking and comparing
To support our paper about a Support Vector Machine (SVM) algorithm for ranking and comparing (in preparation, arXiv:1401.8008), we created the rankSVMcompare R package.
animint: animated interactive grammar of graphics
To support our JCGS paper about animated and interactive extensions to the grammar of graphics, and our useR2016 tutorial on interactive graphics, we created the animint R package. The more recent version is animint2.
fpop: functional pruning optimal partitioning
To support our Statistics and Computing (2016) paper about a functional pruning optimal partitioning algorithm, we created the fpop R package.
mmit: max margin interval trees
To support our NeurIPS’17 paper about max margin interval trees, we created the mmit R package and Python module.
penaltyLearning: supervised changepoint detection
To support our ICML’13 paper, useR2017 tutorial, and JCGS’21 paper about learning penalty functions for changepoint detection, we created the penaltyLearning R package.
iregnet: elastic net regularized interval regression
To support our paper about elastic net regularized interval regression models (in preparation), we created the iregnet R package.
binsegRcpp: binary segmentation
To provide an efficient baseline implementation of binary segmentation for various papers, such as Labeled Optimal Partitioning and Linear time model selection, we created the binsegRcpp R package.
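For readers unfamiliar with binary segmentation, the following Python sketch shows the greedy idea for a change in mean with the square loss: repeatedly split the segment whose best split gives the largest loss decrease. It is a deliberately simple illustration that recomputes candidate splits at every step, so it does not reflect the efficient implementation in binsegRcpp. The function names are hypothetical.

```python
def best_split(data, start, end):
    """Best single changepoint in data[start:end].

    Returns (loss_decrease, split_index), or None if the segment is too short.
    """
    n = end - start
    if n < 2:
        return None
    total = sum(data[start:end])
    best = None
    left_sum = 0.0
    for i in range(start + 1, end):
        left_sum += data[i - 1]
        n_left = i - start
        n_right = end - i
        right_sum = total - left_sum
        # Decrease in square loss when data[start:end] is split at i.
        decrease = left_sum**2 / n_left + right_sum**2 / n_right - total**2 / n
        if best is None or decrease > best[0]:
            best = (decrease, i)
    return best


def binary_segmentation(data, max_segments):
    """Greedy changepoint detection; returns sorted changepoint indices."""
    segments = [(0, len(data))]
    changepoints = []
    while len(segments) < max_segments:
        candidates = []
        for seg in segments:
            split = best_split(data, seg[0], seg[1])
            if split is not None:
                candidates.append((split[0], split[1], seg))
        if not candidates:
            break  # no segment can be split further
        decrease, index, seg = max(candidates)
        segments.remove(seg)
        segments += [(seg[0], index), (index, seg[1])]
        changepoints.append(index)
    return sorted(changepoints)
```

A changepoint index i here means that a new segment starts at data[i]. For the data [0, 0, 0, 10, 10, 10, 0, 0, 0] with three segments, the sketch places changepoints at indices 3 and 6.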
directlabels: automatic label placement on figures
To support my poster “Adding direct labels to plots” which won Best Student Poster at useR 2011, we created the directlabels R package.
inlinedocs: documentation generation
To support our Journal of Statistical Software (2013) paper about documentation generation for R, we created the inlinedocs R package.
namedCapture: regular expressions for text parsing
To support our R Journal paper about R packages for regular expressions, we created the namedCapture R package, and provided various contributions to base R:
- We wrote a C code patch for regexpr/gregexpr which implements named capture regular expressions. Brian Ripley merged the patch into R (since version 2.14 in 2011).
- We sent R-devel a bug report for substring and a patch for gregexpr. Tomas Kalibera merged the fixes into R (since version 3.6 in 2019).
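The idea of named capture groups is not specific to R. As a small illustration, the Python sketch below uses the standard re module, not the namedCapture package or the regexpr/gregexpr patch described above. The pattern, the example text, and the function name parse_regions are made up for this example.

```python
import re

# Each (?P<name>...) group gives a name to the text it captures.
REGION_PATTERN = re.compile(
    r"(?P<chrom>chr[0-9XY]+):(?P<start>[0-9]+)-(?P<end>[0-9]+)"
)


def parse_regions(text):
    """Return one dict per match, keyed by capture group name."""
    return [match.groupdict() for match in REGION_PATTERN.finditer(text)]
```

Calling parse_regions("chr10:213054000-213055000 and chrX:5-10") returns two dictionaries with keys chrom, start, and end. The captured values are strings, so a conversion step is needed to use start and end as numbers.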
nc: named capture regular expressions for text parsing and data reshaping
To support our R Journal paper about data reshaping using regular expressions, we created the nc R package. To get a more efficient and fully-featured implementation of data reshaping, we contributed C code and the new measure function to the data.table package (since version 1.14.1 in 2021).
Python pandas str.extractall method for regular expressions
I wrote the str.extractall method for regular expressions, which was merged into pandas in 2016.
data.table
To support my NSF POSE funded project (2023-2025) about expanding the open-source community around R data.table, I provided several important contributions, such as the measure() function, and was recognized as co-author (committer).
atime: Asymptotic Time and Memory Complexity
To support our paper about testing and improving software for large data, using asymptotic time and memory measurement, we created the atime R package. This work led to the following improvements in base R, since version 4.3.0 in 2023:
- write.csv is linear in the number of columns (it was quadratic before I wrote an email to R-devel, which prompted Gabriel Becker to write a patch that was committed by Sebastian Meyer).
- Replacing columns, df[j]<-value, for a data frame df with a large number of columns is now linear, thanks to a fix committed by Sebastian Meyer (before my report, it was log-quadratic in number of columns to replace).
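The general approach behind this kind of finding, timing the same operation over a sequence of increasing data sizes and comparing how the time grows, can be sketched in Python as follows. This is not the atime API, and the names measure, quadratic_concat, and linear_append are hypothetical. The two example functions only stand in for a slow and a fast implementation of the same task.

```python
import time


def measure(func, sizes, repeats=3):
    """Return a list of (size, best_seconds) pairs for func called with each size."""
    results = []
    for n in sizes:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            func(n)
            best = min(best, time.perf_counter() - t0)
        results.append((n, best))
    return results


def quadratic_concat(n):
    """Build a list by copying it at every step: quadratic work overall."""
    out = []
    for i in range(n):
        out = out + [i]  # creates a new list each iteration
    return out


def linear_append(n):
    """Build the same list in place: linear work overall."""
    out = []
    for i in range(n):
        out.append(i)
    return out
```

Running measure on both functions with sizes such as 1000, 2000, 4000, and 8000 and plotting time against size on log-log axes would typically show different slopes for the two functions, which is the kind of evidence that motivates a report like the ones above. Actual timings depend on the machine, so none are shown here.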