My main contributions to free/open-source software are R packages that provide implementations of the methods described in my research papers (see below).
R and statistical software research community
- Since 2021, I have been an editor for rOpenSci Statistical Software.
- Since 2018, I have been an editor for Journal of Statistical Software.
- Since 2012, I have been co-administrator and mentor for the R project in Google Summer of Code, helping to teach college students all over the world how to write R packages. Because of this work, the R Foundation gave me the email@example.com email address.
- I was president of the organizing committee for “R in Montreal 2018,” a local conference for useRs and developeRs.
xgboost: Extreme Gradient Boosting
To support our paper about Survival Regression with Accelerated Failure Time Model in XGBoost (Journal of Computational and Graphical Statistics, 2022), we created the AFT objectives in xgboost (Documentation, Video).
aum: Area Under Min(FP,FN)
To support our paper (in progress) about optimizing the Area Under Min(FP,FN) function (a differentiable surrogate for ROC-AUC), we created the aum R package, which has a C++ implementation of directional derivatives, and a Python torch function which can be used with automatic differentiation.
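As a rough illustration of the quantity named above, the following pure-Python sketch computes the area under min(FP,FN) for binary classification. It is not the code from the aum package: the function name `aum_binary`, the 0/1 label encoding, and the convention that an example is predicted positive when its score plus a constant c is greater than zero are all assumptions made for this example.

```python
def aum_binary(scores, labels):
    """Area under min(FP, FN), as a function of a constant c added to scores.

    Assumed convention (for illustration only): labels are 0 or 1, and
    example i is predicted positive when scores[i] + c > 0.
    """
    # Example i flips from predicted negative to predicted positive
    # when c passes the breakpoint -scores[i].
    pairs = sorted((-s, y) for s, y in zip(scores, labels))
    fp = 0                                        # false positives for very small c
    fn = sum(1 for y in labels if y == 1)         # false negatives for very small c
    total = 0.0
    for (b, y), (b_next, _) in zip(pairs, pairs[1:]):
        if y == 1:
            fn -= 1   # a positive example is now predicted positive
        else:
            fp += 1   # a negative example is now predicted positive
        # min(FP, FN) is constant on the interval (b, b_next).
        total += min(fp, fn) * (b_next - b)
    return total


perfect = aum_binary([2.0, -1.0], [1, 0])   # positive example ranked above negative
wrong = aum_binary([-1.0, 2.0], [1, 0])     # positive example ranked below negative
```

With these inputs, `perfect` is 0.0 because some constant separates the two classes without error, whereas `wrong` is 3.0 because min(FP,FN) equals 1 over an interval of width 3.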
SPARSEMODr: SPAtial Resolution-SEnsitive Models of Outbreak Dynamics
RcppDeepState: fuzz testing compiled code in R packages
LOPART: Labeled Optimal Partitioning
gfpop: Graph-constrained Functional Pruning Optimal Partitioning
PeakSeg: up-down constrained changepoint detection
The PeakSeg R packages contain algorithms for inferring optimal segmentation models subject to the constraint that up changes must be followed by down changes, and vice versa. This ensures that the model can be interpreted in terms of peaks (after up changes) and background (after down changes).
- PeakSegDP provides a heuristic quadratic time algorithm for computing models from 1 to S segments for a single sample. This was the original algorithm described in our ICML’15 paper, but it is neither fast nor optimal, so in practice we recommend using our newer packages below instead.
- PeakSegOptimal provides log-linear time algorithms for computing optimal models with multiple peaks for a single sample, to support our JMLR’20 paper.
- PeakSegDisk provides an on-disk implementation of optimal log-linear algorithms for computing multiple peaks in a single sample (same as PeakSegOptimal, but works for much larger data sets because disk is used for storage instead of memory), as described in our paper in Journal of Statistical Software, Vol. 101, Issue 10.
- PeakSegJoint provides a fast heuristic algorithm for computing models with a single common peak in 0,…,S samples. arXiv:1506.01286.
- PeakSegPipeline provides a supervised machine learning pipeline for genome-wide peak calling in multiple samples and cell types, as described in our PSB’20 paper.
- FLOPART, Functional Labeled Optimal Partitioning, provides a supervised peak detection algorithm with label constraints. (paper in progress)
- CROCS supports our BMC Bioinformatics (2021) paper, and provides an interface to various peak detection models as well as an implementation of our proposed algorithm, Changepoints for a Range Of ComplexitieS.
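The up-down constraint shared by the PeakSeg packages can be illustrated with a small Python sketch that checks whether a sequence of segment means alternates between up and down changes, and reports which segments would be interpreted as peaks. This is only a check on a given model, not one of the segmentation algorithms above; the function names are hypothetical and do not come from any of the packages.

```python
def change_directions(segment_means):
    """Return +1 for each up change, -1 for each down change, 0 for no change."""
    return [(b > a) - (b < a) for a, b in zip(segment_means, segment_means[1:])]


def satisfies_up_down(segment_means):
    """True if every change is nonzero and consecutive changes alternate."""
    dirs = change_directions(segment_means)
    if any(d == 0 for d in dirs):
        return False
    return all(d1 == -d2 for d1, d2 in zip(dirs, dirs[1:]))


def peak_segments(segment_means):
    """Indices (0-based) of segments that follow an up change, i.e. peaks."""
    dirs = change_directions(segment_means)
    return [i + 1 for i, d in enumerate(dirs) if d == 1]


alternating = [1.0, 5.0, 2.0, 7.0, 1.0]   # up, down, up, down
not_alternating = [1.0, 3.0, 5.0, 2.0]    # up, up, down

ok = satisfies_up_down(alternating)
bad = satisfies_up_down(not_alternating)
peaks = peak_segments(alternating)
```

Here `ok` is True, `bad` is False, and `peaks` is `[1, 3]`: the second and fourth segments follow up changes, so they are the peaks, and the remaining segments are background.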
PeakError: label error computation for peak models
To support our Bioinformatics (2017) paper about a labeling method for supervised peak detection, we created the R package PeakError which computes the number of incorrect labels for a given set of predicted peaks.
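The idea of counting incorrect labels can be sketched in Python as follows. This is a simplified illustration and not the PeakError implementation: it assumes only two hypothetical label types, "peaks" (at least one predicted peak should overlap the labeled region) and "noPeaks" (no predicted peak should overlap it), and the function names and tuple layout are invented for this example.

```python
def overlaps(peak, region):
    """True if two half-open intervals (start, end) share at least one position."""
    peak_start, peak_end = peak
    region_start, region_end = region
    return peak_start < region_end and region_start < peak_end


def count_label_errors(peaks, labels):
    """Count false positives and false negatives of peaks with respect to labels.

    peaks:  list of (start, end) predicted peaks.
    labels: list of (start, end, annotation), annotation in {"peaks", "noPeaks"}
            (an assumed, simplified label vocabulary).
    """
    fp = fn = 0
    for start, end, annotation in labels:
        n_overlap = sum(overlaps(p, (start, end)) for p in peaks)
        if annotation == "noPeaks" and n_overlap > 0:
            fp += 1   # a peak was predicted where none should be
        elif annotation == "peaks" and n_overlap == 0:
            fn += 1   # no peak was predicted where one should be
    return {"fp": fp, "fn": fn, "errors": fp + fn}


predicted = [(10, 20), (50, 60)]
labels = [
    (5, 25, "peaks"),     # correct: overlaps the first peak
    (30, 40, "noPeaks"),  # correct: no overlapping peak
    (55, 70, "noPeaks"),  # false positive: overlaps the second peak
    (80, 90, "peaks"),    # false negative: no overlapping peak
]
result = count_label_errors(predicted, labels)
```

For this input, `result` is `{"fp": 1, "fn": 1, "errors": 2}`.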
clusterpath: convex clustering
rankSVMcompare: support vector machines for ranking and comparing
animint: animated interactive grammar of graphics
To support our JCGS paper about animated and interactive extensions to the grammar of graphics, and our useR2016 tutorial on interactive graphics, we created the animint R package. The more recent version is animint2.
fpop: functional pruning optimal partitioning
mmit: max margin interval trees
penaltyLearning: supervised changepoint detection
iregnet: elastic net regularized interval regression
To support our paper about elastic net regularized interval regression models (in preparation), we created the iregnet R package.
binsegRcpp: binary segmentation
directlabels: automatic label placement on figures
inlinedocs: documentation generation
namedCapture: regular expressions for text parsing
- We wrote a C code patch for regexpr/gregexpr which implements named capture regular expressions. Brian Ripley merged the patch into R (since version 2.14 in 2011).
- We sent R-devel a bug report for substring and a patch for gregexpr. Tomas Kalibera merged the fixes into R (since version 3.6 in 2019).
nc: named capture regular expressions for text parsing and data reshaping
To support our R Journal paper about data reshaping using regular expressions, we created the nc R package. To get a more efficient and fully-featured implementation of data reshaping, we contributed C code and the new measure function to the data.table package (since version 1.14.1 in 2021).
Python pandas str.extractall method for regular expressions
I wrote the str.extractall method for regular expressions, which was merged into pandas in 2016.
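A minimal usage sketch of `str.extractall` with named capture groups is shown below. The input strings and the pattern are made up for illustration.

```python
import pandas as pd

# Hypothetical input: each subject may contain zero, one, or several matches.
subjects = pd.Series([
    "chr1:100-200 chr2:300-400",
    "no match here",
    "chrX:5-10",
])

# Named groups become the column names of the result.
pattern = r"(?P<chrom>chr[0-9XY]+):(?P<start>\d+)-(?P<end>\d+)"

# One row per match; the row index has two levels: the position of the
# subject in the original Series, and the match number within that subject.
matches = subjects.str.extractall(pattern)
```

The result has three rows (two matches from the first subject, none from the second, one from the third) and three string columns named `chrom`, `start`, and `end`.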