The importance of hyper-parameter tuning
The goal of this blog post is to compare several machine learning algorithms, and to demonstrate the importance of learning model complexity hyper-parameters, if the goal is to minimize prediction error (and that is always the goal in ML).
Theoretical discussion and visualizations
What is model complexity and why does it need to be controlled? Model complexity is the ability of a learning algorithm to fit complex patterns which are encoded in the train set. There is a different measure of model complexity for every learning algorithm:
- number of neighbors in K-nearest-neighbors,
- degree of L1 regularization in LASSO (as implemented in glmnet package in R),
- degree of polynomial in linear model basis expansion,
- cost parameter in support vector machines,
- number of iterations and learning rate in boosting (as implemented in xgboost package in R),
- number of gradient descent iterations (epochs), learning rate, number of hidden units, number of layers, etc, in neural networks.
For each machine learning algorithm, model complexity must be controlled by selecting a good value for the corresponding hyper-parameter. I created a data visualization which shows how this works using nearest neighbors (need to learn the number of neighbors) and linear models (need to learn the degree of the polynomial basis expansion), for several regression problems.
- There are four different simulated data sets, each with a pattern of different complexity (constant, linear, quadratic, cubic).
- The linear model polynomial degree varies from 0 (constant prediction function, most regularized, least complex) to 9 (very wiggly prediction function that perfectly interpolates every train data point, least regularized, most complex).
- The number of neighbors varies from 1 (very wiggly prediction function that perfectly interpolates every train data point, least regularized, most complex) to 10 (constant prediction function, most regularized, least complex).
- These two learning algorithms are interesting examples to compare, because the least complex model (in both cases) always predicts the average of the train labels. That turns out to be the best prediction function, in the case of the constant data/pattern.
- But for the other data/patterns, a more complex model (less regularization) must be used to get optimal prediction accuracy. For example, the hyper-parameter values which give minimal test error for the linear data/pattern are 3 nearest neighbors, and polynomial degree 1 (see the sketch after this list for a concrete illustration).
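To make the under-fitting/over-fitting trade-off concrete, here is a minimal sketch in base R (independent of the figure described above, with made-up variable names), which compares test error for polynomial degrees 0, 1, and 9 on a simulated linear pattern:
set.seed(1)
sim <- data.frame(x=runif(60, -1, 1))
sim$y <- sim$x + rnorm(60, sd=0.3) # linear pattern plus noise.
train <- sim[1:30,]  # first half used for fitting,
test <- sim[31:60,]  # second half used for evaluation.
test.mse <- function(degree){
  fit <- if(degree==0){
    lm(y ~ 1, data=train) # constant prediction, most regularized.
  }else{
    lm(y ~ poly(x, degree), data=train)
  }
  mean((test$y - predict(fit, test))^2)
}
sapply(c(degree0=0, degree1=1, degree9=9), test.mse)
For this linear pattern, degree 1 should typically give the smallest test error, with degree 0 under-fitting and degree 9 over-fitting.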
The goal of this blog is to demonstrate how to use the mlr3 auto tuner to properly learn such hyper-parameters, and to show that this approach is more accurate than using the defaults (which may be OK for some data sets, but sub-optimal for others).
Simulation
Below we define some simulated data, for five different regression problems, of varying complexity.
max.x <- 12
min.x <- -max.x
fun.list <- list(
  constant=function(x)1,
  linear=function(x)x/3,
  quadratic=function(x)x^2/max.x-5,
  sin=function(x)4*sin(x),
  step=function(x)ifelse(x<0, 4, -4))
N <- 200
set.seed(1)
input.vec <- runif(N, min.x, max.x)
library(data.table)
task.list <- list()
sim.dt.list <- list()
for(fun.name in names(fun.list)){
  f <- fun.list[[fun.name]]
  true.vec <- f(input.vec)
  task.dt <- data.table(
    input=input.vec,
    output=true.vec+rnorm(N,sd=2))
  task.list[[fun.name]] <- mlr3::TaskRegr$new(
    fun.name, task.dt, target="output")
  sim.dt.list[[fun.name]] <- data.table(fun.name, task.dt)
}
(sim.dt <- rbindlist(sim.dt.list))
## fun.name input output
## <char> <num> <num>
## 1: constant -5.627792 -0.2407334
## 2: constant -3.069026 1.0842317
## 3: constant 1.748481 -0.8218433
## 4: constant 9.796987 1.3160575
## 5: constant -7.159634 -0.3091693
## ---
## 996: step 2.173756 -8.1816922
## 997: step -9.351345 7.3947878
## 998: step 8.172169 -1.8722377
## 999: step -4.368872 2.4667667
## 1000: step 6.788432 -3.2359849
library(ggplot2)
ggplot()+
  geom_point(aes(
    input, output),
    data=sim.dt)+
  facet_grid(. ~ fun.name, labeller=label_both)
The plot above visualizes the simulated data. The goal of each learning algorithm is to learn the pattern in each of the five different data sets. To do that, we define a few different learning algorithms in the next section.
Learning algorithm definitions
Perhaps the simplest learning algorithm is nearest neighbors, which can be implemented in the mlr3 framework using the code below.
nn.default <- mlr3learners::LearnerRegrKKNN$new()
nn.default$param_set
## <ParamSet>
## id class lower upper nlevels default value
## <char> <char> <num> <num> <num> <list> <list>
## 1: k ParamInt 1 Inf Inf 7 7
## 2: distance ParamDbl 0 Inf Inf 2
## 3: kernel ParamFct NA NA 10 optimal
## 4: scale ParamLgl NA NA 2 TRUE
## 5: ykernel ParamUty NA NA Inf
## 6: store_model ParamLgl NA NA 2 FALSE
The code above defines the default nearest neighbors regressor, which has a default value of k=7 (that is the number of neighbors). As discussed above, the number of neighbors is an important model complexity hyper-parameter, which must be tuned for optimal prediction accuracy. In other words, the default value 7 may provide sub-optimal prediction accuracy, depending on the data set. Another option is to manually select a different value for the number of neighbors, which we do in the code below,
nn1 <- mlr3learners::LearnerRegrKKNN$new()
nn1$id <- "1 nearest neighbor"
nn1$param_set$values$k <- 1
The code above defines another nearest neighbor regressor, which uses 1 nearest neighbor, instead of the default 7. That may be better for some data sets, but in general, we would like to choose a different number of neighbors for each data set. We can do that using cross-validation, by dividing the train set into subtrain/validation sets, where the validation set is used to determine the optimal number of neighbors. The code below defines 5-fold cross-validation as the method to use,
subtrain.valid.cv <- mlr3::ResamplingCV$new()
subtrain.valid.cv$param_set$values$folds <- 5
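If you want to see what this resampling does, an optional check (not needed for the rest of the post; demo.cv is just a throwaway name) is to instantiate a clone on one of the simulated tasks, and look at the size of the subtrain and validation sets:
demo.cv <- subtrain.valid.cv$clone()
demo.cv$instantiate(task.list$linear)
length(demo.cv$train_set(1)) # subtrain rows in fold 1 (should be 160 of 200).
length(demo.cv$test_set(1))  # validation rows in fold 1 (should be 40 of 200).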
Next, in the code below, we define a new nearest neighbor regressor, and tell mlr3 to tune the number of neighbors.
knn.learner <- mlr3learners::LearnerRegrKKNN$new()
knn.learner$param_set$values$k <- paradox::to_tune(1, 20)
To convert that to a learning algorithm that can be used in a benchmark experiment, we need to use it as the learner in an auto_tuner, as in the code below,
knn.tuned = mlr3tuning::auto_tuner(
  tuner = mlr3tuning::TunerGridSearch$new(),
  learner = knn.learner,
  resampling = subtrain.valid.cv,
  measure = mlr3::msr("regr.mse"))
The code above says that we want to tune the learner using grid search, 5-fold CV, and select the model which minimizes MSE on the validation set.
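As a quick sanity check (optional, and not part of the benchmark below, which re-trains each learner from scratch on each train set), we could train this auto tuner on one of the simulated tasks, and inspect which number of neighbors was selected:
knn.tuned$train(task.list$sin) # runs the grid search, prints tuning log messages.
knn.tuned$tuning_result$k # selected number of neighbors for this task.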
Another non-linear regression algorithm is boosting, which can be implemented in mlr3 using the code below. Again there are several hyper-parameters which control model complexity, and in the code below we tune two of them (eta=step size, nrounds=number of boosting iterations). Note that some parameters should be tuned on the log scale (like eta with log=TRUE in the code below), and others can be tuned on the linear scale (like nrounds). Finally note in the code below that we use a grid search with resolution 5 in order to save time (default is 10, so with 2 hyper-parameters that would make 100 combinations; exercise for the reader to try that instead).
xgboost.learner <- mlr3learners::LearnerRegrXgboost$new()
xgboost.learner$param_set$values$eta <- paradox::to_tune(0.001, 1, log=TRUE)
xgboost.learner$param_set$values$nrounds <- paradox::to_tune(1, 100)
grid.search.5 <- mlr3tuning::TunerGridSearch$new()
grid.search.5$param_set$values$resolution <- 5
xgboost.tuned = mlr3tuning::auto_tuner(
  tuner = grid.search.5,
  learner = xgboost.learner,
  resampling = subtrain.valid.cv,
  measure = mlr3::msr("regr.mse"))
You might wonder about the code above: how did I know to tune eta from 0.001 to 1, and nrounds from 1 to 100? What are good min/max values for each hyper-parameter? mlr3tuningspaces is a useful reference with data describing typical values for each hyper-parameter. In fact, if you want to use the hyper-parameter ranges defined in that package, you can use code like that below (which uses the ranger package for random forest).
ranger.tuned = mlr3tuning::auto_tuner(
  tuner = grid.search.5,
  learner = mlr3tuningspaces::lts(mlr3learners::LearnerRegrRanger$new()),
  resampling = subtrain.valid.cv,
  measure = mlr3::msr("regr.mse"))
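If you are curious which ranges that package defines for ranger, one way to inspect them (a quick sketch; lts.ranger is just a name for this check) is to look at the tuning tokens that lts() sets in the learner's parameter set:
lts.ranger <- mlr3tuningspaces::lts(mlr3learners::LearnerRegrRanger$new())
lts.ranger$param_set$values # each to_tune() token shows the range to search.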
Define benchmark experiment
After having defined the learning algorithms in the previous section, we combine them in the list below.
learner.list <- list(
  mlr3::LearnerRegrFeatureless$new(),
  mlr3learners::LearnerRegrRanger$new(), ranger.tuned,
  mlr3learners::LearnerRegrXgboost$new(), xgboost.tuned,
  knn.tuned, nn1, nn.default)
To see which of the learning algorithms is most accurate in each data set, we will use 3-fold cross-validation, as defined in the code below.
train.test.cv <- mlr3::ResamplingCV$new()
train.test.cv$param_set$values$folds <- 3
In the code below, we define a benchmark grid, which combines tasks (data sets), with learners (algorithms), and resamplings (actually just one, 3-fold CV).
(bench.grid <- mlr3::benchmark_grid(
  tasks=task.list,
  learners=learner.list,
  resamplings=train.test.cv))
## task learner resampling
## <char> <char> <char>
## 1: constant regr.featureless cv
## 2: constant regr.ranger cv
## 3: constant regr.ranger.tuned cv
## 4: constant regr.xgboost cv
## 5: constant regr.xgboost.tuned cv
## 6: constant regr.kknn.tuned cv
## 7: constant 1 nearest neighbor cv
## 8: constant regr.kknn cv
## 9: linear regr.featureless cv
## 10: linear regr.ranger cv
## 11: linear regr.ranger.tuned cv
## 12: linear regr.xgboost cv
## 13: linear regr.xgboost.tuned cv
## 14: linear regr.kknn.tuned cv
## 15: linear 1 nearest neighbor cv
## 16: linear regr.kknn cv
## 17: quadratic regr.featureless cv
## 18: quadratic regr.ranger cv
## 19: quadratic regr.ranger.tuned cv
## 20: quadratic regr.xgboost cv
## 21: quadratic regr.xgboost.tuned cv
## 22: quadratic regr.kknn.tuned cv
## 23: quadratic 1 nearest neighbor cv
## 24: quadratic regr.kknn cv
## 25: sin regr.featureless cv
## 26: sin regr.ranger cv
## 27: sin regr.ranger.tuned cv
## 28: sin regr.xgboost cv
## 29: sin regr.xgboost.tuned cv
## 30: sin regr.kknn.tuned cv
## 31: sin 1 nearest neighbor cv
## 32: sin regr.kknn cv
## 33: step regr.featureless cv
## 34: step regr.ranger cv
## 35: step regr.ranger.tuned cv
## 36: step regr.xgboost cv
## 37: step regr.xgboost.tuned cv
## 38: step regr.kknn.tuned cv
## 39: step 1 nearest neighbor cv
## 40: step regr.kknn cv
## task learner resampling
In the code below we tell the mlr3 logger to suppress most messages, then we run the benchmark experiment. Actually, if the results have already been computed and saved in the cache RData file, we just read that. Otherwise, you should run the code below interactively, either on your own computer, or on a SLURM super-computer cluster. There are only 120 iterations in this case, but if you run them on a super-computer cluster, the experiment could potentially run 120x faster. It is better to use a super-computer for larger experiments, with more data sets, learning algorithms, and cross-validation folds. For example, I have recently computed some machine learning experiments with over 1000 benchmark iterations, each of which can be computed in parallel (and very quickly using NAU Monsoon, which has 4000 CPUs). See mlr3batchmark for computing these iterations in parallel using the batchtools R package, which supports backends such as SLURM (that is how things are parallelized on super-computers such as NAU Monsoon). See my R batchtools on Monsoon tutorial for more info.
lgr::get_logger("mlr3")$set_threshold("warn")
cache.RData <- "2024-01-29-hyper-parameter-tuning.RData"
if(file.exists(cache.RData)){
  load(cache.RData)
}else{#code below should be run interactively.
  ## on.cluster should be defined by the user before running this block:
  ## TRUE to compute via batchtools on SLURM, FALSE to compute locally.
  if(on.cluster){
    reg.dir <- "2024-01-29-hyper-parameter-tuning-registry"
    unlink(reg.dir, recursive=TRUE)
    reg = batchtools::makeExperimentRegistry(
      file.dir = reg.dir,
      seed = 1,
      packages = "mlr3verse"
    )
    mlr3batchmark::batchmark(
      bench.grid, store_models = TRUE, reg=reg)
    job.table <- batchtools::getJobTable(reg=reg)
    chunks <- data.frame(job.table, chunk=1)
    batchtools::submitJobs(chunks, resources=list(
      walltime = 60*60,#seconds
      memory = 2000,#megabytes per cpu
      ncpus=1, #>1 for multicore/parallel jobs.
      ntasks=1, #>1 for MPI jobs.
      chunks.as.arrayjobs=TRUE), reg=reg)
    batchtools::getStatus(reg=reg)
    jobs.after <- batchtools::getJobTable(reg=reg)
    table(jobs.after$error)
    ids <- jobs.after[is.na(error), job.id]
    bench.result <- mlr3batchmark::reduceResultsBatchmark(ids, reg = reg)
  }else{
    ## In the code below, we declare a multisession future plan to
    ## compute each benchmark iteration in parallel on this computer
    ## (data set, learning algorithm, cross-validation fold). For a
    ## few dozen iterations, using the multisession backend is
    ## probably sufficient (I have 12 CPUs on my work PC).
    if(require(future))plan("multisession")
    bench.result <- mlr3::benchmark(
      bench.grid, store_models = TRUE)
  }
  save(bench.result, file=cache.RData)
}
If you are following along, now is the time to grab a cup of tea or coffee.
Comparing tuned to default/fixed hyper-parameters
After having computed the benchmark result in the previous section, we use the score method below to compute a data table with columns for train time and mean squared test error.
bench.score <- bench.result$score(mlr3::msrs(c("time_train","regr.mse")))
bench.score[1]
## nr task_id learner_id resampling_id iteration time_train regr.mse
## <int> <char> <char> <char> <int> <num> <num>
## 1: 1 constant regr.featureless cv 1 0.01 4.650449
## Hidden columns: uhash, task, learner, resampling, prediction
First, we create a new algorithm factor column in the code below, and take the subset of results for the nearest neighbors learning algorithms.
algo.levs <- c(
  "regr.ranger.tuned", "regr.ranger",
  "regr.xgboost.tuned", "regr.xgboost",
  "regr.kknn.tuned", nn.default$id, nn1$id,
  "regr.featureless")
only.nn <- bench.score[
, algorithm := factor(learner_id, algo.levs)
][
grepl("nn|featureless|neighbor", algorithm)
]
We use the code below to visualize the nearest neighbors test error values.
ggplot()+
  geom_point(aes(
    regr.mse, algorithm),
    shape=1,
    data=only.nn)+
  facet_grid(. ~ task_id, labeller=label_both, scales="free")+
  scale_x_log10(
    "Mean squared prediction error on the test set")
The plot above shows the mean squared prediction error on the X axis, and the learning algorithm on the Y axis. The three versions of nearest neighbors are shown, along with the baseline featureless predictor (which always predicts a constant value, the mean of the train labels). It is clear that
- the 1 nearest neighbor algorithm is the least accurate learning algorithm, and sometimes even has larger test error rates than the featureless baseline (in the constant and linear data).
- the regr.kknn algorithm (default/fixed number of neighbors is 7) yields reasonable predictions, which are most of the time better than the featureless baseline (except in the constant task).
- the regr.kknn.tuned algorithm has the least prediction error overall, because it uses an internal grid search to select the best number of neighbors, as a function of the training data.
The code below computes the selected number of neighbors for the tuned nearest neighbor algorithms:
only.nn[
  grepl("tuned", algorithm),
  data.table(
    task_id,
    neighbors=sapply(learner, function(L)L$tuning_result$k))
]
## task_id neighbors
## <char> <int>
## 1: constant 16
## 2: constant 20
## 3: constant 20
## 4: linear 20
## 5: linear 20
## 6: linear 20
## 7: quadratic 9
## 8: quadratic 7
## 9: quadratic 12
## 10: sin 9
## 11: sin 9
## 12: sin 7
## 13: step 16
## 14: step 7
## 15: step 12
It is clear that the tuning procedure selected numbers of neighbors which are sometimes different from the fixed values we used in the other learners (1 and 7). As expected, we see larger numbers of neighbors for the simpler patterns (constant and linear).
Next, we examine the test error rates of the random forest learners.
only.ranger <- bench.score[grepl("featureless|ranger", algorithm)]
ggplot()+
  geom_point(aes(
    regr.mse, algorithm),
    shape=1,
    data=only.ranger)+
  facet_grid(. ~ task_id, labeller=label_both, scales="free")+
  scale_x_log10(
    "Mean squared prediction error on the test set")
It is clear from the plot above that the tuned version is at least as accurate as the default, for each data set. In the case of the simpler tasks (constant and linear), the tuned version is substantially better.
Next, we examine the test error rates of the xgboost learners.
only.xgboost <- bench.score[grepl("xgboost|featureless", algorithm)]
ggplot()+
  geom_point(aes(
    regr.mse, algorithm),
    shape=1,
    data=only.xgboost)+
  facet_grid(. ~ task_id, labeller=label_both, scales="free")+
  scale_x_log10(
    "Mean squared prediction error on the test set")
The plot above shows that the tuned learner is at least as accurate as the default, for each data set. The tuned learner has significantly smaller prediction error for the more complex patterns (quadratic, sin, step).
Comparing learning algorithms
In this section we compare the tuned versions of the learning algorithms with each other.
only.tuned <- bench.score[grepl("tuned|featureless", algorithm)]
ggplot()+
  geom_point(aes(
    regr.mse, algorithm),
    shape=1,
    data=only.tuned)+
  facet_grid(. ~ task_id, labeller=label_both, scales="free")+
  scale_x_log10(
    "Mean squared prediction error on the test set")
The plot above shows that there is no single algorithm that is best for every data set, which is to be expected (no free lunch theorem). The best algorithm for task=step seems to be xgboost, whereas the nearest neighbors and ranger algorithms achieve similar accuracy on the other non-trivial tasks (linear, quadratic, sin).
Comparing train time
In this section we compare the time it takes to train the algorithms.
ref <- function(seconds, unit){
  data.table(seconds, label=paste0("1 ", unit, " "))
}
ref.dt <- rbind(
  ref(60, "minute"),
  ref(1, "second"))
ggplot()+
  theme_bw()+
  geom_point(aes(
    time_train, algorithm),
    shape=1,
    data=bench.score)+
  geom_vline(aes(
    xintercept=seconds),
    color="grey",
    data=ref.dt)+
  geom_text(aes(
    seconds, Inf, label=label),
    color="grey50",
    data=ref.dt,
    angle=90,
    vjust=1.2,
    hjust=1)+
  facet_grid(. ~ task_id, labeller=label_both, scales="free")+
  scale_x_log10(
    "Training time (seconds)",
    breaks=10^seq(-2,2,by=2))
It is clear from the plot above that the learners with hyper-parameter tuning take significantly more time to train than the learners with default/fixed hyper-parameters. This is to be expected, because an internal grid search must be computed, in order to determine the optimal hyper-parameters for each train set. The increased computation time is the price you have to pay for the increased prediction accuracy.
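To quantify that difference, one simple summary (an extra aggregation, not shown in the plots above) is the mean training time of each algorithm over all tasks and folds:
bench.score[, .(mean.train.seconds=mean(time_train)), by=algorithm][
  order(mean.train.seconds)]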
Conclusions
We have explained how to implement hyper-parameter tuning in the context of the mlr3 framework in R. We have shown that hyper-parameter tuning takes more time, but results in consistently more accurate predictions, compared to learning algorithms which use default/fixed hyper-parameters. We have additionally shown how to implement mlr3 benchmark experiments on a SLURM cluster, using the mlr3batchmark and batchtools packages.
Session info
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Phoenix
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics utils datasets grDevices methods base
##
## other attached packages:
## [1] future_1.33.1 ggplot2_3.4.4 data.table_1.15.0
##
## loaded via a namespace (and not attached):
## [1] future.apply_1.11.1 gtable_0.3.4 highr_0.10 dplyr_1.1.4 compiler_4.3.2
## [6] crayon_1.5.2 tidyselect_1.2.0 parallel_4.3.2 globals_0.16.2 scales_1.3.0
## [11] uuid_1.2-0 RhpcBLASctl_0.23-42 R6_2.5.1 mlr3tuning_0.19.2 labeling_0.4.3
## [16] generics_0.1.3 knitr_1.45.11 palmerpenguins_0.1.1 backports_1.4.1 checkmate_2.3.1
## [21] tibble_3.2.1 munsell_0.5.0 paradox_0.11.1 pillar_1.9.0 mlr3tuningspaces_0.4.0
## [26] mlr3measures_0.5.0 rlang_1.1.3 utf8_1.2.4 xfun_0.41.10 lgr_0.4.4
## [31] mlr3_0.17.2 mlr3misc_0.13.0 cli_3.6.2 withr_3.0.0 magrittr_2.0.3
## [36] digest_0.6.34 grid_4.3.2 mlr3learners_0.5.8 bbotk_0.7.3 lifecycle_1.0.4
## [41] vctrs_0.6.5 evaluate_0.23 glue_1.7.0 farver_2.1.1 listenv_0.9.1
## [46] codetools_0.2-19 parallelly_1.36.0 fansi_1.0.6 colorspace_2.1-0 tools_4.3.2
## [51] pkgconfig_2.0.3