Creating imbalanced data benchmarks
The goal of this post is to show how to create imbalanced classification data sets for use with our recently proposed SOAK algorithm, which can tell us whether we can train on imbalanced data and still get accurate predictions on balanced data (and vice versa).
Read and order MNIST data
We begin by reading the MNIST data,
library(data.table)
MNIST_dt <- fread("~/projects/cv-same-other-paper/data_Classif/MNIST.csv")
data.table(
name=names(MNIST_dt),
first_row=unlist(MNIST_dt[1]),
last_row=unlist(MNIST_dt[.N]))
## name first_row last_row
## <char> <char> <char>
## 1: predefined.set train test
## 2: y 5 6
## 3: 0 0 0
## 4: 1 0 0
## 5: 2 0 0
## ---
## 782: 779 0 0
## 783: 780 0 0
## 784: 781 0 0
## 785: 782 0 0
## 786: 783 0 0
We see that these MNIST data have a predefined.set column, which we will ignore (we use all of the data, from both train and test). The class distribution is as follows:
(multi_counts <- MNIST_dt[, .(
count=.N,
prop=.N/nrow(MNIST_dt)
), by=y][order(-prop)])
## y count prop
## <int> <int> <num>
## 1: 1 7877 0.11252857
## 2: 7 7293 0.10418571
## 3: 3 7141 0.10201429
## 4: 2 6990 0.09985714
## 5: 9 6958 0.09940000
## 6: 0 6903 0.09861429
## 7: 6 6876 0.09822857
## 8: 8 6825 0.09750000
## 9: 4 6824 0.09748571
## 10: 5 6313 0.09018571
We see in the table above that the classes are roughly balanced, each representing between about 9% and 11% of the data. We would like to preserve these proportions when we take subsamples of the data. To do that, we first put the rows in a random order (in case the original file had some special structure), and then assign a new proportional ordering based on the label value.
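To clarify what the proportional ordering does, here is a toy sketch (not part of the original pipeline): within each class, seq(0,1,l=.N) spreads the samples evenly over the interval from 0 to 1, so sorting by that value interleaves the classes while approximately preserving their proportions in any prefix.
toy <- data.table(y=c("a","a","a","a","b","b"))
toy[, prop_y := seq(0, 1, l=.N), by=y]
toy$y[order(toy$prop_y)]
## expected: "a" "b" "a" "a" "a" "b" (the 2:1 class proportion is roughly preserved in any prefix)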
my.seed <- 1
set.seed(my.seed)
rand_ord <- MNIST_dt[, sample(.N)]
prop_ord <- data.table(y=MNIST_dt$y[rand_ord])[
, prop_y := seq(0,1,l=.N), by=y
][, order(prop_y)]
ord_list <- list(
random=rand_ord,
proportional=rand_ord[prop_ord])
ord_prop_dt_list <- list()
for(ord_name in names(ord_list)){
ord_vec <- ord_list[[ord_name]]
y_ord <- MNIST_dt$y[ord_vec]
for(prop_data in c(0.01, 0.1, 1)){
N <- nrow(MNIST_dt)*prop_data
N_props <- data.table(y=y_ord[1:N])[, .(
count=.N,
prop_y=.N/N
), by=y][order(-prop_y)]
ord_prop_dt_list[[paste(ord_name, prop_data)]] <- data.table(
ord_name, prop_data, N_props)
}
}
ord_prop_dt <- rbindlist(ord_prop_dt_list)
dcast(ord_prop_dt, ord_name + prop_data ~ y, value.var="prop_y")
## Key: <ord_name, prop_data>
## ord_name prop_data 0 1 2 3 4 5 6 7
## <char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
## 1: proportional 0.01 0.09857143 0.1128571 0.10000000 0.10142857 0.09714286 0.09000000 0.09857143 0.1042857
## 2: proportional 0.10 0.09857143 0.1125714 0.09985714 0.10200000 0.09742857 0.09014286 0.09828571 0.1041429
## 3: proportional 1.00 0.09861429 0.1125286 0.09985714 0.10201429 0.09748571 0.09018571 0.09822857 0.1041857
## 4: random 0.01 0.10142857 0.1028571 0.11285714 0.08571429 0.09142857 0.12142857 0.08571429 0.1171429
## 5: random 0.10 0.10114286 0.1117143 0.10114286 0.10285714 0.08985714 0.09685714 0.10000000 0.1040000
## 6: random 1.00 0.09861429 0.1125286 0.09985714 0.10201429 0.09748571 0.09018571 0.09822857 0.1041857
## 8 9
## <num> <num>
## 1: 0.09714286 0.10000000
## 2: 0.09757143 0.09942857
## 3: 0.09750000 0.09940000
## 4: 0.11285714 0.06857143
## 5: 0.09914286 0.09328571
## 6: 0.09750000 0.09940000
The output above shows that the proportional ordering is much more stable than the random ordering: with the random ordering, the class proportions in each column vary noticeably between rows (values of prop_data), whereas with the proportional ordering they stay close to the full-data proportions.
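To make the stability claim quantitative, the following sketch (not part of the original analysis) computes, for each ordering and subsample size, the largest absolute difference between a class's proportion in the subsample and its proportion in the full data.
full_props <- unique(ord_prop_dt[prop_data==1, .(y, full_prop=prop_y)])
stability <- merge(ord_prop_dt, full_props, by="y")[
  , .(max_abs_diff=max(abs(prop_y-full_prop))), by=.(ord_name, prop_data)]
dcast(stability, ord_name ~ prop_data, value.var="max_abs_diff")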
Converting to binary
To convert the data to a binary problem, we create a new column that indicates whether the y label is odd (odd=1) or even (odd=0); this column will be used as the label in supervised binary classification.
(binary_counts <- MNIST_dt[
, odd := y %% 2
][, .(
count=.N,
prop=.N/nrow(MNIST_dt)
), by=odd][order(-prop)])
## odd count prop
## <num> <int> <num>
## 1: 1 35582 0.5083143
## 2: 0 34418 0.4916857
Above we see that each of the two classes is about equally prevalent.
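As a quick cross-check (not part of the original pipeline), these binary proportions should equal the sums of the per-digit proportions computed earlier in multi_counts.
multi_counts[, .(count=sum(count), prop=sum(prop)), by=.(odd = y %% 2)][order(-prop)]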
Creating class imbalance
We will create different versions of MNIST, each version having two subsets:
- one which is balanced: 50% positive labels (odd=1) and 50% negative labels (odd=0).
- one which is unbalanced: the same number of positive labels (odd=1) as in the balanced subset, but far fewer negative labels (odd=0), so that negatives make up only a small proportion of that subset (from 10% down to 0.1%).
To do that, we use the following code to compute the number of samples we would need:
larger_N <- binary_counts$count[1]/2
target_prop <- c(0.5, 0.1, 0.05, 0.01, 0.005, 0.001)
(smaller_dt <- data.table(
target_prop,
count=as.integer(target_prop*larger_N/(1-target_prop))
)[
, prop := count/(count+larger_N)
][])
## target_prop count prop
## <num> <int> <num>
## 1: 0.500 17791 0.5000000000
## 2: 0.100 1976 0.0999645874
## 3: 0.050 936 0.0499813104
## 4: 0.010 179 0.0099610462
## 5: 0.005 89 0.0049776286
## 6: 0.001 17 0.0009546271
Above we see:
- target_prop, the desired proportion of negative labels in the unbalanced subset,
- count, the number of negative labels in the unbalanced subset that gives a proportion closest to the target,
- prop, the actual proportion of negative labels in the unbalanced subset that corresponds to that count.
It is clear from the table above that the empirical proportions are consistent with the target proportions. Indeed, if the unbalanced subset keeps all larger_N positive samples and count negative samples, then solving count/(count+larger_N) = target_prop for count gives count = target_prop*larger_N/(1-target_prop), which is the formula used in the code above. To create the different unbalanced data sets, we first create a table with one row for each target proportion in the unbalanced subset:
(unb_small_dt <- data.table(
subset="unbalanced",
binary_counts[2,.(odd)],
smaller_dt[-1]))
## subset odd target_prop count prop
## <char> <num> <num> <int> <num>
## 1: unbalanced 0 0.100 1976 0.0999645874
## 2: unbalanced 0 0.050 936 0.0499813104
## 3: unbalanced 0 0.010 179 0.0099610462
## 4: unbalanced 0 0.005 89 0.0049776286
## 5: unbalanced 0 0.001 17 0.0009546271
The table above has one row for each unbalanced variant (from 10% to 0.1%) that we will create based on the original MNIST data. Below we use a for loop over the rows of this table, to assign rows to each subset of each unbalanced variant (each corresponding to a column of subset_mat).
subset_mat <- matrix(
NA, nrow(MNIST_dt), nrow(unb_small_dt),
dimnames=list(
NULL,
target_prop=paste0(
"seed",
my.seed,
"_prop",
unb_small_dt$target_prop)))
emp_y_list <- list()
emp_props_list <- list()
MNIST_ord <- MNIST_dt[, .(odd, .I)][ord_list$proportional]
for(unb_i in 1:nrow(unb_small_dt)){
unb_row <- unb_small_dt[unb_i]
unb_count_dt <- rbind(
data.table(subset="balanced", binary_counts[,.(odd)], smaller_dt[1]),
data.table(subset="unbalanced", binary_counts[1,.(odd)], smaller_dt[1]),
unb_row)
MNIST_ord[, subset := NA_character_]
for(o in c(1,0)){
o_dt <- unb_count_dt[odd==o]
sub_vals <- o_dt[, rep(subset, count)]
o_idx <- which(MNIST_ord$odd==o)
some_idx <- o_idx[1:length(sub_vals)]
MNIST_ord[some_idx, subset := sub_vals]
}
subset_mat[MNIST_ord$I, unb_i] <- MNIST_ord$subset
  ## Check to make sure unbalanced is a subset of the previous, larger one.
if(unb_i>1)stopifnot(all(which(
subset_mat[,unb_i]=="unbalanced"
) %in% which(
subset_mat[,unb_i-1]=="unbalanced"
)))
## Check to make sure balanced is the same as previous.
if(unb_i>1)stopifnot(identical(
which(subset_mat[,unb_i]=="balanced"),
which(subset_mat[,unb_i-1]=="balanced")
))
(unb_MNIST <- data.table(
target_prop=unb_row$target_prop,
subset=subset_mat[,unb_i],
odd=MNIST_dt$odd,
y=MNIST_dt$y)[, idx := .I][!is.na(subset)])
emp_y_list[[unb_i]] <- unb_MNIST[, .(
count=.N
), by=.(target_prop,subset,y)]
emp_props_list[[unb_i]] <- unb_MNIST[, .(
count=.N,
first=idx[1],
last=idx[.N]
), by=.(target_prop,subset,odd)
][
, prop_in_subset := count/sum(count)
, by=subset
][]
}
emp_y <- rbindlist(emp_y_list)
(emp_props <- rbindlist(emp_props_list))
## target_prop subset odd count first last prop_in_subset
## <num> <char> <num> <int> <int> <int> <num>
## 1: 0.100 balanced 1 17791 1 69997 0.5000000000
## 2: 0.100 balanced 0 17791 2 70000 0.5000000000
## 3: 0.100 unbalanced 1 17791 4 69999 0.9000354126
## 4: 0.100 unbalanced 0 1976 19 69990 0.0999645874
## 5: 0.050 balanced 1 17791 1 69997 0.5000000000
## 6: 0.050 balanced 0 17791 2 70000 0.5000000000
## 7: 0.050 unbalanced 1 17791 4 69999 0.9500186896
## 8: 0.050 unbalanced 0 936 198 69973 0.0499813104
## 9: 0.010 balanced 1 17791 1 69997 0.5000000000
## 10: 0.010 balanced 0 17791 2 70000 0.5000000000
## 11: 0.010 unbalanced 1 17791 4 69999 0.9900389538
## 12: 0.010 unbalanced 0 179 198 69888 0.0099610462
## 13: 0.005 balanced 1 17791 1 69997 0.5000000000
## 14: 0.005 balanced 0 17791 2 70000 0.5000000000
## 15: 0.005 unbalanced 1 17791 4 69999 0.9950223714
## 16: 0.005 unbalanced 0 89 770 69828 0.0049776286
## 17: 0.001 balanced 1 17791 1 69997 0.5000000000
## 18: 0.001 balanced 0 17791 2 70000 0.5000000000
## 19: 0.001 unbalanced 1 17791 4 69999 0.9990453729
## 20: 0.001 unbalanced 0 17 2715 66154 0.0009546271
## target_prop subset odd count first last prop_in_subset
The table above can be used to verify that the subset assignments are consistent with the target label proportions. It has one row for each unique combination of target proportion, subset, and label (odd). For each value of target proportion,
- the balanced subset contains exactly the same data, as can be seen by examining the count, first, and last columns. The prop_in_subset column shows that the class labels are balanced (half of each).
- the unbalanced subset contains the same positive (odd=1) samples (identical count, first, and last), but fewer and fewer negative (odd=0) samples as the target proportion decreases. The smaller unbalanced subsets are strict subsets of the larger ones (for example, the 5% unbalanced subset is also a part of the 10% unbalanced subset); both properties are re-checked in the sketch below.
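The following sketch (not part of the original pipeline) re-checks these two properties directly from subset_mat, outside the construction loop.
bal_idx_list <- lapply(1:ncol(subset_mat), function(j) which(subset_mat[,j]=="balanced"))
unb_idx_list <- lapply(1:ncol(subset_mat), function(j) which(subset_mat[,j]=="unbalanced"))
## the balanced subset is identical for every target proportion.
stopifnot(length(unique(bal_idx_list)) == 1)
## each unbalanced subset is nested inside the previous, larger one.
for(j in 2:length(unb_idx_list)){
  stopifnot(all(unb_idx_list[[j]] %in% unb_idx_list[[j-1]]))
}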
To verify that the underlying multi-class proportion is consistent in the down-sampled subsets, we use the code below:
dcast(emp_y, subset + target_prop ~ y, value.var="count")
## Key: <subset, target_prop>
## subset target_prop 0 1 2 3 4 5 6 7 8 9
## <char> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: balanced 0.001 3568 3939 3613 3571 3528 3156 3554 3646 3528 3479
## 2: balanced 0.005 3568 3939 3613 3571 3528 3156 3554 3646 3528 3479
## 3: balanced 0.010 3568 3939 3613 3571 3528 3156 3554 3646 3528 3479
## 4: balanced 0.050 3568 3939 3613 3571 3528 3156 3554 3646 3528 3479
## 5: balanced 0.100 3568 3939 3613 3571 3528 3156 3554 3646 3528 3479
## 6: unbalanced 0.001 3 3938 4 3570 3 3157 4 3647 3 3479
## 7: unbalanced 0.005 18 3938 18 3570 17 3157 18 3647 18 3479
## 8: unbalanced 0.010 36 3938 37 3570 35 3157 36 3647 35 3479
## 9: unbalanced 0.050 188 3938 190 3570 185 3157 187 3647 186 3479
## 10: unbalanced 0.100 397 3938 401 3570 391 3157 395 3647 392 3479
The table above has one row per subset, and one column per class (there are ten original classes in MNIST, 0-9). It is clear that
- the balanced subset always has the same number of samples in each class, for every value of target proportion.
- in the unbalanced subsets, the number of samples depends on the class label:
  - for odd labels, there is always the same number of samples.
  - for even labels, the number of samples depends on the target proportion, and is roughly uniform across classes (class 0 has about the same count as class 2, for example); this uniformity is checked in the sketch below.
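The sketch below (not part of the original pipeline) computes each even class's share of the even-labeled samples within every unbalanced subset; each share should be close to 1/5 if the down-sampling is uniform.
even_share <- emp_y[subset=="unbalanced" & y %% 2 == 0]
even_share[, share := count/sum(count), by=target_prop]
dcast(even_share, target_prop ~ y, value.var="share")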
Write the subset columns to a new CSV file
Finally, we write the subset assignment columns (one per unbalanced variant) to a new CSV file.
fwrite(subset_mat, "2025-03-21-unbalanced.csv")
## conversion automatique de classe pour x : matrix vers data.table
system("head 2025-03-21-unbalanced.csv")
Putting it all together
What if we wanted to do the same thing on several data sets? Or for several different random seeds? See code below.
library(data.table)
data_Classif <- "~/projects/cv-same-other-paper/data_Classif/"
for(data.name in c("EMNIST", "FashionMNIST", "MNIST")){
data.csv <- paste0(
data_Classif,
data.name,
".csv")
MNIST_dt <- fread(data.csv)
seed_mat_list <- list()
for(seed in 1:2){
set.seed(seed)
rand_ord <- MNIST_dt[, sample(.N)]
prop_ord <- data.table(y=MNIST_dt$y[rand_ord])[
, prop_y := seq(0,1,l=.N), by=y
][, order(prop_y)]
ord_list <- list(
random=rand_ord,
proportional=rand_ord[prop_ord])
(binary_counts <- MNIST_dt[
, odd := y %% 2
][, .(
count=.N,
prop=.N/nrow(MNIST_dt)
), by=odd][order(-prop, -odd)])
larger_N <- binary_counts$count[1]/2
target_prop <- c(0.5, 0.1, 0.05, 0.01, 0.005, 0.001)
(smaller_dt <- data.table(
target_prop,
count=as.integer(target_prop*larger_N/(1-target_prop))
)[
, prop := count/(count+larger_N)
][])
(unb_small_dt <- data.table(
subset="unbalanced",
binary_counts[2,.(odd)],
smaller_dt[-1]))
subset_mat <- matrix(
NA, nrow(MNIST_dt), nrow(unb_small_dt),
dimnames=list(
NULL,
target_prop=paste0(
"seed",
seed,
"_prop",
unb_small_dt$target_prop)))
emp_y_list <- list()
emp_props_list <- list()
MNIST_ord <- MNIST_dt[, .(odd, .I)][ord_list$proportional]
for(unb_i in 1:nrow(unb_small_dt)){
unb_row <- unb_small_dt[unb_i]
unb_count_dt <- rbind(
data.table(subset="balanced", binary_counts[,.(odd)], smaller_dt[1]),
data.table(subset="unbalanced", binary_counts[1,.(odd)], smaller_dt[1]),
unb_row)
MNIST_ord[, subset := NA_character_]
for(o in c(1,0)){
o_dt <- unb_count_dt[odd==o]
sub_vals <- o_dt[, rep(subset, count)]
o_idx <- which(MNIST_ord$odd==o)
some_idx <- o_idx[1:length(sub_vals)]
MNIST_ord[some_idx, subset := sub_vals]
}
subset_mat[MNIST_ord$I, unb_i] <- MNIST_ord$subset
    ## Check to make sure unbalanced is a subset of the previous, larger one.
if(unb_i>1)stopifnot(all(which(
subset_mat[,unb_i]=="unbalanced"
) %in% which(
subset_mat[,unb_i-1]=="unbalanced"
)))
## Check to make sure balanced is the same as previous.
if(unb_i>1)stopifnot(identical(
which(subset_mat[,unb_i]=="balanced"),
which(subset_mat[,unb_i-1]=="balanced")
))
(unb_MNIST <- data.table(
target_prop=unb_row$target_prop,
subset=subset_mat[,unb_i],
odd=MNIST_dt$odd,
y=MNIST_dt$y)[, idx := .I][!is.na(subset)])
emp_y_list[[unb_i]] <- unb_MNIST[, .(
count=.N
), by=.(target_prop,subset,y)]
emp_props_list[[unb_i]] <- unb_MNIST[, .(
count=.N,
first=idx[1],
last=idx[.N]
), keyby=.(target_prop,subset,odd)
][
, prop_in_subset := count/sum(count)
, by=subset
][]
}
emp_y <- rbindlist(emp_y_list)
(emp_props <- rbindlist(emp_props_list))
seed_mat_list[[seed]] <- subset_mat
}
print(data.name)
print(dcast(emp_y, subset + target_prop ~ y, value.var="count"))
(seed_dt <- do.call(data.table, seed_mat_list))
(out.csv <- sub("data_Classif", "data_Classif_unbalanced", data.csv))
dir.create(dirname(out.csv), showWarnings = FALSE, recursive = FALSE)
fwrite(seed_dt, out.csv)
}
## [1] "EMNIST"
## Key: <subset, target_prop>
## subset target_prop 0 1 2 3 4 5 6 7 8 9
## <char> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: balanced 0.001 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 2: balanced 0.005 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 3: balanced 0.010 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 4: balanced 0.050 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 5: balanced 0.100 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 6: unbalanced 0.001 4 3500 3 3500 3 3500 3 3500 4 3500
## 7: unbalanced 0.005 18 3500 17 3500 17 3500 17 3500 18 3500
## 8: unbalanced 0.010 36 3500 35 3500 35 3500 35 3500 35 3500
## 9: unbalanced 0.050 185 3500 184 3500 184 3500 184 3500 184 3500
## 10: unbalanced 0.100 389 3500 388 3500 389 3500 389 3500 389 3500
## [1] "FashionMNIST"
## Key: <subset, target_prop>
## subset target_prop 0 1 2 3 4 5 6 7 8 9
## <char> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: balanced 0.001 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 2: balanced 0.005 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 3: balanced 0.010 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 4: balanced 0.050 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 5: balanced 0.100 3500 3500 3500 3500 3500 3500 3500 3500 3500 3500
## 6: unbalanced 0.001 3 3500 3 3500 4 3500 4 3500 3 3500
## 7: unbalanced 0.005 17 3500 17 3500 18 3500 18 3500 17 3500
## 8: unbalanced 0.010 35 3500 35 3500 36 3500 35 3500 35 3500
## 9: unbalanced 0.050 184 3500 184 3500 185 3500 184 3500 184 3500
## 10: unbalanced 0.100 389 3500 389 3500 389 3500 389 3500 388 3500
## [1] "MNIST"
## Key: <subset, target_prop>
## subset target_prop 0 1 2 3 4 5 6 7 8 9
## <char> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: balanced 0.001 3568 3939 3613 3570 3528 3157 3554 3646 3528 3479
## 2: balanced 0.005 3568 3939 3613 3570 3528 3157 3554 3646 3528 3479
## 3: balanced 0.010 3568 3939 3613 3570 3528 3157 3554 3646 3528 3479
## 4: balanced 0.050 3568 3939 3613 3570 3528 3157 3554 3646 3528 3479
## 5: balanced 0.100 3568 3939 3613 3570 3528 3157 3554 3646 3528 3479
## 6: unbalanced 0.001 3 3938 4 3571 3 3156 4 3647 3 3479
## 7: unbalanced 0.005 18 3938 18 3571 17 3156 18 3647 18 3479
## 8: unbalanced 0.010 36 3938 37 3571 35 3156 36 3647 35 3479
## 9: unbalanced 0.050 188 3938 190 3571 185 3156 187 3647 186 3479
## 10: unbalanced 0.100 397 3938 401 3571 391 3156 395 3647 392 3479
system(paste("head", file.path(dirname(out.csv), "*")))
Note in the output above that the minority class is the same in each data set: the even class (odd=0) is the minority, and the odd class (odd=1) is the majority.
Conclusions
This tutorial showed how to create unbalanced data sets. Each unbalanced data set is represented as a new column (with the same number of rows as the original data file), with two values, balanced and unbalanced (and NA for rows not used in that variant), that can be efficiently saved to a new CSV file (without having to copy or modify the original data CSV). We checked that the new data obey certain constraints:
- balanced subsets are the same for different imbalance ratios,
- unbalanced subsets are nested (the smaller one is a strict subset of the larger one).
Each column in the resulting CSV files can be used to create a different mlr3 Task (each with a different definition of subset), so our recently proposed SOAK algorithm can be used to determine if a learning algorithm is able to generalize between data subsets with different proportions of labels (50% minority class versus 1%, etc).
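For example, here is a hedged sketch of how one subset column (from the MNIST file written earlier) might define an mlr3 task with a subset role for SOAK. The class and role names used below (TaskClassif, col_roles$subset, and ResamplingSameOtherSizesCV from the mlr3resampling package) are assumptions based on that package's documentation, not code verified in this post; the pixel column renaming is only a precaution, in case the backend requires syntactic column names.
library(mlr3)
library(mlr3resampling)
task_dt <- data.table(MNIST_dt, fread("2025-03-21-unbalanced.csv"))
task_dt <- task_dt[seed1_prop0.1 %in% c("balanced", "unbalanced")] # assigned rows only
task_dt[, odd := factor(odd)] # binary label as a factor for classification
setnames(task_dt, as.character(0:783), paste0("pixel", 0:783))
keep_cols <- c(paste0("pixel", 0:783), "odd", "seed1_prop0.1")
one_task <- TaskClassif$new(
  "MNIST_seed1_prop0.1", task_dt[, keep_cols, with=FALSE], target="odd")
one_task$col_roles$subset <- "seed1_prop0.1"  # SOAK trains/tests across these subsets
one_task$col_roles$stratum <- "seed1_prop0.1" # stratify folds by subset
one_task$col_roles$feature <- setdiff(one_task$col_roles$feature, "seed1_prop0.1")
soak_cv <- ResamplingSameOtherSizesCV$new()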
Session info
sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.17.0
##
## loaded via a namespace (and not attached):
## [1] compiler_4.4.3 tools_4.4.3 knitr_1.50 xfun_0.51 evaluate_1.0.3