In the programming projects that I wrote for my CS499 Deep Learning class, I require the students to write code that splits a data set into given train/validation/test proportions. For example in project 4 there is an 80% train / 20% test split, followed by a split of the train set into 60% subtrain / 40% validation. Sometimes we also want to split the entire data set into 60% train / 20% validation / 20% test.
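
Note that the two-stage split is not the same as the direct three-way split; a quick side calculation (not part of the original projects) shows the overall proportions that the 80%/20% then 60%/40% scheme implies:

# overall proportions implied by an 80/20 split followed by a 60/40
# split of the train set: 48% subtrain, 32% validation, 20% test.
outer.props <- c(train=0.8, test=0.2)
inner.props <- c(subtrain=0.6, validation=0.4)
c(inner.props * outer.props[["train"]], outer.props["test"])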

One of my students implemented this in R using the sample function as below

train.test.N <- 4601
train.test.props <- c(train=0.8, test=0.2)
for(seed in 1:3){
  set.seed(seed)
  train.test.vec <- sample(
    names(train.test.props),
    train.test.N,
    replace = TRUE,
    prob = train.test.props)
  print(table(train.test.vec)/train.test.N)
}

The code above performs sampling with replacement, with identical probabilities for each draw, which gives different set sizes for each seed:

train.test.vec
     test     train 
0.2077809 0.7922191 
train.test.vec
     test     train 
0.2082156 0.7917844 
train.test.vec
     test     train 
0.1990872 0.8009128 

That code does pretty much what we want: random assignment of sets with approximately the given proportions. One minor issue was already mentioned above: the set sizes depend on the random seed.

What code should we write if we wanted to get the same set sizes for each random seed, and have those sets be the closest possible to our target proportions?

Using the data.table package I wrote the following function which takes in a data set size N and a named numeric vector of proportions, and then returns a character vector of size N:


library(data.table)
random_set_vec <- function(N, prob.vec){
  # one row per observation to assign to a set.
  obs.dt <- data.table(obs=1:N)
  # cumulative number of observations at the end of each set.
  cum.dt <- data.table(cum.obs=cumsum(prob.vec)*N, set=names(prob.vec))
  # non-equi join: each observation gets the first set whose cumulative
  # count is at least as large as its index.
  join.dt <- cum.dt[i=obs.dt, j=.(set=set[1]), on=.(cum.obs >= obs), by=.EACHI]
  # random permutation so that assignment to sets is random rather than by row order.
  sample(join.dt$set)
}

There are several issues with doing it as above, but first let's discuss the code. The function uses cumsum in the line that creates cum.dt, in order to convert the proportions into cumulative observation counts. The line that creates join.dt then uses several advanced features of the data.table package. First, on=.(cum.obs >= obs) indicates a non-equi join, which means to join the two data tables obs.dt and cum.dt on all pairs of rows such that cum.obs is greater than or equal to obs. Second, by=.EACHI means to group by each row of the given i data table, obs.dt. Third, the j=.(set=set[1]) argument summarizes each group by creating a set column equal to its first element, i.e. the set with the smallest cumulative count that is still at least as large as the observation index. Finally, sample() randomly permutes the resulting vector of set names. To make the join concrete, here is a small sketch (with a toy N, not from the original code):
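
# Toy illustration (values assumed for this sketch, not from the original
# post): N=5 with c(train=0.8, test=0.2) gives cumulative counts 4 and 5.
toy.prob <- c(train=0.8, test=0.2)
toy.N <- 5
toy.obs <- data.table(obs=1:toy.N)
toy.cum <- data.table(cum.obs=cumsum(toy.prob)*toy.N, set=names(toy.prob))
# Observations 1 to 4 match both rows of toy.cum, so set[1] is "train";
# observation 5 only matches the "test" row.
toy.cum[i=toy.obs, j=.(set=set[1]), on=.(cum.obs >= obs), by=.EACHI]

Each observation is therefore assigned to the first set whose cumulative count it does not exceed (four train, one test in the toy sketch). Calling the function for several seeds shows that it achieves the desired result: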


train.test.N <- 4601
train.test.props <- c(train=0.8, test=0.2)
for(seed in 1:3){
  set.seed(seed)
  train.test.vec <- random_set_vec(train.test.N, train.test.props)
  print(head(train.test.vec))
  print(table(train.test.vec, useNA="ifany")/train.test.N)
}

That gives me the following output:

[1] "train" "train" "train" "test"  "train" "train"
train.test.vec
     test     train 
0.2001739 0.7998261 
[1] "test"  "train" "test"  "test"  "train" "train"
train.test.vec
     test     train 
0.2001739 0.7998261 
[1] "train" "test"  "train" "train" "train" "train"
train.test.vec
     test     train 
0.2001739 0.7998261 

The tables indicate that the set sizes are the same for each seed, and the printed head of the character vector shows that the placement of the train/test values is indeed random, i.e. different for each seed.

Note that this works for any number of sets, e.g.


tvt.props <- c(test=0.19, train=0.67, validation=0.14)
tvt.N <- 1234567
system.time({
  tvt.vec <- random_set_vec(tvt.N, tvt.props)
})
table(tvt.vec, useNA="ifany")/tvt.N

However the group by operation starts to get slow for big N like the above, e.g.

> system.time({
+   tvt.vec <- random_set_vec(tvt.N, tvt.props)
+ })
   user  system elapsed 
  2.611   0.001   2.629 
> table(tvt.vec, useNA="ifany")/tvt.N
tvt.vec
      test      train validation 
 0.1899994  0.6700001  0.1400005 

Is there an alternative that uses pure vector operations? The code below uses the same logic (the hard-coded thresholds 0.19, 0.86, 1 are the cumulative sums of the proportions), but with the vectorized data.table::fcase function; note that the final sample() shuffle is omitted, so this only times the set assignment step:


tvt.seq <- seq(1, tvt.N, l=tvt.N)/tvt.N
system.time({
  tvt.vec <- fcase(
    tvt.seq <= 0.19, "test",
    tvt.seq <= 0.86, "train",
    tvt.seq <= 1, "validation")
})
table(tvt.vec, useNA="ifany")/tvt.N

The output is:

   user  system elapsed 
   0.04    0.00    0.04 

tvt.vec
      test      train validation 
 0.1900002  0.6699993  0.1400005 

That is a lot faster! (roughly 65x, based on the elapsed times shown above) But if we want to use that approach when the proportions are supplied as data (rather than hard-coded as above), we need to do some meta-programming (writing code that writes code):


tvt.props <- c(test=0.19, train=0.67, validation=0.14)
tvt.N <- 1234567
tvt.seq <- seq(1, tvt.N, l=tvt.N)/tvt.N
system.time({
  tvt.cum <- cumsum(tvt.props)
  fcase.args <- list(list(quote(data.table::fcase)))
  for(set in names(tvt.cum)){
    set.cum <- tvt.cum[[set]]
    fcase.args[[set]] <- list(
      substitute(tvt.seq <= X, list(X=set.cum)),
      set)
  }
  call.args <- do.call(c, fcase.args)
  fcase.call <- as.call(unname(call.args))
  tvt.vec <- eval(fcase.call)
})
table(tvt.vec, useNA="ifany")/tvt.N

What is the code above doing? The big picture is that it creates fcase.call, an R language object which represents a line of un-evaluated R code that calls fcase, just as in the previous, hard-coded example:

> fcase.call
data.table::fcase(tvt.seq <= 0.19, "test", tvt.seq <= 0.86, "train", 
    tvt.seq <= 1, "validation")

It is constructed by first initializing fcase.args as a list with one element: a list containing quote(data.table::fcase). Then each iteration of the for loop adds a new element to the fcase.args list. Each element is a list with two elements:

> str(fcase.args[[2]])
List of 2
 $ : language tvt.seq <= 0.19
 $ : chr "test"

The first is a language object created by substitute, representing an un-evaluated piece of R code: a test for tvt.seq less than or equal to the given cumulative threshold (which substitute plugs in for the X variable). The second is a character scalar which will be the value returned by fcase when that inequality is true. Both list elements will be used as arguments to fcase, as in the hard-coded version above. After the for loop we use call.args <- do.call(c, fcase.args) to flatten the list of lists into the single list below:

> str(call.args)
List of 7
 $            : language data.table::fcase
 $ test1      : language tvt.seq <= 0.19
 $ test2      : chr "test"
 $ train1     : language tvt.seq <= 0.86
 $ train2     : chr "train"
 $ validation1: language tvt.seq <= 1
 $ validation2: chr "validation"

When we pass the list above to as.call (after removing the names with unname, so that they are not treated as argument names), it interprets the first element as the function to call, and the remaining elements as the arguments to that function. Finally, tvt.vec <- eval(fcase.call) evaluates the R language object (the un-evaluated R code), giving us the desired output (a character vector of set names). The output I got from the code above is:

   user  system elapsed 
  0.045   0.000   0.046 

tvt.vec
      test      train validation 
 0.1899994  0.6700001  0.1400005 

The timing indicates that it is almost as fast as the hard-coded version above! (there is some overhead in construction of the list/language objects)
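
If substitute, as.call, and eval are new to you, here is a tiny self-contained sketch of the same idea (a hypothetical example, not from the original code), which builds and evaluates a call to sum:

# build the un-evaluated call sum(1, 2, 3) from a list, then evaluate it.
sum.args <- list(quote(sum), 1, 2, 3)
sum.call <- as.call(sum.args)
sum.call        # prints sum(1, 2, 3)
eval(sum.call)  # returns 6
# substitute() builds a language object with a value plugged in for X.
substitute(x <= X, list(X=0.19))  # prints x <= 0.19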

Exercise 1 for the reader: how to further modify the code so that the set sizes do not depend on the order in which the test/train/validation sets are specified? More precisely, the code above gives the results below, which differ depending on the order in which the sets are specified.

> table(random_set_vec(4601, c(train=0.8, test=0.2)))

 test train 
  921  3680 
> table(random_set_vec(4601, c(test=0.2, train=0.8)))

 test train 
  920  3681 

How to modify the code above so that the two versions give the same result?

Exercise 2 for the reader: how to modify the function so that it checks that the given proportions are valid? (each between 0 and 1 and sum to 1)
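
For example, a minimal validity check (just a sketch of one possible approach, using a tolerance for the floating point sum; check_prob_vec is a name invented here, not part of the original code) could look like this:

# sketch: validate a named proportions vector before using it to split.
check_prob_vec <- function(prob.vec){
  stopifnot(
    is.numeric(prob.vec),
    !is.null(names(prob.vec)),
    all(prob.vec > 0 & prob.vec < 1),
    isTRUE(all.equal(sum(prob.vec), 1)))
}
check_prob_vec(c(train=0.8, test=0.2)) # passes silently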

A final question / exercise 3 for the reader: are the set sizes optimal? Can you re-write the random_set_vec function so that it is guaranteed to return set sizes that are optimal according to the multinomial likelihood? (The answer is yes; it significantly simplifies the code, and it is still fast.) Consider tvt.N=4 with c(train=2/3, test=1/3) and tvt.N=3 with c(train=0.5, test=0.5). Hint 1: you can use dbinom or dmultinom to compute the likelihood of a set size, given the desired proportions. Hint 2: the set sizes that maximize the likelihood are given by the mode of the binomial/multinomial distribution. Here is a one-liner “solution” that uses these ideas, but needs some modification in order to work correctly for the given examples:

P <- c(test=0.5, train=0.5)
N <- 3
sample(rep(names(P), floor(P*(N+1)))) # two test and two train: 4 observations, even though N=3.
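
To illustrate hint 1 on the first example (four observations with a train proportion of 2/3), the binomial likelihood of each candidate train set size can be computed as follows (a quick check, not a full solution to the exercise):

# likelihood of each candidate train set size for 4 observations with
# train proportion 2/3; the maximum is at 3 train / 1 test (the mode).
ex3.N <- 4
ex3.p.train <- 2/3
data.frame(train.size=0:ex3.N, likelihood=dbinom(0:ex3.N, size=ex3.N, prob=ex3.p.train))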

Implementations in machine learning libraries

In this section I present how train/test splits are implemented in some machine learning libraries.

First, there is a sub-optimal implementation in sklearn.model_selection.train_test_split. Of course it does not make a significant difference for big data sets, but the counts per set are clearly sub-optimal for some small data sets, e.g.

import sklearn  # imported so that sklearn.__version__ (shown below) works
from sklearn.model_selection import train_test_split
set_list = train_test_split(range(4), test_size=1.0/3, shuffle=False)
dict(zip(["train", "test"], set_list))

On my system the output of the code above is:

>>> dict(zip(["train", "test"], set_list))
{'test': [2, 3], 'train': [0, 1]}
>>> sklearn.__version__
'0.19.1'

Again that is not a huge problem, but returning a test set size of two when the test proportion is 1/3 has binomial probability of 0.296, whereas a test set size of one has binomial probability of 0.395, so it would be more appropriate/representative of the given proportion to return a test set size of one in this case.
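
Those two probabilities are easy to verify (a quick check; the values in the previous paragraph are rounded):

# binomial likelihood of test set sizes 1 and 2, out of 4 observations
# with test proportion 1/3: approximately 0.395 and 0.296 respectively.
dbinom(1:2, size=4, prob=1/3)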

Second, I checked the implementation in mlr3::rsmp("holdout"), which also appears to be sub-optimal, but in a different way:


library(mlr3)
N <- 2
prop.train <- 0.7
task <- tsk("iris")$filter(1:N)
rho <- rsmp("holdout", ratio=prop.train)
set.seed(1)
rho$instantiate(task)
rho$train_set(1)
rho$test_set(1)
dbinom(0:N, N, prop.train)

The code above creates a data set of size N <- 2 and then requests a train/test split with a proportion of prop.train <- 0.7 in the train set. In this case a train set size of 2 is optimal with respect to the binomial likelihood, but mlr3 returns a split with one observation in the train set and one observation in the test set:

> rho$train_set(1)
[1] 1
> rho$test_set(1)
[1] 2
> dbinom(0:N, N, prop.train)
[1] 0.09 0.42 0.49
> packageVersion("mlr3")
[1] ‘0.2.0’

Again, no big deal for big/real data situations, but the code could be easily modified so that it is optimal in the small data case as well.

Conclusions

In conclusion, I have presented a variety of different methods for dividing a data set into train/validation/test sets based on given proportions. They all give quite similar results, but the ideas presented here can be used to obtain code that is (1) fast, (2) invariant to the order in which the sets are specified, and (3) guaranteed to return sets of the same (optimal) size for any random seed.