# Random train/validation/test assignment

In the programming projects that I wrote for my CS499 Deep Learning class, I require the students to write code that splits a data set into certain percentages for train/validation/test. For example, in project 4 there is an 80% train / 20% test split, followed by a split of the train set into 60% subtrain / 40% validation. Sometimes we also want to split the entire data set into 60% train / 20% validation / 20% test.

One of my students implemented this in R using the `sample` function, as below:

```
train.test.N <- 4601
train.test.props <- c(train=0.8, test=0.2)
for(seed in 1:3){
  set.seed(seed)
  train.test.vec <- sample(
    names(train.test.props),
    train.test.N,
    replace = TRUE,
    prob = train.test.props)
  print(table(train.test.vec)/train.test.N)
}
```

The code above performs sampling with replacement, with identical probabilities for each draw, which gives different set sizes for each seed:

```
train.test.vec
     test     train
0.2077809 0.7922191
train.test.vec
     test     train
0.2082156 0.7917844
train.test.vec
     test     train
0.1990872 0.8009128
```

That code does pretty much what we want: random assignment of sets with approximately the given proportions. One minor issue was already mentioned above: the set sizes depend on the random seed.

What code should we write if we wanted to get the same set sizes for each random seed, and have those sets be the closest possible to our target proportions?

Using the `data.table` package I wrote the following function, which takes in a data set size `N` and a named numeric vector of proportions, and then returns a character vector of size `N`:

```
library(data.table)
random_set_vec <- function(N, prob.vec){
  obs.dt <- data.table(obs=1:N)
  cum.dt <- data.table(cum.obs=cumsum(prob.vec)*N, set=names(prob.vec))
  join.dt <- cum.dt[i=obs.dt, j=.(set=set[1]), on=.(cum.obs >= obs), by=.EACHI]
  sample(join.dt$set)
}
```

There are several issues with doing it as above, but first let's discuss the code. The function above uses the `cumsum` function to convert the proportions into cumulative counts (cut points). The join then uses several advanced features of the `data.table` package. First, `on=.(cum.obs >= obs)` indicates a non-equi join, which means to join the two data tables `obs.dt` and `cum.dt` for all rows such that `cum.obs` is greater than or equal to `obs`. Second, `by=.EACHI` means to group by each row of the given `i` data table, `obs.dt`. Finally, the `j=.(set=set[1])` argument indicates to summarize each group by creating a `set` column equal to the first element of the group, i.e. the first set whose cut point is at or above that observation. The resulting set assignments are then randomly shuffled via `sample`. This achieves the desired result:

```
train.test.N <- 4601
train.test.props <- c(train=0.8, test=0.2)
for(seed in 1:3){
  set.seed(seed)
  train.test.vec <- random_set_vec(train.test.N, train.test.props)
  print(head(train.test.vec))
  print(table(train.test.vec, useNA="ifany")/train.test.N)
}
```

That gives me the following output:

```
[1] "train" "train" "train" "test"  "train" "train"
train.test.vec
     test     train
0.2001739 0.7998261
[1] "test"  "train" "test"  "test"  "train" "train"
train.test.vec
     test     train
0.2001739 0.7998261
[1] "train" "test"  "train" "train" "train" "train"
train.test.vec
     test     train
0.2001739 0.7998261
```

The tables indicate that the set sizes are the same for each seed, and the printed head of the character vector indicates that the placement of the train/test values is indeed random (different for each seed).
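As an aside, the same idea (deterministic counts from cumulative cut points, followed by a random shuffle) can be sketched outside of R as well. Here is a hedged Python version, assuming only the standard library, where `bisect` plays the role of the non-equi join (find the first cut point at or above each observation):

```python
import bisect
import random
from itertools import accumulate

def random_set_vec(N, prob_dict):
    """Assign each of N observations to a named set, with set sizes
    that are deterministic functions of the cumulative proportions."""
    names = list(prob_dict)
    # Cumulative cut points scaled to N, like cumsum(prob.vec)*N in R.
    cum = list(accumulate(prob_dict[s] * N for s in names))
    # For each obs in 1..N, take the first set whose cut point is >= obs,
    # mirroring the non-equi join on cum.obs >= obs (min() guards against
    # floating point error in the last cumulative sum).
    sets = [names[min(bisect.bisect_left(cum, obs), len(names) - 1)]
            for obs in range(1, N + 1)]
    random.shuffle(sets)  # random placement, fixed counts
    return sets

sets = random_set_vec(4601, {"train": 0.8, "test": 0.2})
print(sets.count("train"), sets.count("test"))  # 3680 921
```

The counts are the same no matter the shuffle; only the placement of the labels changes, matching the behavior of the R function above.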

Note that this works for any number of sets, e.g.

```
tvt.props <- c(test=0.19, train=0.67, validation=0.14)
tvt.N <- 1234567
system.time({
  tvt.vec <- random_set_vec(tvt.N, tvt.props)
})
table(tvt.vec, useNA="ifany")/tvt.N
```

However, the group-by operation starts to get slow for big `N` like the above, e.g.

```
> system.time({
+   tvt.vec <- random_set_vec(tvt.N, tvt.props)
+ })
   user  system elapsed
  2.611   0.001   2.629
> table(tvt.vec, useNA="ifany")/tvt.N
tvt.vec
      test      train validation
 0.1899994  0.6700001  0.1400005
```

Is there an alternative that uses pure vector operations? The code below uses the same logic (cumsum etc.) but with the vectorized `data.table::fcase` function:

```
tvt.seq <- seq(1, tvt.N, l=tvt.N)/tvt.N
system.time({
  tvt.vec <- fcase(
    tvt.seq <= 0.19, "test",
    tvt.seq <= 0.86, "train",
    tvt.seq <= 1, "validation")
})
table(tvt.vec, useNA="ifany")/tvt.N
```

The output is:

```
   user  system elapsed
   0.04    0.00    0.04
tvt.vec
      test      train validation
 0.1900002  0.6699993  0.1400005
```

That is a lot faster! (about 65x, based on the elapsed times above) But if we want to use that approach when the proportions are supplied as data (rather than hard-coded as above), we need to do some meta-programming (writing code that writes code):

```
tvt.props <- c(test=0.19, train=0.67, validation=0.14)
tvt.N <- 1234567
tvt.seq <- seq(1, tvt.N, l=tvt.N)/tvt.N
system.time({
  tvt.cum <- cumsum(tvt.props)
  fcase.args <- list(list(quote(data.table::fcase)))
  for(set in names(tvt.cum)){
    set.cum <- tvt.cum[[set]]
    fcase.args[[set]] <- list(
      substitute(tvt.seq <= X, list(X=set.cum)),
      set)
  }
  call.args <- do.call(c, fcase.args)
  fcase.call <- as.call(unname(call.args))
  tvt.vec <- eval(fcase.call)
})
table(tvt.vec, useNA="ifany")/tvt.N
```

What is the code above doing? The big picture is that it creates `fcase.call`, an R language object which represents a line of un-evaluated R code that calls `fcase`, just as in the previous hard-coded example:

```
> fcase.call
data.table::fcase(tvt.seq <= 0.19, "test", tvt.seq <= 0.86, "train",
tvt.seq <= 1, "validation")
```

It is constructed by first initializing `fcase.args` as a list with one element: a list containing `quote(data.table::fcase)`. Then each iteration of the for loop adds a new element to the `fcase.args` list. Each element is a list with two elements:

```
> str(fcase.args[[2]])
List of 2
 $ : language tvt.seq <= 0.19
 $ : chr "test"
```

The first is a language object created by `substitute`, representing an un-evaluated piece of R code: a test for `tvt.seq` less than or equal to the given constant (which is substituted for the `X` variable in the R code above). The second is a character scalar constant, which will be the value used if the inequality is true. Both list elements will be used as arguments to `fcase`, as in the hard-coded version above. After the for loop we use `call.args <- do.call(c, fcase.args)` to get the list below:

```
> str(call.args)
List of 7
$ : language data.table::fcase
$ test1 : language tvt.seq <= 0.19
$ test2 : chr "test"
$ train1 : language tvt.seq <= 0.86
$ train2 : chr "train"
$ validation1: language tvt.seq <= 1
$ validation2: chr "validation"
```

When we pass the list above to `as.call`, it interprets the first element as the function to call, and the other elements as the arguments to that function. Finally, `tvt.vec <- eval(fcase.call)` converts the R language object (un-evaluated R code) to its evaluated result, giving us the desired output (a character vector of set names). The output I got from the code above is:

```
   user  system elapsed
  0.045   0.000   0.046
tvt.vec
      test      train validation
 0.1899994  0.6700001  0.1400005
```

The timing indicates that it is almost as fast as the hard-coded version above! (there is some overhead in construction of the list/language objects)
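As an aside, in a language with a vectorized binary search, the same cut-point logic needs no meta-programming at all. Here is a hedged sketch in Python, assuming NumPy is available; `np.searchsorted` finds, for each value, the index of the first cumulative cut point at or above it:

```python
import numpy as np

tvt_props = {"test": 0.19, "train": 0.67, "validation": 0.14}
tvt_N = 1234567

names = np.array(list(tvt_props))
# Cumulative cut points, like cumsum(tvt.props) in R; force the last
# one to exactly 1 to guard against floating point error in the sum.
cum = np.cumsum(list(tvt_props.values()))
cum[-1] = 1.0

# Evenly spaced values in (0, 1], like seq(1, tvt.N)/tvt.N in R.
tvt_seq = np.arange(1, tvt_N + 1) / tvt_N
# The first cut point >= each value gives that observation's set.
tvt_vec = names[np.searchsorted(cum, tvt_seq)]
```

This is data-driven (the proportions are an ordinary dict) and avoids constructing any un-evaluated code objects, at the cost of leaving R.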

Exercise 1 for the reader: how to further modify the code so that the set sizes do not depend on the order in which the test/train/validation sets are specified? More precisely, in the code above we have the result below, which depends on the order in which the sets are specified.

```
> table(random_set_vec(4601, c(train=0.8, test=0.2)))
 test train
  921  3680
> table(random_set_vec(4601, c(test=0.2, train=0.8)))
 test train
  920  3681
```

How to modify the code above so that the two versions give the same result?

Exercise 2 for the reader: how to modify the function so that it checks that the given proportions are valid? (each between 0 and 1 and sum to 1)

A final question / exercise 3 for the reader: are the set sizes optimal? Can you re-write the `random_set_vec` function so that it is guaranteed to return set sizes that are optimal according to the multinomial likelihood? (the answer is yes, it significantly simplifies the code, and it is still fast) Consider `tvt.N=4` with `c(train=2/3, test=1/3)` and `tvt.N=3` with `c(train=0.5, test=0.5)`. Hint 1: you can use `dbinom` or `dmultinom` to compute the likelihood of a set size, given the desired proportions. Hint 2: the set sizes that maximize the likelihood are given by the mode of the binomial/multinomial distribution. Here is a one-liner “solution” that uses these ideas, but needs some modification in order to work correctly for the given example:

```
P <- c(test=0.5, train=0.5)
N <- 3
sample(rep(names(P), floor(P*(N+1)))) # two test, two train: one label too many for N=3!
```
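One hedged way to finish the exercise (an assumption on my part, not necessarily the intended solution) is the largest-remainder method: floor each target count, then hand the leftover observations to the sets with the largest fractional parts. In the two small examples above this reproduces the binomial mode `floor((N+1)*p)`. A Python sketch, standard library only:

```python
import random

def optimal_set_sizes(N, props):
    """Integer set sizes close to the target proportions: floor each
    target count, then give leftover observations to the sets with
    the largest fractional parts (largest-remainder method)."""
    floors = {s: int(p * N) for s, p in props.items()}
    leftover = N - sum(floors.values())
    # Sort sets by fractional part, largest first (ties keep the
    # order in which the sets were specified -- see exercise 1).
    order = sorted(props, key=lambda s: props[s]*N - floors[s], reverse=True)
    for s in order[:leftover]:
        floors[s] += 1
    return floors

def random_set_vec(N, props):
    """Random assignment with deterministic, near-optimal set sizes."""
    labels = [s for s, n in optimal_set_sizes(N, props).items()
              for _ in range(n)]
    random.shuffle(labels)
    return labels

print(optimal_set_sizes(3, {"train": 0.5, "test": 0.5}))  # {'train': 2, 'test': 1}
print(optimal_set_sizes(4, {"train": 2/3, "test": 1/3}))  # {'train': 3, 'test': 1}
```

Note that the tie-breaking rule still depends on the order in which the sets are specified, so exercise 1 remains for the reader.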

## Implementations in machine learning libraries

In this section I present how train/test splits are implemented in some machine learning libraries.

First there is a sub-optimal implementation in `sklearn.model_selection.train_test_split`. Of course it does not make a significant difference in big data sets, but it is clear that the counts per set are sub-optimal for some small data sets, e.g.

```
from sklearn.model_selection import train_test_split
set_list = train_test_split(range(4), test_size=1.0/3, shuffle=False)
dict(zip(["train", "test"], set_list))
```

On my system the output of the code above is:

```
>>> dict(zip(["train", "test"], set_list))
{'test': [2, 3], 'train': [0, 1]}
>>> sklearn.__version__
'0.19.1'
```

Again that is not a huge problem, but returning a test set size of two when the test proportion is 1/3 has binomial probability of 0.296, whereas a test set size of one has binomial probability of 0.395, so it would be more appropriate/representative of the given proportion to return a test set size of one in this case.
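The binomial probabilities quoted above are easy to check in Python (standard library only, using `math.comb`):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Test set sizes for n=4 observations, test proportion p=1/3:
p_one = binom_pmf(1, 4, 1/3)  # test set of size one
p_two = binom_pmf(2, 4, 1/3)  # test set of size two (what sklearn returns)
print(round(p_one, 3), round(p_two, 3))  # 0.395 0.296
```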

Second, I checked the implementation in `mlr3::rsmp("holdout")`, which also appears to be sub-optimal, but in a different way:

```
library(mlr3)
N <- 2
prop.train <- 0.7
task <- tsk("iris")$filter(1:N)
rho <- rsmp("holdout", ratio=prop.train)
set.seed(1)
rho$instantiate(task)
rho$train_set(1)
rho$test_set(1)
dbinom(0:N, N, prop.train)
```

The code above creates a data set of size `N <- 2` and then requests a train/test split with a proportion of `prop.train <- 0.7` in the train set. In this case a train set size of 2 is optimal with respect to the binomial likelihood, but `mlr3` returns a split with one observation in the train set and one observation in the test set:

```
> rho$train_set(1)
[1] 1
> rho$test_set(1)
[1] 2
> dbinom(0:N, N, prop.train)
[1] 0.09 0.42 0.49
> packageVersion("mlr3")
[1] ‘0.2.0’
```

Again, no big deal for big/real data situations, but the code could be easily modified so that it is optimal in the small data case as well.

### Conclusions

In conclusion, I have presented a variety of different methods for dividing a data set into train/validation/test sets, based on given proportions. They all give quite similar results, but the ideas presented here may be desirable to obtain code that is (1) fast, (2) invariant to the order in which the sets are specified, and (3) guaranteed to return sets of the same/optimal size for any random seed.