Computing K-means train/validation error
Today in my CS499/599 class on Unsupervised Learning I explained how to efficiently compute the train/validation error of the K-means clustering algorithm in R. In class I only had time to explain a method using multi-dimensional arrays, but here I explain a few different methods.
Let’s use the classic iris data as a simple example. We begin by splitting the data into half train, half validation:
> X.mat <- as.matrix(iris[, 1:4])
> set.seed(1)
> shuffled.sets <- sample(rep(c("train", "validation"), l=nrow(X.mat)))
> table(shuffled.sets, iris$Species)
shuffled.sets setosa versicolor virginica
train 22 31 22
validation 28 19 28
Then we use the K-means algorithm to compute two cluster centers using only the train set:
> n.clusters <- 2
> kmeans.result <- stats::kmeans(X.mat[shuffled.sets=="train", ], n.clusters)
> kmeans.result$centers
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.280769 2.909615 4.892308 1.6576923
2 5.034783 3.404348 1.500000 0.2608696
The goal is to compute the total sum of squares for each set (train/validation). To do that we need to compute the sum of squares between each observation and each cluster center, then for each observation take the minimum over clusters. How to do that efficiently?
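As a reference point, here is a minimal sketch that translates that definition directly, using only the objects defined above; the methods below compute the same quantity more systematically:
> ## Sketch: for each observation, squared distance to each center, then min.
> direct.min.ss <- sapply(1:nrow(X.mat), function(obs){
+   min(colSums((X.mat[obs, ] - t(kmeans.result$centers))^2))
+ })
> tapply(direct.min.ss, shuffled.sets, sum)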
Multi-dimensional arrays
Multi-dimensional arrays can be used for fast vector operations in R. In this case we want an array with one element for each observation/feature/cluster combination, so we will use those names for the dimnames of our array:
> array.dim <- c(nrow(X.mat), ncol(X.mat), nrow(kmeans.result$centers))
> array.names <- list(obs=NULL, feature=NULL, cluster=NULL)
> X.array <- array(
+ X.mat, array.dim, array.names)
> head(X.array)
, , 1
feature
obs [,1] [,2] [,3] [,4]
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
[6,] 5.4 3.9 1.7 0.4
, , 2
feature
obs [,1] [,2] [,3] [,4]
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
[6,] 5.4 3.9 1.7 0.4
> head(X.mat)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
[6,] 5.4 3.9 1.7 0.4
Note how the third dimension has two elements, each a copy of the iris data matrix (one copy for each cluster center). We also need an array with copies of the cluster centers:
> m.array <- array(
+ rep(t(kmeans.result$centers), each=nrow(X.mat)), array.dim, array.names)
> head(m.array)
, , 1
feature
obs [,1] [,2] [,3] [,4]
[1,] 6.280769 2.909615 4.892308 1.657692
[2,] 6.280769 2.909615 4.892308 1.657692
[3,] 6.280769 2.909615 4.892308 1.657692
[4,] 6.280769 2.909615 4.892308 1.657692
[5,] 6.280769 2.909615 4.892308 1.657692
[6,] 6.280769 2.909615 4.892308 1.657692
, , 2
feature
obs [,1] [,2] [,3] [,4]
[1,] 5.034783 3.404348 1.5 0.2608696
[2,] 5.034783 3.404348 1.5 0.2608696
[3,] 5.034783 3.404348 1.5 0.2608696
[4,] 5.034783 3.404348 1.5 0.2608696
[5,] 5.034783 3.404348 1.5 0.2608696
[6,] 5.034783 3.404348 1.5 0.2608696
> kmeans.result$centers
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.280769 2.909615 4.892308 1.6576923
2 5.034783 3.404348 1.500000 0.2608696
Note how we used rep and t on the cluster center matrix to broadcast the values in the correct order. The computation then uses array-array arithmetic and apply functions:
> squares.array <- (X.array-m.array)^2
> sum.squares.mat <- apply(squares.array, c("obs", "cluster"), sum)
> min.vec <- apply(sum.squares.mat, "obs", min)
> tapply(min.vec, shuffled.sets, sum)
train validation
78.48633 74.84365
> kmeans.result$tot.withinss
[1] 78.48633
Note that our computation agrees with the result from kmeans.
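As an aside, the big obs/feature/cluster arrays can be avoided entirely using the expansion ||x-m||^2 = ||x||^2 - 2*x'm + ||m||^2; here is a sketch (using only X.mat, kmeans.result and shuffled.sets from above) that computes the observation-by-cluster matrix of squared distances via matrix multiplication:
> ## Sketch: squared distances without the 3-d arrays.
> centers.mat <- kmeans.result$centers
> dist2.mat <- matrix(
+   rowSums(X.mat^2), nrow(X.mat), nrow(centers.mat)) -
+   2 * X.mat %*% t(centers.mat) +
+   matrix(rowSums(centers.mat^2), nrow(X.mat), nrow(centers.mat), byrow=TRUE)
> tapply(apply(dist2.mat, 1, min), shuffled.sets, sum)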
Analogous method using data.table
The previous method involves computing/storing an array with one element for every observation/feature/cluster combination. How would we do an analogous computation using data.table? First we convert the matrices of data and cluster centers into data tables:
> (X.dt <- data.table::data.table(
+ value=as.numeric(X.mat),
+ obs=as.integer(row(X.mat)),
+ feature=as.integer(col(X.mat))))
value obs feature
1: 5.1 1 1
2: 4.9 2 1
3: 4.7 3 1
4: 4.6 4 1
5: 5.0 5 1
---
596: 2.3 146 4
597: 1.9 147 4
598: 2.0 148 4
599: 2.3 149 4
600: 1.8 150 4
> (m.dt <- data.table::data.table(
+ center=as.numeric(kmeans.result$centers),
+ cluster=as.integer(row(kmeans.result$centers)),
+ feature=as.integer(col(kmeans.result$centers))))
center cluster feature
1: 6.2807692 1 1
2: 5.0347826 2 1
3: 2.9096154 1 2
4: 3.4043478 2 2
5: 4.8923077 1 3
6: 1.5000000 2 3
7: 1.6576923 1 4
8: 0.2608696 2 4
Then we do a cartesian join on feature between these two data tables, in order to consider all possible combinations of observation/feature/cluster:
> squares.dt <- X.dt[m.dt, on="feature", allow.cartesian=TRUE]
> squares.dt[, square := (value-center)^2 ]
> squares.dt
value obs feature center cluster square
1: 5.1 1 1 6.2807692 1 1.394216
2: 4.9 2 1 6.2807692 1 1.906524
3: 4.7 3 1 6.2807692 1 2.498831
4: 4.6 4 1 6.2807692 1 2.824985
5: 5.0 5 1 6.2807692 1 1.640370
---
1196: 2.3 146 4 0.2608696 2 4.158053
1197: 1.9 147 4 0.2608696 2 2.686749
1198: 2.0 148 4 0.2608696 2 3.024575
1199: 2.3 149 4 0.2608696 2 4.158053
1200: 1.8 150 4 0.2608696 2 2.368922
> length(squares.array)
[1] 1200
Note that the code above shows that the storage complexity is the same as in the previous array method (1200 is the number of combinations of observation/feature/cluster). The final step is to use data table summarization operations with by (analogous to the apply operations in the previous method):
> sum.squares.dt <- squares.dt[, .(
+ sum.squares=sum(square)
+ ), by=c("obs", "cluster")]
> min.dt <- sum.squares.dt[, .(
+ min.ss=min(sum.squares)
+ ), by="obs"]
> min.dt[, .(
+ tot.withinss=sum(min.ss)
+ ), by=.(set=shuffled.sets[obs])]
set tot.withinss
1: validation 74.84365
2: train 78.48633
> kmeans.result$tot.withinss
[1] 78.48633
Again the resulting train sum of squares is consistent with the value returned by the kmeans function.
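As a stylistic note, the three summarization steps above can also be written as one chained data.table expression; this sketch reuses squares.dt from above:
> ## Sketch: the same three summarizations, chained.
> squares.dt[, .(sum.squares=sum(square)), by=.(obs, cluster)][
+   , .(min.ss=min(sum.squares)), by="obs"][
+   , .(tot.withinss=sum(min.ss)), by=.(set=shuffled.sets[obs])]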
More efficient method using data.table
Both methods above require memory storage which is proportional to the number of combinations of observation/feature/cluster, which may be prohibitive for big data sets. How can we do the same computation but using less memory storage?
We can use data.table, but in a different way. Rather than beginning with a cartesian join (which requires a lot of memory), we instead create a data table with columns for set and observation id:
> (set.obs.ids <- data.table::data.table(
+ set=shuffled.sets, obs=seq_along(shuffled.sets)))
set obs
1: validation 1
2: train 2
3: validation 3
4: validation 4
5: validation 5
---
146: validation 146
147: train 147
148: validation 148
149: validation 149
150: validation 150
Then we need to think about the return value that we want, which is a data table with two rows (one for each set) and two columns (set, tot.withinss). To get that data table we use two nested data table summarization operations: the outer one by=set and the inner one by=obs:
> set.obs.ids[, {
+ set.obs.mins <- .SD[, {
+ diff.mat <- X.mat[obs, ] - t(kmeans.result$centers)
+ .(min.ss=min(colSums(diff.mat^2)))
+ }, by="obs"]
+ set.obs.mins[, .(tot.withinss=sum(min.ss))]
+ }, by="set"]
set tot.withinss
1: validation 74.84365
2: train 78.48633
> kmeans.result$tot.withinss
[1] 78.48633
The code above uses .SD, which means the Subset of Data corresponding to a single value of by=set. Using .SD[, .(min.ss=...), by=obs] means to compute, for each observation in that set, the minimum sum of squares across all clusters.
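If the .SD idiom is unfamiliar, here is a toy sketch (unrelated to the iris data) showing that each .SD is a data table containing only the rows of one group, excluding the by column:
> ## Sketch: .SD is the per-group subset of the data table.
> toy.dt <- data.table::data.table(g=c("a", "a", "b"), x=1:3)
> toy.dt[, .(rows=nrow(.SD), x.sum=sum(.SD$x)), by="g"]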
Analogous version using for loops
The data table code above may seem a bit strange, but it is really just using by instead of for loops. Here is a translation which is perhaps easier for some readers to understand:
> set.names <- unique(shuffled.sets)
> tot.withinss <- structure(
+ rep(NA_real_, length(set.names)),
+ names=set.names)
> for(set in set.names){
+ X.set <- X.mat[shuffled.sets == set, ]
+ set.obs.mins <- rep(NA_real_, nrow(X.set))
+ for(obs in 1:nrow(X.set)){
+ diff.mat <- X.set[obs, ] - t(kmeans.result$centers)
+ set.obs.mins[[obs]] <- min(colSums(diff.mat^2))
+ }
+ tot.withinss[[set]] <- sum(set.obs.mins)
+ }
> tot.withinss
validation train
74.84365 78.48633
> kmeans.result$tot.withinss
[1] 78.48633
In the code above X.set is the analog of .SD (the data for one set), and in each iteration of the for loops we fill in an entry of the numeric vectors set.obs.mins and then tot.withinss.
List of data tables idiom
Yet another method is to use the list of data tables idiom:
> tot.withinss.dt.list <- list()
> for(set in set.names){
+ X.set <- X.mat[shuffled.sets == set, ]
+ set.obs.mins.dt.list <- list()
+ for(obs in 1:nrow(X.set)){
+ diff.mat <- X.set[obs, ] - t(kmeans.result$centers)
+ set.obs.mins.dt.list[[obs]] <- data.table::data.table(
+ obs, min.ss=min(colSums(diff.mat^2)))
+ }
+ set.obs.mins.dt <- do.call(rbind, set.obs.mins.dt.list)
+ tot.withinss.dt.list[[set]] <- set.obs.mins.dt[, .(
+ set,
+ tot.withinss=sum(min.ss))]
+ }
> (tot.withinss.dt <- do.call(rbind, tot.withinss.dt.list))
set tot.withinss
1: validation 74.84365
2: train 78.48633
> kmeans.result$tot.withinss
[1] 78.48633
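One design note on this idiom: for combining a list of data tables, data.table::rbindlist is usually preferred to do.call(rbind, ...), since it is faster when the list is long, e.g.
> ## Sketch: rbindlist is the idiomatic way to combine a list of data tables.
> data.table::rbindlist(tot.withinss.dt.list)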
Other methods
There are three different methods we have seen above:
- for loop using base R data structures, filling an entry of a numeric vector during each iteration.
- for loop using list of data tables idiom.
- data table summarization using by instead of a for loop.
There are also three different things we can do the loop/by over:
- observations only.
- clusters only.
- sets and observations.
Exercise for the reader: implement all 9 combinations of the above.
Comparing computational requirements
The methods presented above all compute the same result, but which is most efficient?
The figure below presents a comparison of timings of several different methods, for the zip.test data set (7291 x 256) and from 2 to 5 clusters.
From the first panel above we can see that all three pkg/loop combinations take about the same time, less than kmeans itself (black), if we do by/for over clusters only.
From the second panel above we can see that data.table methods are actually much slower than the base R for loop, when we are doing by/for over observations only.
From the third panel above we can see that if we do by/for over both set and observations, then the list of data tables idiom is much slower than the other two methods (data.table by and base R for loop).
Overall from the figure above it is clear that the most important thing to consider in implementing this computation is to minimize the number of things to iterate over (the number of clusters is definitely smaller than the number of observations).
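To make the "iterate over clusters only" idea concrete, here is a sketch of one such implementation (a base R for loop over clusters, one of the nine combinations in the exercise above):
> ## Sketch: loop over clusters only, one column of squared distances each.
> ss.per.cluster <- matrix(NA_real_, nrow(X.mat), n.clusters)
> for(cluster in 1:n.clusters){
+   center.mat <- matrix(
+     kmeans.result$centers[cluster, ], nrow(X.mat), ncol(X.mat), byrow=TRUE)
+   ss.per.cluster[, cluster] <- rowSums((X.mat - center.mat)^2)
+ }
> tapply(apply(ss.per.cluster, 1, min), shuffled.sets, sum)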
The figure below takes the best method from above and compares it to the all.combos (array/cartesian join) methods:
The top panel shows that the all.combos methods take more time than the data.table.by.cluster method (which was tied as fastest in the previous figure).
It is clear from the bottom panel that the all.combos methods take much more memory, which is expected.
Overall it is clear that the cartesian join should be avoided if at all possible, and when writing by/for operations you should minimize the number of items you iterate over.