See the GitHub repository for links to source code and exercises: https://github.com/tdhock/change-tutorial
Before executing the code in this tutorial Rmd, make sure to install the required packages:
packages.R <- if(file.exists("packages.R")){
"packages.R"
}else{
"https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R"
}
source(packages.R)
## Loading required package: survival
## Loading required package: data.table
## Loading required package: neuroblastoma
## Loading required package: future
##
## Attaching package: 'future'
## The following object is masked from 'package:survival':
##
## cluster
## Loading required package: Segmentor3IsBack
## Segmentor3IsBack v2.0 Loaded
## Loading required package: changepoint
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Successfully loaded changepoint package version 2.3.1
## NOTE: Predefined penalty values changed in version 2.2. Previous penalty values with a postfix 1 i.e. SIC1 are now without i.e. SIC and previous penalties without a postfix i.e. SIC are now with a postfix 0 i.e. SIC0. See NEWS and help files for further details.
## Loading required package: directlabels
## Loading required package: ggplot2
## Loading required package: geometry
## Loading required package: penaltyLearning
options(width=100)
There are two features of a data set that make it possible to do a supervised changepoint analysis: several separate data sequences to segment, and labels that indicate the presence or absence of changepoints in particular subsets of those sequences.
The goal of the supervised analysis is to find a model that learns from the limited labeled data, and provides accurate changepoint predictions for all of the data (a different number of changepoints for each separate data sequence).
We will use the neuroblastoma data set to explore supervised changepoint detection.
data(neuroblastoma, package="neuroblastoma")
str(neuroblastoma)
## List of 2
## $ profiles :'data.frame': 4616846 obs. of 4 variables:
## ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 7 7 7 7 7 7 7 7 7 7 ...
## ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ position : int [1:4616846] 809681 928433 987423 1083595 1125548 1199359 1392490 1672814 1985786 2046695 ...
## ..$ logratio : num [1:4616846] -0.01742 0.06626 0.06764 0.04264 0.00576 ...
## $ annotations:'data.frame': 3418 obs. of 5 variables:
## ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 207 322 379 144 388 497 564 331 189 315 ...
## ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 11 11 11 11 11 11 11 11 11 11 ...
## ..$ min : int [1:3418] 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 ...
## ..$ max : int [1:3418] 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 ...
## ..$ annotation: Factor w/ 2 levels "breakpoint","normal": 2 2 2 2 2 2 1 2 2 2 ...
lapply(neuroblastoma, head, 10)
## $profiles
## profile.id chromosome position logratio
## 1 8 1 809681 -0.017417053
## 2 8 1 928433 0.066261442
## 3 8 1 987423 0.067638717
## 4 8 1 1083595 0.042644337
## 5 8 1 1125548 0.005759269
## 6 8 1 1199359 0.000000000
## 7 8 1 1392490 0.069014678
## 8 8 1 1672814 0.204140717
## 9 8 1 1985786 0.039840265
## 10 8 1 2046695 0.073134705
##
## $annotations
## profile.id chromosome min max annotation
## 1 213 11 53700000 135006516 normal
## 2 333 11 53700000 135006516 normal
## 3 393 11 53700000 135006516 normal
## 4 148 11 53700000 135006516 normal
## 5 402 11 53700000 135006516 normal
## 6 521 11 53700000 135006516 normal
## 7 590 11 53700000 135006516 breakpoint
## 8 342 11 53700000 135006516 normal
## 9 195 11 53700000 135006516 normal
## 10 326 11 53700000 135006516 normal
The neuroblastoma data set consists of two data.frames: profiles (the noisy data) and annotations (the labels).
These data come from children who were treated at the Institut Curie (Paris, France). For six children we also have follow-up data on whether they recovered (ok) or had a relapse, several years after treatment:
followup <- data.frame(
profile.id=paste(c(10, 8, 4, 6, 11, 1)),
status=c("ok", "relapse", "relapse", "ok", "ok", "relapse"))
followup
## profile.id status
## 1 10 ok
## 2 8 relapse
## 3 4 relapse
## 4 6 ok
## 5 11 ok
## 6 1 relapse
When there are no labels, the input for a changepoint analysis is just the noisy DNA copy number data in neuroblastoma$profiles$logratio. We begin by selecting the profiles of the six patients for which we have follow-up data.
rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
some <- subset(all.profiles, paste(profile.id) %in% some.ids)
status.profile <- followup[paste(some$profile.id), "status.profile"]
some$status.profile <- ifelse(
is.na(status.profile), paste(some$profile.id), status.profile)
data.table(some)
}
neuroblastoma.dt <- data.table(neuroblastoma$profiles)
## neuroblastoma.dt[, logratio.norm := logratio/mad(logratio), by=list(
## profile.id, chromosome)]
neuroblastoma.dt[, logratio.norm := logratio]
six.profiles <- someProfiles(neuroblastoma.dt)
For these six patients, the total number of separate changepoint detection problems is 24 chromosomes x 6 patients = 144 problems. We plot each problem below in a separate facet (panel).
library(ggplot2)
gg.unsupervised <- ggplot()+
ggtitle("unsupervised changepoint detection = only noisy data sequences")+
theme(
panel.spacing=grid::unit(0, "lines"),
panel.border=element_rect(fill=NA, color="grey50")
)+
facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
geom_point(aes(position/1e6, logratio),
data=six.profiles,
shape=1)+
scale_x_continuous(
"position on chromosome (mega bases)",
breaks=c(100, 200))+
scale_y_continuous(
"logratio (approximate DNA copy number)",
limits=c(-1,1)*1.1)
print(gg.unsupervised)
## Warning: Removed 65 rows containing missing values (geom_point).
Note the facet titles on the right, which show the status (ok/relapse) and profile ID of each of the six patients.
Question: can you visually identify any differences between the “ok” and “relapse” profiles?
How to choose the number of changepoints? In the unsupervised setting, our only option is to use theoretical/statistical arguments, which give information criteria such as the Akaike Information Criterion (AIC, penalty=2) or the Bayesian/Schwarz Information Criterion (BIC/SIC, penalty=log n). More on this later.
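As a sketch of how such an information criterion chooses the number of changepoints, the penalized criterion picks the number of segments s minimizing loss plus penalty times s. The loss values below are hypothetical, for illustration only.

```r
# Sketch of information-criterion model selection (hypothetical loss values).
# loss[s] is the cost (e.g. sum of squared residuals) of the model with s segments.
loss <- c(50, 20, 18, 17.5, 17.2)
n.data <- 100 # number of data points in the sequence

# Penalized criterion: choose s minimizing loss[s] + penalty * s.
select.segments <- function(loss, penalty){
  which.min(loss + penalty * seq_along(loss))
}

(aic.segments <- select.segments(loss, 2))           # AIC: penalty = 2
(bic.segments <- select.segments(loss, log(n.data))) # BIC/SIC: penalty = log(n)
```

For these hypothetical losses, both penalties select the same model size; with real data the two criteria often disagree.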
In supervised changepoint detection, there are labels which indicate presence and absence of changepoints in particular data subsets. For the same set of 6 profiles, we superimpose the labels in the plot below.
six.labels <- someProfiles(neuroblastoma$annotations)
change.colors <- c(
"1change"="#ff7d7d",
breakpoint="#a445ee",
normal="#f6c48f"
)
gg.supervised <- gg.unsupervised+
ggtitle(paste(
"supervised changepoint detection = data + labels",
"that indicate specific regions with or without changepoints"))+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, fill=annotation),
alpha=0.5,
color=NA,
data=six.labels)+
scale_fill_manual("label", values=change.colors)+
theme(legend.position="bottom")
print(gg.supervised)
## Warning: Removed 65 rows containing missing values (geom_point).
New function: geom_tallrect is like geom_rect, but ymin is always the bottom of the panel and ymax is always the top of the panel. These rectangles are useful for drawing labeled intervals of x values.
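If geom_tallrect is not available, a similar effect can be obtained from plain geom_rect by fixing ymin=-Inf and ymax=Inf, so the rectangle spans the full panel height. This is a sketch with made-up toy data, not the penaltyLearning implementation.

```r
# Emulating geom_tallrect with geom_rect: ymin=-Inf and ymax=Inf make the
# rectangle span the full panel height (toy data for illustration).
library(ggplot2)
toy.labels <- data.frame(xmin=2, xmax=4, annotation="normal")
toy.data <- data.frame(x=1:5, y=c(0.1, -0.2, 0.3, 0.0, 0.2))
p <- ggplot()+
  geom_rect(aes(xmin=xmin, xmax=xmax, fill=annotation),
            ymin=-Inf, ymax=Inf, alpha=0.5, color=NA,
            data=toy.labels)+
  geom_point(aes(x, y), shape=1, data=toy.data)
```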
It is clear from this plot that the neuroblastoma data satisfy the two criteria which are necessary for a supervised changepoint analysis: many separate data sequences, some of which have labels indicating regions with and without changepoints.
As we will see later in the tutorial, the labels can be used for choosing the best model and parameters. The main idea is that the changepoint model should be consistent with the labels: changepoints should be predicted inside breakpoint labels, and no changepoints should be predicted inside normal labels.
In this section we show that typical unsupervised changepoint models result in prediction errors with respect to the labels in the neuroblastoma data. We begin by using the changepoint::cpt.mean function to compute an unsupervised changepoint model for each data sequence. To compute the number of incorrect labels later, make sure to save a description of the model in a “tidy” data.frame (observations on rows and variables on columns). In the code below we use:
- by=list(status.profile, chromosome), which adds those two columns to the resulting data.table.
- a pen.name column to identify the model.
- a changes list column to save the changepoint positions (in the same units as the label positions: bases on the chromosome).
pen.name <- "SIC0"
(unsupervised.models <- six.profiles[, {
fit.pelt <- changepoint::cpt.mean(
logratio.norm, penalty=pen.name, method="PELT")
end <- fit.pelt@cpts
before.change <- end[-length(end)]
after.change <- before.change+1L
data.table(
pen.name,
pen.value=fit.pelt@pen.value,
changes=list(
as.integer((position[before.change]+position[after.change])/2)
))
}, by=list(profile.id, status.profile, chromosome)])
## profile.id status.profile chromosome pen.name pen.value changes
## 1: 8 relapse 8 1 SIC0 6.013715
## 2: 8 relapse 8 2 SIC0 5.375278
## 3: 8 relapse 8 3 SIC0 5.081404
## 4: 8 relapse 8 4 SIC0 4.882802
## 5: 8 relapse 8 5 SIC0 5.164786
## ---
## 140: 10 ok 10 20 SIC0 4.828314
## 141: 10 ok 10 21 SIC0 4.110874
## 142: 10 ok 10 22 SIC0 4.543295
## 143: 10 ok 10 X SIC0 5.231109
## 144: 10 ok 10 Y SIC0 3.295837
Next, we convert the list column changes to tall format (one row for each changepoint).
(unsupervised.changes <- unsupervised.models[, data.table(
change=changes[[1]]
), by=list(profile.id, status.profile, chromosome, pen.name)])
## profile.id status.profile chromosome pen.name change
## 1: 8 relapse 8 7 SIC0 141405948
## 2: 8 relapse 8 11 SIC0 70573642
## 3: 8 relapse 8 17 SIC0 29895917
## 4: 8 relapse 8 Y SIC0 4459374
## 5: 11 ok 11 1 SIC0 32819999
## 6: 11 ok 11 Y SIC0 5918094
## 7: 11 ok 11 Y SIC0 11393013
## 8: 11 ok 11 Y SIC0 23715526
## 9: 4 relapse 4 1 SIC0 59792500
## 10: 4 relapse 4 2 SIC0 45164625
## 11: 4 relapse 4 3 SIC0 69953902
## 12: 4 relapse 4 14 SIC0 76603452
## 13: 4 relapse 4 17 SIC0 41646489
## 14: 1 relapse 1 1 SIC0 212809180
## 15: 1 relapse 1 7 SIC0 60381530
## 16: 1 relapse 1 11 SIC0 80058339
## 17: 1 relapse 1 Y SIC0 5918094
## 18: 1 relapse 1 Y SIC0 9420652
## 19: 1 relapse 1 Y SIC0 19070635
New function: we now introduce the penaltyLearning::labelError function, which computes the number of incorrect labels for a set of predicted changepoints. Its inputs are 3 data.frames:
- models: one row per changepoint model.
- labels: one row for each label, with columns for label location and type.
- changes: one row for each predicted changepoint, with a column for the predicted changepoint position.
unsupervised.error.list <- penaltyLearning::labelError(
unsupervised.models, six.labels, unsupervised.changes,
problem.vars=c("status.profile", "chromosome"),
model.vars="pen.name",
change.var="change")
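The rule that labelError applies can be sketched in base R: a breakpoint label with no predicted changepoint inside it is a false negative, and a normal label with any predicted changepoint inside it is a false positive. The helper below is a simplified illustration (a hypothetical toy, not the penaltyLearning implementation).

```r
# Toy sketch of the label-error rule (not the penaltyLearning implementation).
# labels: data.frame with columns min, max, annotation ("breakpoint"/"normal").
# changes: numeric vector of predicted changepoint positions.
count.label.errors <- function(labels, changes){
  n.inside <- sapply(seq_len(nrow(labels)), function(i){
    sum(labels$min[i] < changes & changes < labels$max[i])
  })
  fn <- labels$annotation == "breakpoint" & n.inside == 0
  fp <- labels$annotation == "normal" & n.inside > 0
  data.frame(labels, n.inside, status = ifelse(
    fn, "false negative", ifelse(fp, "false positive", "correct")))
}

toy.labels <- data.frame(
  min = c(10, 50), max = c(30, 90),
  annotation = c("breakpoint", "normal"))
# One predicted change at 60: outside the breakpoint label (false negative),
# inside the normal label (false positive).
err <- count.label.errors(toy.labels, changes = c(60))
```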
The result of labelError is a list of two data tables:
- model.errors has one row per (sequence,model), with the total number of incorrect labels. We use these totals in the ggtitle of the plot below.
- label.errors has one row per (sequence,model,label), with the fp/fn status of each label. We use these to visualize false negative and false positive predictions using the geom_tallrect with aes(linetype=status) below.
gg.supervised+
theme(legend.box="horizontal")+
unsupervised.error.list$model.errors[, ggtitle(paste0(
"Unsupervised ", pen.name, " penalty has ",
sum(errors),
" incorrect labels: ",
sum(fp), " FP + ",
sum(fn), " FN"))]+
geom_vline(aes(
xintercept=change/1e6),
color="green",
size=1,
linetype="dashed",
data=unsupervised.changes)+
geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, linetype=status),
fill=NA,
data=unsupervised.error.list$label.errors)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))
## Warning: Removed 65 rows containing missing values (geom_point).
Note that in the plot above we also plotted the predicted changepoints using a green dashed geom_vline. Labels with inconsistent changepoints are shown with a black border in the geom_tallrect, using scale_linetype_manual to show the type of prediction error: a false negative label contains too few predicted changes, and a false positive label contains too many.
Remark: the result above (BIC/SIC predicts too few changepoints, resulting in false negatives) is specific to the neuroblastoma data set. However, in general you can expect that unsupervised penalty functions will commit changepoint detection errors which are visually obvious. This is the main motivation of supervised changepoint analysis: we can use labels to learn a penalty function with improved changepoint detection accuracy.
Exercise 1 during class: try changing the pen.name variable to another theoretically-motivated unsupervised penalty. Other options from help(cpt.mean) are SIC (without a zero), MBIC, AIC, and Hannan-Quinn. Can you find a penalty with improved changepoint detection accuracy? Also try changing the some.ids variable (to some profile IDs from 1 to 603) to see the changepoints and label errors for another set of profiles.
To quickly create labels for many data sequences, I recommend using a GUI in which you can use the mouse to draw label rectangles on the scatterplots. I wrote annotate_regions.py for labeling the neuroblastoma data set, and SegAnnDB for labeling general DNA copy number profile data. You may want to write another GUI for your particular data – make sure it supports visualizing the data sets and interactive labeling.
Even if you don’t have GUI software, you can always just plot the data in R, and then write the labels in the R code. Begin by plotting one data set by itself to see the details of the signal and noise.
zoom.profile <- 4 #Exercise: change this value!
zoom.chromosome <- 14 #Exercise: change this value!
zoom.pro <- neuroblastoma.dt[
profile.id==zoom.profile & chromosome==zoom.chromosome]
zoom.gg <- ggplot()+
geom_point(aes(position/1e6, logratio),
data=zoom.pro,
shape=1)+
scale_y_continuous(
"logratio (approximate DNA copy number)")
print(zoom.gg)
Based on your visual interpretation of the plot above, you can create a set of labels.
The region for each label can be as small or as large as you want. Because the goal of learning is to minimize the number of incorrectly predicted labels, it is important to make sure that each label is a correct representation of the changepoint model you want. Avoid labeling any regions for which you are not sure about the label. Save your labels to a data.frame with the label type and positions, as below.
label <- function(min, max, annotation){
data.frame(
profile.id=paste(zoom.profile),
chromosome=paste(zoom.chromosome),
min, max, annotation)
}
zoom.labels <- rbind(# Exercise: change these values!
label(70e6, 90e6, "1change"),
label(20e6, 60e6, "normal"))
zoom.gg.lab <- zoom.gg+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, fill=annotation),
alpha=0.5,
color=NA,
data=zoom.labels)+
scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
The plot above shows the noisy data along with the labels. Note that it is possible to have multiple labels per data sequence, even though that is not the case for the neuroblastoma data set.
In this section we explain how to perform a supervised changepoint analysis on a labeled data set.
A typical supervised changepoint analysis consists of the following computations:
- penaltyLearning::labelError to compute the number of incorrect labels for each labeled segmentation problem and each changepoint model. The goal of learning is to minimize the number of incorrectly predicted labels.
- penaltyLearning::modelSelection to compute the exact path of models that will be selected for every possible non-negative penalty value.
- penaltyLearning::targetIntervals to compute a target interval of log(penalty) values that predicts the minimum number of incorrect labels for each segmentation problem. Create a target interval matrix (one row for each segmentation problem, 2 columns) which can be used as the output in the supervised machine learning problem.
- a feature matrix, computed via penaltyLearning::featureMatrix, or do it yourself using simple statistics of each segmentation problem (quantiles, mean, number of data points, estimated variance, etc).
- survival::survreg or penaltyLearning::IntervalRegressionCV to learn a penalty function.
- penaltyLearning::ROChange to compute ROC curves using labels in a test set (that were not used for training). The AUC (area under the ROC curve) and the percent incorrect labels in the test set can be used to evaluate the prediction accuracy of your model.
The mathematical details are described in our ICML’13 paper, Learning Sparse Penalties for Change-point Detection using Max Margin Interval Regression.
To begin we will perform the supervised changepoint computations on just one of the labeled neuroblastoma data sequences.
zoom.gg.lab <- ggplot()+
geom_point(aes(position/1e6, logratio),
data=zoom.pro,
shape=1)+
scale_y_continuous(
"logratio (approximate DNA copy number)")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, fill=annotation),
alpha=0.5,
color=NA,
data=zoom.labels)+
scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
We use the code below to compute a sequence of maximum likelihood Gaussian models with \(s\in\{1,\dots,7\}\) segments. In contrast, a typical unsupervised analysis computes only one changepoint model. In a supervised analysis we need to compute a sequence of changepoint models of increasing complexity, in order to determine which are consistent and inconsistent with the labels. Remember that the goal of machine learning is to predict a model with complexity consistent with the labels (minimizing the number of incorrectly predicted labels).
New function: Segmentor3IsBack::Segmentor
computes maximum likelihood segmentations from 1 to Kmax segments. Inputs are the data sequence to segment, the likelihood (model=2 means normal/Gaussian), and the maximum number of segments (Kmax). For a supervised changepoint analysis, Kmax should be rather large, but how do you know if it is large enough? You should check to make sure there is at least one model with a false positive (a labeled region with too many predicted changepoints), if that is possible (not possible for data sequences with only breakpoint labels). These false positives in the training labels/models are used to learn a penalty function that avoids predicting too many changepoints. More details below in the Outputs section.
max.segments <- 7
(fit <- Segmentor3IsBack::Segmentor(
zoom.pro$logratio.norm, model=2, Kmax=max.segments))
## Object of class Segmentor
##
## Model used for the segmentation:
## [1] "Normal"
##
## Maximum number of segments:
## [1] 7
##
## Compression factor used:
## [1] 1
##
## Matrix of breakpoints:
## 1th break 2th break 3th break 4th break 5th break 6th break 7th break
## 1 segment 76 0 0 0 0 0 0
## 2 segments 50 76 0 0 0 0 0
## 3 segments 1 50 76 0 0 0 0
## 4 segments 50 66 68 76 0 0 0
## 5 segments 1 50 66 68 76 0 0
## 6 segments 1 9 50 66 68 76 0
## 7 segments 1 50 54 58 66 68 76
##
## Parameter of each segment:
## num [1:7, 1:7] -0.1348 0.0391 -0.2653 0.0391 -0.2653 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:7] "1 segment" "2 segments" "3 segments" "4 segments" ...
## ..$ : chr [1:7] "1th parameter" "2th parameter" "3th parameter" "4th parameter" ...
##
## Likelihood of the segmentation
## [,1]
## 1 segment 74.59080
## 2 segments -51.78161
## 3 segments -54.48624
## 4 segments -61.02661
## 5 segments -63.73125
## 6 segments -65.10483
## 7 segments -67.54654
The Segmentor fit object is an S4 class with slots:
- breaks, an integer matrix of the end position of each segment.
- parameters, a numeric matrix of the mean of each segment.
Next, we convert the model to tidy format (a data.frame with one row per segment). The code below is specific to the Segmentor function; if you use a different changepoint detection package/function, you will have to write some other code to tidy the data.
zoom.segs.list <- list()
zoom.loss.vec <- rep(NA, max.segments)
for(n.segments in 1:max.segments){
end <- fit@breaks[n.segments, 1:n.segments]
data.before.change <- end[-n.segments]
data.after.change <- data.before.change+1
pos.before.change <- as.integer(
(zoom.pro$position[data.before.change]+
zoom.pro$position[data.after.change])/2)
start <- c(1, data.after.change)
chromStart <- c(zoom.pro$position[1], pos.before.change)
chromEnd <- c(pos.before.change, max(zoom.pro$position))
seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
zoom.segs.list[[n.segments]] <- data.table(
profile.id=paste(zoom.profile),
chromosome=paste(zoom.chromosome),
n.segments, # model complexity.
start, # in data points.
end,
chromStart, # in bases on chromosome.
chromEnd,
mean=seg.mean.vec)
data.mean.vec <- rep(seg.mean.vec, end-start+1)
stopifnot(length(data.mean.vec)==nrow(zoom.pro))
zoom.loss.vec[n.segments] <- sum((zoom.pro$logratio.norm-data.mean.vec)^2)
}
(zoom.segs <- do.call(rbind, zoom.segs.list))
## profile.id chromosome n.segments start end chromStart chromEnd mean
## 1: 4 14 1 1 76 20079729 106051083 -0.13477759
## 2: 4 14 2 1 50 20079729 76603452 0.03911700
## 3: 4 14 2 51 76 76603452 106051083 -0.46919028
## 4: 4 14 3 1 1 20079729 20448228 -0.26534457
## 5: 4 14 3 2 50 20448228 76603452 0.04533050
## 6: 4 14 3 51 76 76603452 106051083 -0.46919028
## 7: 4 14 4 1 50 20079729 76603452 0.03911700
## 8: 4 14 4 51 66 76603452 95013692 -0.48732984
## 9: 4 14 4 67 68 95013692 96042442 -0.08823781
## 10: 4 14 4 69 76 96042442 106051083 -0.52814926
## 11: 4 14 5 1 1 20079729 20448228 -0.26534457
## 12: 4 14 5 2 50 20448228 76603452 0.04533050
## 13: 4 14 5 51 66 76603452 95013692 -0.48732984
## 14: 4 14 5 67 68 95013692 96042442 -0.08823781
## 15: 4 14 5 69 76 96042442 106051083 -0.52814926
## 16: 4 14 6 1 1 20079729 20448228 -0.26534457
## 17: 4 14 6 2 9 20448228 25599644 -0.02555248
## 18: 4 14 6 10 50 25599644 76603452 0.05916133
## 19: 4 14 6 51 66 76603452 95013692 -0.48732984
## 20: 4 14 6 67 68 95013692 96042442 -0.08823781
## 21: 4 14 6 69 76 96042442 106051083 -0.52814926
## 22: 4 14 7 1 1 20079729 20448228 -0.26534457
## 23: 4 14 7 2 50 20448228 76603452 0.04533050
## 24: 4 14 7 51 54 76603452 80470604 -0.57770426
## 25: 4 14 7 55 58 80470604 85708989 -0.33481831
## 26: 4 14 7 59 66 85708989 95013692 -0.51839840
## 27: 4 14 7 67 68 95013692 96042442 -0.08823781
## 28: 4 14 7 69 76 96042442 106051083 -0.52814926
## profile.id chromosome n.segments start end chromStart chromEnd mean
Above we show the data.table of optimal segment means. Below we compute the data.table of predicted changepoint positions, then plot the models.
(zoom.changes <- zoom.segs[1 < start, data.table(
profile.id, chromosome, n.segments,
changepoint=chromStart)])
## profile.id chromosome n.segments changepoint
## 1: 4 14 2 76603452
## 2: 4 14 3 20448228
## 3: 4 14 3 76603452
## 4: 4 14 4 76603452
## 5: 4 14 4 95013692
## 6: 4 14 4 96042442
## 7: 4 14 5 20448228
## 8: 4 14 5 76603452
## 9: 4 14 5 95013692
## 10: 4 14 5 96042442
## 11: 4 14 6 20448228
## 12: 4 14 6 25599644
## 13: 4 14 6 76603452
## 14: 4 14 6 95013692
## 15: 4 14 6 96042442
## 16: 4 14 7 20448228
## 17: 4 14 7 76603452
## 18: 4 14 7 80470604
## 19: 4 14 7 85708989
## 20: 4 14 7 95013692
## 21: 4 14 7 96042442
## profile.id chromosome n.segments changepoint
zoom.gg.models <- zoom.gg.lab+
theme(
panel.spacing=grid::unit(0, "lines"),
panel.border=element_rect(fill=NA, color="grey50")
)+
facet_grid(n.segments ~ .)+
geom_vline(aes(
xintercept=changepoint/1e6),
data=zoom.changes,
color="green",
size=1,
linetype="dashed")
print(zoom.gg.models)
The plot above shows a panel for each model from 1 to 7 segments (0 to 6 changepoints). Each changepoint is drawn using a green dashed geom_vline.
For more details about the Segmentor function, read the references below. To save time during class we will skip to the next section, “Label error.”
For the Normal homoscedastic model, the loss \(L_{i,s}\) is the sum of squared residuals of the model with \(s\) segments.
A few references about the algorithm implemented in the Segmentor function:
- changepoint::cpt.mean(method="SegNeigh") computes the same optimal segmentations, but it is very slow for large \(d\), due to the quadratic time complexity.
- cghseg:::segmeanCO only works for the Gaussian/normal likelihood.
- Segmentor3IsBack::Segmentor implements the \(O(S_{\text{max}}d\log d)\) PDPA for Poisson, Normal homoscedastic (uniform variance – change in mean), Negative Binomial, Normal Heteroscedastic (uniform mean – change in variance), and Exponential likelihood models.
Exercise after class: instead of using Segmentor to compute a sequence of models, try using changepoint::cpt.mean(penalty="CROPS"), arXiv:1412.3617. Then re-do the rest of the analyses below. Another option is cghseg:::segmeanCO. Using any of these three packages you should be able to get the same results.
Next, we compute the number of incorrect labels for each model, using the penaltyLearning::labelError function that we saw above. We then plot the label errors.
zoom.models <- data.table(
profile.id=paste(zoom.profile),
chromosome=paste(zoom.chromosome),
loss=zoom.loss.vec,
n.segments=as.numeric(1:max.segments))
zoom.error.list <- penaltyLearning::labelError(
zoom.models,
zoom.labels,
zoom.changes,
change.var="changepoint",
problem.vars=c("profile.id", "chromosome"))
zoom.gg.models+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
data=zoom.error.list$label.errors,
fill=NA)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))
It is clear from the plot above that each model has a different number of labels which are correctly predicted. Models with too few changepoints cause false negatives, and models with too many changepoints cause false positives. However there is a subset of models with minimal incorrect labels. The goal of a supervised changepoint detection algorithm is to predict one of those models.
Rather than using a regression model to directly predict the integer-valued number of segments/changepoints, we will instead learn a regression model that predicts the real-valued log(penalty). The model selection function is a mapping from penalty values to model size (in segments/changepoints).
Notice that zoom.models contains a loss column, which is the sum of squared residuals of each model. Below we use that to compute the models that will be selected for every possible penalty value.
New function: in the code below we use the penaltyLearning::modelSelection function, which takes as input a data.frame with one row per changepoint model. There should be at least the following columns (but there can be others):
- the problem ID columns that identify each data sequence (here profile.id and chromosome).
- the model complexity, specified via the complexity argument, as below.
- the model loss, specified via the loss argument (the default is "loss", which is present in zoom.models, so there is no need to specify it in the code below).
The output is a data.frame with one row per model that is selected for at least one penalty value. There are columns for the min/max penalty that will select each model, as shown below.
print(zoom.models)
## profile.id chromosome loss n.segments
## 1: 4 14 5.5436090 1
## 2: 4 14 1.1240146 2
## 3: 4 14 1.0294260 3
## 4: 4 14 0.8006910 4
## 5: 4 14 0.7061024 5
## 6: 4 14 0.6580643 6
## 7: 4 14 0.5726711 7
(zoom.selection <- penaltyLearning::modelSelection(
zoom.models, complexity="n.segments"))
## min.lambda max.lambda min.log.lambda max.log.lambda cum.iterations profile.id chromosome
## 7 0.00000000 0.06671563 -Inf -2.707316 8 4 14
## 5 0.06671563 0.09458862 -2.707316 -2.358218 5 4 14
## 4 0.09458862 0.16166178 -2.358218 -1.822249 4 4 14
## 2 0.16166178 4.41959443 -1.822249 1.486048 1 4 14
## 1 4.41959443 Inf 1.486048 Inf 0 4 14
## loss n.segments
## 7 0.5726711 7
## 5 0.7061024 5
## 4 0.8006910 4
## 2 1.1240146 2
## 1 5.5436090 1
ggplot()+
geom_segment(aes(
min.log.lambda, n.segments,
xend=max.log.lambda, yend=n.segments),
size=1,
data=zoom.selection)+
xlab("log(penalty) = log(lambda)")+
scale_y_continuous(breaks=zoom.selection$n.segments)
We plot the model selection function in the figure above, which shows that it is a piecewise constant non-increasing function. Intuitively, the larger the penalty, the fewer the number of changepoints/segments (and vice versa). More concretely, let \(L_{i,s}\) be the loss (sum of squared residuals) of the model with \(s\in\{1,\dots,s_{\text{max}}\}\) segments for data sequence \(i\). The model selection function is defined for every non-negative penalty \(\lambda\geq 0\) as
\[ s_i^*(\lambda) = \text{arg min}_s L_{i,s} + \lambda s \]
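Using the loss values printed in zoom.models above, this selection function can be evaluated directly in base R. This is only a sketch of the idea: penaltyLearning::modelSelection computes the exact path of selected models, rather than evaluating one penalty at a time.

```r
# Evaluate s*(lambda) = argmin_s loss[s] + lambda * s,
# using the loss values from zoom.models shown above.
loss <- c(5.5436090, 1.1240146, 1.0294260, 0.8006910,
          0.7061024, 0.6580643, 0.5726711)
selected.segments <- function(lambda){
  which.min(loss + lambda * seq_along(loss))
}
selected.segments(0.01) # small penalty => the most complex model (7 segments).
selected.segments(10)   # large penalty => one segment (no changepoints).
```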
The output in the machine learning problem is an interval of log(penalty) values that achieves minimum incorrect labels. Remember that we have computed zoom.error.list$model.errors, which contains the number of incorrect labels \(e_i(s)\in\{0,1,2,\dots\}\) for this data sequence \(i\) and every number of segments \(s\). We can thus compute the label error as a function of the penalty, \[
E_i(\lambda) = e_i[s^*_i(\lambda)].
\]
In R this computation amounts to a join with zoom.selection (which came from penaltyLearning::modelSelection).
zoom.error.join <- zoom.error.list$model.errors[J(zoom.selection), on=list(
profile.id, chromosome, n.segments, loss)]
zoom.errors.tall <- data.table::melt(
zoom.error.join,
measure.vars=c("n.segments", "errors"))
## Warning in melt.data.table(zoom.error.join, measure.vars = c("n.segments", :
## 'measure.vars' [n.segments, errors, ...] are not all of the same type. By order of hierarchy, the
## molten data value column will be of type 'double'. All measure variables not of type 'double' will
## be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
zoom.gg.errors <- ggplot()+
geom_segment(aes(
min.log.lambda, value,
xend=max.log.lambda, yend=value),
size=1,
data=zoom.errors.tall)+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(variable ~ ., scales="free")+
scale_y_continuous("", breaks=0:max.segments)+
xlab("log(penalty) = log(lambda)")
print(zoom.gg.errors)
The code above plots the number of incorrect labels \(E_i(\lambda)\), clearly showing that it is a piecewise constant function that takes integer values (like the 01 loss in binary classification). The goal of the machine learning algorithm is to provide predictions that minimize this function (on test data).
Below we compute the output for the machine learning problem, a target interval of penalty values that yield changepoint models with minimal incorrect labels.
New function: the penaltyLearning::targetIntervals function computes the upper and lower limits of the target interval, which is the largest range of log(penalty) values that results in minimum incorrectly predicted labels. Its input parameter is a data.frame with one row per model (from penaltyLearning::modelSelection), with an additional column errors for the number of incorrect labels. The problem.vars argument indicates the data sequence ID columns. The function returns a data.table with one row per labeled data sequence, with columns for the lower and upper bounds of the target interval (min.log.lambda, max.log.lambda).
print(zoom.error.join)
## profile.id chromosome loss n.segments possible.fp fp possible.fn fn labels errors
## 1: 4 14 0.5726711 7 2 2 1 0 2 2
## 2: 4 14 0.7061024 5 2 1 1 0 2 1
## 3: 4 14 0.8006910 4 2 0 1 0 2 0
## 4: 4 14 1.1240146 2 2 0 1 0 2 0
## 5: 4 14 5.5436090 1 2 0 1 1 2 1
## 5 variables not shown: [min.lambda, max.lambda, min.log.lambda, max.log.lambda, cum.iterations]
(zoom.target <- penaltyLearning::targetIntervals(
zoom.error.join,
problem.vars=c("profile.id", "chromosome")))
## profile.id chromosome min.log.lambda max.log.lambda errors
## 1: 4 14 -2.358218 1.486048 0
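The computation can be sketched in base R using the penalty intervals and label errors shown above. In this example the minimal-error intervals happen to be contiguous, so the target interval is simply their union; the general penaltyLearning implementation handles more cases.

```r
# Sketch of the target interval computation (numbers from zoom.selection and
# zoom.error.join above; not the penaltyLearning implementation).
min.log.lambda <- c(-Inf, -2.707316, -2.358218, -1.822249, 1.486048)
max.log.lambda <- c(-2.707316, -2.358218, -1.822249, 1.486048, Inf)
errors <- c(2, 1, 0, 0, 1) # incorrect labels of the model selected on each interval

best.idx <- which(errors == min(errors))
# The minimal-error intervals here are adjacent, so their union is the target.
(target.min <- min.log.lambda[min(best.idx)]) # lower limit of target interval
(target.max <- max.log.lambda[max(best.idx)]) # upper limit of target interval
```

These limits match the zoom.target output above.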
zoom.target.tall <- data.table::melt(
zoom.target,
measure.vars=c("min.log.lambda", "max.log.lambda"),
variable.name="limit")[is.finite(value)]
zoom.gg.errors+
geom_text(aes(
ifelse(limit=="min.log.lambda", value-1, value+1),
errors+1,
label=paste(
"false",
ifelse(limit=="min.log.lambda", "positives", "negatives"),
"\ntoo",
ifelse(limit=="min.log.lambda", "many", "few"),
"changes"),
hjust=ifelse(
limit=="min.log.lambda", 0, 1)),
data=data.frame(zoom.target.tall, variable="errors"),
vjust=-1)+
geom_point(aes(
value,
errors,
fill=limit),
shape=21,
size=4,
data=data.frame(zoom.target.tall, variable="errors"))+
scale_fill_manual("limit", values=c(
min.log.lambda="black",
max.log.lambda="white"))+
theme(
legend.position="bottom",
legend.box="horizontal")
The plot above emphasizes the target interval, which is used as the output in the machine learning problem: predicting a log(penalty) below the lower limit yields too many changes (false positives), and predicting above the upper limit yields too few changes (false negatives).
The target interval is different for each labeled data sequence, as we show below. We will compute the target interval for each of the labeled data sequences in the six profiles from the beginning of the tutorial. The first step is to compute changepoints and model selection functions, which we do via a for loop over each data sequence below.
six.problems.dt <- unique(six.profiles[, list(
profile.id, chromosome, status.profile)])
setkey(six.profiles, profile.id, chromosome)
six.segs.list <- list()
six.selection.list <- list()
for(problem.i in 1:nrow(six.problems.dt)){
meta <- six.problems.dt[problem.i,]
pro <- six.profiles[meta]
max.segments <- min(nrow(pro), 10)
fit <- Segmentor3IsBack::Segmentor(
pro$logratio.norm, model=2, Kmax=max.segments)
rss.vec <- rep(NA, max.segments)
for(n.segments in 1:max.segments){
end <- fit@breaks[n.segments, 1:n.segments]
seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
data.before.change <- end[-n.segments]
data.after.change <- data.before.change+1
pos.before.change <- as.integer(
(pro$position[data.before.change]+pro$position[data.after.change])/2)
start <- c(1, data.after.change)
chromStart <- c(pro$position[1], pos.before.change)
chromEnd <- c(pos.before.change, max(pro$position))
data.mean.vec <- rep(seg.mean.vec, end-start+1)
rss.vec[n.segments] <- sum((pro$logratio.norm-data.mean.vec)^2)
six.segs.list[[paste(problem.i, n.segments)]] <- data.table(
meta,
n.segments,
start,
end,
chromStart,
chromEnd,
mean=seg.mean.vec)
}
loss.dt <- data.table(
meta,
n.segments=1:max.segments,
loss=rss.vec)
six.selection.list[[problem.i]] <- penaltyLearning::modelSelection(
loss.dt, complexity="n.segments")
}
six.selection <- do.call(rbind, six.selection.list)
six.segs <- do.call(rbind, six.segs.list)
six.changes <- six.segs[1 < start]
Inside of each iteration of the loop above, we used
- Segmentor3IsBack::Segmentor to compute the optimal changepoint models, saving the tidy result data.frame of segments to an element of six.segs.list.
- penaltyLearning::modelSelection to compute the model selection functions, saving the result to an element of six.selection.list.

Below, we use
- penaltyLearning::labelError to compute the number of incorrect labels for each segmentation model.
- penaltyLearning::targetIntervals to compute the target interval of penalty values for each labeled data sequence.

six.error.list <- penaltyLearning::labelError(
six.selection, six.labels, six.changes,
problem.vars=c("profile.id", "chromosome"))
six.targets <- penaltyLearning::targetIntervals(
six.error.list$model.errors,
problem.vars=c("profile.id", "chromosome"))
ymax.err <- 1.2
six.targets.tall <- data.table::melt(
six.targets,
measure.vars=c("min.log.lambda", "max.log.lambda"),
variable.name="limit",
value.name="log.lambda")[is.finite(log.lambda)]
six.gg.errors <- ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(profile.id ~ chromosome, labeller=function(df){
if("chromosome" %in% names(df))
df$chromosome <- paste0("chr", df$chromosome)
if("profile.id" %in% names(df))
df$profile.id <- paste("profile", df$profile.id)
df
})+
geom_segment(aes(
min.log.lambda, errors,
xend=max.log.lambda, yend=errors),
size=1,
data=six.error.list$model.errors)+
geom_point(aes(
log.lambda,
errors,
fill=limit),
shape=21,
size=3,
alpha=0.5,
data=six.targets.tall)+
scale_fill_manual("limit", values=c(
min.log.lambda="black",
max.log.lambda="white"))+
theme(
legend.position="bottom",
legend.box="horizontal")+
scale_y_continuous(
"incorrectly predicted labels",
limits=c(0,ymax.err),
breaks=c(0,1))+
scale_x_continuous(
"predicted log(penalty) = log(lambda)")
print(six.gg.errors)
The plot above shows the error function and target intervals for each labeled data sequence. Each column of panels/facets represents a different chromosome, and each row represents a different profile/patient. It is clear from the plot above that a model with minimal incorrect labels will predict a different penalty for each data sequence.
Question: discuss with your neighbor for 1 minute, why is there only one limit per data sequence? (only one target limit point per panel in the plot above)
Exercise 2 during class: create a set of labels via visual inspection, and visualize how changing the labels affects the computation of the target interval.
To train a machine learning model, we first need to compute a vector \(\mathbf x_i\in\mathbb R^p\) of exactly \(p\) features for each data sequence \(i\). We then learn a function \(f(\mathbf x_i)=\log\lambda_i\in\mathbb R\). To compute a feature matrix, use the code below.
New function: the penaltyLearning::featureMatrix
function computes a feature matrix for some data sequences. Inputs are a data.frame of data sequences (six.profiles
below), along with column names of IDs (problem.vars
) and data (data.var
). Output is a numeric matrix with one row per data sequence and one column per feature.
print(six.profiles)
## profile.id chromosome position logratio logratio.norm status.profile
## 1: 1 1 809681 0.4478436 0.4478436 relapse 1
## 2: 1 1 928433 0.4625759 0.4625759 relapse 1
## 3: 1 1 987423 0.5008021 0.5008021 relapse 1
## 4: 1 1 1083595 0.5762803 0.5762803 relapse 1
## 5: 1 1 1392490 0.4730076 0.4730076 relapse 1
## ---
## 21448: 11 Y 20840077 -2.6711635 -2.6711635 ok 11
## 21449: 11 Y 21743552 -1.1424170 -1.1424170 ok 11
## 21450: 11 Y 23148804 -1.3255393 -1.3255393 ok 11
## 21451: 11 Y 24282249 -3.8365013 -3.8365013 ok 11
## 21452: 11 Y 25746303 -1.9105018 -1.9105018 ok 11
feature.mat <- penaltyLearning::featureMatrix(
six.profiles,
problem.vars=c("profile.id", "chromosome"),
data.var="logratio")
str(feature.mat)
## num [1:144, 1:365] 474 250 191 154 184 163 129 126 118 137 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:144] "1 1" "1 2" "1 3" "1 4" ...
## ..$ : chr [1:365] "n.identity" "n.sqrt" "n.log" "n.loglog" ...
The feature matrix is 144 rows (one for each data sequence) by 365 columns/features. Which features should we use to predict log(penalty) values? If you have a lot of labels, you can use penaltyLearning::IntervalRegressionCV
which uses L1-regularization to automatically select relevant features – more details in the next section.
For now we will discuss the simpler unsupervised BIC/SIC model, which corresponds to solving the following changepoint optimization problem. \[
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i}
\ell(z_{ij}, m_j) +
\underbrace{(\log d_i)}_{\lambda_i}
\sum_{j=1}^{d_i-1}
I(m_{j} \neq m_{j+1}).
\] The BIC/SIC penalty is \(\lambda_i = \log d_i\), where \(d_i\) is the number of data points to segment in sequence \(i\). In the changepoint package version 2.2 this is penalty="SIC0"
.
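To see how a penalty value acts, consider a base-R sketch (toy loss values, not from the tutorial data) of how \(\lambda = \log d\) selects among models with different numbers of segments:

```r
d <- 200                        # number of data points in the sequence
loss <- c(20.0, 8.0, 7.5, 7.4)  # hypothetical loss for 1..4 segments
changes <- 0:3                  # changepoints = segments - 1
lambda <- log(d)                # the BIC/SIC penalty
crit <- loss + lambda * changes # penalized criterion
which.min(crit)                 # number of segments the penalty selects
## [1] 2
```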
In our framework, we can write the BIC/SIC criterion in terms of \(p=1\) feature, \(x_i = \log\log d_i\) – this is the column n.loglog
of the feature matrix. The unsupervised BIC/SIC penalty function is thus a linear model \(\log\lambda_i = f(x_i)=w x_i + \beta\) with intercept \(\beta=0\) and slope \(w=1\).
slope <- 1
intercept <- 0
bic.line.dt <- data.table(
slope, intercept, model="BIC/SIC")
six.BIC.dt <- six.profiles[, list(
feature=log(log(.N)),
model="BIC/SIC"
), by=list(profile.id, chromosome)]
six.BIC.dt[, pred.log.lambda := feature*slope + intercept]
The plot below visualizes the target interval limits in this feature space.
six.features.tall <- six.BIC.dt[six.targets.tall, on=list(
profile.id, chromosome)]
six.gg.limits <- ggplot()+
geom_point(aes(
feature, log.lambda, fill=limit),
shape=21,
data=six.features.tall)+
scale_fill_manual("limit", values=c(
min.log.lambda="black",
max.log.lambda="white"),
breaks=c("max.log.lambda", "min.log.lambda"))+
scale_x_continuous(
"feature")+
scale_y_continuous(
"log(penalty)=log(lambda)")
print(six.gg.limits)
In the plot above, each point represents a finite limit of a target interval. The goal of learning is to find a regression function which is below the white dots (upper limits) and above the black dots (lower limits). This is a machine learning problem called “regression with left, right, and interval censored outputs” or “interval regression” for short. Note that this is not the same problem as binary classification, for which the vertical axis is a second feature.
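To make the geometry concrete, here is a base-R sketch (hypothetical intervals and feature values) of checking whether a candidate affine penalty function is consistent with each target interval:

```r
f <- function(x, w=1, b=0) w*x + b   # candidate penalty function
lower <- c(-2.4, 0.5)                # black dots (lower limits)
upper <- c( 1.5, 3.0)                # white dots (upper limits)
x <- c(1.0, 2.0)                     # one feature per sequence
pred <- f(x)                         # predicted log(penalty)
## zero errors when every prediction is inside its target interval
all(lower < pred & pred < upper)
## [1] TRUE
```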
In this context we can plot the BIC/SIC model as a line with slope 1 and intercept 0, and we can visualize its errors using the vertical line segments below.
six.features.targets <- six.BIC.dt[six.targets, on=list(profile.id, chromosome)]
six.features.targets[, residual := 0]
six.features.targets[{
is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
penaltyLearning::targetIntervalResidual(
cbind(min.log.lambda, max.log.lambda), pred.log.lambda)
}]
model.colors <- c(
constant="black",
"BIC/SIC"="#1B9E77",#green
survreg="#D95F02",#orange
IntervalRegressionCV="#7570B3")#dark purple
six.gg.bic <- six.gg.limits+
scale_color_manual(values=model.colors)+
geom_abline(aes(
slope=slope, intercept=intercept, color=model),
size=1,
data=bic.line.dt)+
geom_segment(aes(
feature, pred.log.lambda,
xend=feature, yend=pred.log.lambda-residual,
color=model),
data=six.features.targets)
print(six.gg.bic)
New function: the penaltyLearning::targetIntervalResidual
function computes the residual of a predicted log(penalty) with respect to a target interval (zero if prediction is inside the interval, positive for penalty too large, negative for penalty too small). The first argument is a 2-column numeric matrix of target intervals, and the second argument is a numeric vector of predicted log(penalty) values. The output is a numeric vector of residuals, useful for showing the errors in regression plots such as the one above.
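The residual rule described above can be sketched in base R (the sign convention matches the description: positive when the predicted penalty is too large):

```r
interval.residual <- function(lower, upper, pred){
  ifelse(pred > upper, pred - upper,      # penalty too large
  ifelse(pred < lower, pred - lower, 0))  # penalty too small / inside
}
interval.residual(
  lower=c(-1, -Inf, 2),
  upper=c( 1,    0, Inf),
  pred =c( 2,   -1, 1))
## [1]  1  0 -1
```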
We can see from the plot above that there are 5 data sequences for which the BIC/SIC predicts a penalty value outside of the target interval (penalty too large = too few changepoints). It is clear that the number of errors could be reduced by simply decreasing the intercept (moving the line down). However there is no line in this feature space that perfectly separates the white and black points.
We can also visualize the number of incorrect labels for the BIC/SIC model by plotting the data, segmentation and labels below. Note that this BIC/SIC model was computed using Segmentor3IsBack::Segmentor
, but it yields exactly the same model as changepoint::cpt.mean
(compare with previous figure above).
six.BIC.dt[, pred := pred.log.lambda]
six.BIC.selection <- data.table(six.selection)[six.BIC.dt, on=list(
profile.id, chromosome,
min.log.lambda < pred,
max.log.lambda > pred)]
six.BIC.labels <- six.error.list$label.errors[six.BIC.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.changes <- six.changes[six.BIC.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.segs <- six.segs[six.BIC.selection, on=list(
profile.id, chromosome, n.segments)]
gg.supervised+
ggtitle("BIC/SIC model changepoints and label errors")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
fill=NA,
data=six.BIC.labels)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))+
geom_vline(aes(
xintercept=chromStart/1e6),
data=six.BIC.changes,
color="green",
size=1,
linetype="dashed")+
theme(legend.box="horizontal")
## Warning: Removed 65 rows containing missing values (geom_point).
Question: discuss with your neighbor for a minute, why are there only four false negatives in this plot, but there are five predictions outside the target interval in the previous plot with the regression line? Hint: look at the plot of the error as a function of penalty.
Another way to visualize the predicted penalty values is with the dots in the error function plot below.
six.BIC.errors <- six.error.list$model.errors[six.BIC.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.gg.errors.bic <- six.gg.errors+
geom_point(aes(
pred.log.lambda, errors, color=model),
size=2,
data=six.BIC.errors)+
scale_color_manual(values=model.colors)
print(six.gg.errors.bic)
It is clear from the plot above that for several of these labeled data sequences, the BIC/SIC predicted penalty is too large (too few changepoints), resulting in one or more incorrect labels.
Exercise after class: can you think of any other features which would be useful for predicting log(penalty) values? In other words, what features of a data sequence should be related to the number of changepoints we should predict? Some ideas are quantiles/mean/sd of the observed data points. Change the definition of feature
in six.BIC.dt
above, and re-do the plots in this section. Using your feature, is it possible to learn an affine function with fewer errors – are the white and black dots linearly separable?
Exercise after class: In the changepoint package version 2.2 penalty="SIC0"
means \(\lambda_i = \log d_i\) (same as the BIC/SIC we explored in this section). Verify that the predicted penalty values in this section are the same, by adding a geom_point
for predicted log(penalty) values to the plot with the regression line above. Hint: when we computed changepoint::cpt.mean
in the first part of the tutorial, we saved the predicted penalty in the pen.value
column of unsupervised.models
. Join that data.table with six.features.targets
to get the predicted penalty for the labeled data sequences.
Exercise after class: In the changepoint package version 2.2 penalty="SIC"
means \(\lambda_i = 2\log d_i\) (twice as much penalty as the BIC/SIC in this section). What is the slope/intercept of this model in the same feature space? Re-do the plots in this section. Is the model more or less accurate? To check your work, use a geom_point
with predictions from changepoint::cpt.mean
, as in the previous exercise.
In the previous section we explored the unsupervised BIC/SIC penalty, which uses 1 feature, and does not learn any parameters using the labels. In this section, we will explain two supervised penalty learning methods that can be used to increase changepoint detection accuracy.
We begin by using the same feature as the BIC/SIC penalty, but relaxing the assumption that the intercept \(\beta=0\) and slope \(w=1\). Instead of taking these parameter values for granted based on the theoretical arguments of the BIC/SIC penalty, we can use the survival::survreg
function to optimize these parameters based on the labels. The penalty function is \(\log\lambda_i = f( x_i) = \beta + w\log\log d_i\) which is equivalent to solving the following penalized changepoint detection problem, \[
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i}
\ell(z_{ij}, m_j)
+\underbrace{e^\beta(\log d_i)^w }_{\lambda_i}
\sum_{j=1}^{d_i-1}
I(m_{j} \neq m_{j+1}),
\] where the intercept \(\beta\) and the slope \(w\) are defined as the solution to the following optimization problem, \[
\operatorname{minimize}_{
\mathbf w\in\mathbb R^p,
\beta\in\mathbb R,
\sigma\in\mathbb R_+
}
\sum_{i=1}^n
-\log[
F(\overline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)-
F(\underline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)
],
\] where \(F\) is the cumulative distribution function of the normal distribution and \(\underline y_i, \overline y_i\) are the lower/upper limits of the target interval for sequence \(i\). Some remarks:
- here we use a single input feature \(x_i = \log\log d_i\), but survreg can be used with more features.
- the survreg Normal model is \(\log \lambda_i \sim N(\mathbf w^\intercal \mathbf x_i+\beta, \sigma^2)\); other distributions are available (dist argument).
- the iterative optimization algorithms that survreg uses are only guaranteed to reach a local minimum (not the globally most likely model).

To perform a fairer comparison, we will hold out chromosome 11 as a test set, and only train the model using the other labeled chromosomes.
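The censored Gaussian likelihood in the objective above can be sketched in base R, with pnorm playing the role of \(F\) (toy values, not the tutorial data):

```r
## Negative log-likelihood of one interval-censored output under a
## Gaussian model with the given mean and scale.
neg.log.lik <- function(lower, upper, mean, scale){
  -log(pnorm(upper, mean, scale) - pnorm(lower, mean, scale))
}
## A prediction centered in the target interval (-1, 1) has lower
## loss than one far outside it:
neg.log.lik(-1, 1, mean=0, scale=1) < neg.log.lik(-1, 1, mean=3, scale=1)
## [1] TRUE
```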
library(survival)
train.dt <- six.features.targets[chromosome!=11]
New function: the survival::survreg
function is used below with three arguments:
- a formula output ~ inputs as in other R modeling functions such as lm – the difference is that the left/output side must be a Surv object. When learning a penalty function for changepoint detection, we always use type="interval2" because all outputs (target intervals of log penalty values) are either left, right, or interval censored.
- a data.frame of training data (train.dt below).
- a distribution (dist="gaussian" below).

(fit.survreg <- survreg(
Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
train.dt, dist="gaussian"))
## Call:
## survreg(formula = Surv(min.log.lambda, max.log.lambda, type = "interval2") ~
## feature, data = train.dt, dist = "gaussian")
##
## Coefficients:
## (Intercept) feature
## -0.4788175 0.7032863
##
## Scale= 0.4312716
##
## Loglik(model)= -2.9 Loglik(intercept only)= -2.9
## Chisq= 0.01 on 1 degrees of freedom, p= 0.908
## n= 30
It is clear that the coefficients of the learned model are different from the BIC/SIC model (slope=1, intercept=0). We can visualize the regression line in the feature space via the code below:
survreg.line.dt <- data.table(t(coef(fit.survreg)), model="survreg")
six.survreg.pred <- data.table(six.BIC.dt)
six.survreg.pred[, model := "survreg"]
six.survreg.pred[, pred.log.lambda := predict(fit.survreg, six.survreg.pred)]
six.survreg.res <- six.survreg.pred[six.features.targets, on=list(
profile.id, chromosome)]
six.survreg.res[, residual := 0]
six.survreg.res[{
is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
penaltyLearning::targetIntervalResidual(
cbind(min.log.lambda, max.log.lambda), pred.log.lambda)
}]
six.gg.survreg <- six.gg.bic+
geom_abline(aes(
slope=feature, intercept=`(Intercept)`, color=model),
size=1,
data=survreg.line.dt)+
geom_segment(aes(
feature, pred.log.lambda,
xend=feature, yend=pred.log.lambda-residual,
color=model),
data=six.survreg.res)
print(six.gg.survreg)
It is clear from the plot above that the learned survreg penalty reduces the number of incorrectly predicted target intervals from 5 to 2.
The plot below shows the predicted errors and the smooth surrogate loss minimized by the gradient descent algorithm implemented by the survreg function (gradient descent cannot minimize the non-convex error function directly).
six.survreg.pred[, pred.log.penalty := pred.log.lambda]
six.survreg.selection <- data.table(six.selection)[six.survreg.pred, on=list(
profile.id, chromosome,
min.log.lambda < pred.log.lambda,
max.log.lambda > pred.log.lambda)]
six.survreg.errors <- six.error.list$model.errors[six.survreg.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
log.pen.vec <- six.error.list$model.errors[, seq(
min(max.log.lambda), max(min.log.lambda), l=100)]
six.survreg.surrogate <- six.survreg.res[, {
## For censored observations, survreg finds a linear function that
## predicts mean values that maximize the normal CDF (pnorm) --
## equivalent to minimizing the negative log likelihood below.
neg.log.lik <- -log(
pnorm(max.log.lambda, log.pen.vec, fit.survreg$scale)-
pnorm(min.log.lambda, log.pen.vec, fit.survreg$scale))
data.table(
model="survreg",
log.penalty=log.pen.vec,
surrogate.loss=neg.log.lik
)[neg.log.lik < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg <- six.gg.errors.bic+
geom_point(aes(
pred.log.penalty, errors, color=model),
size=2,
data=six.survreg.errors)+
geom_line(aes(
log.penalty, surrogate.loss, color=model),
data=six.survreg.surrogate)
print(six.gg.errors.survreg)
Note in the plot above:
- for several data sequences the survreg predicted penalties achieve 0 errors where the BIC/SIC has 1 error.
- for at least one data sequence the reverse holds: the BIC/SIC achieves 0 errors where survreg has 1 error.

We can also visualize the survreg model predictions by plotting the data sequences via the code below.
six.survreg.labels <- six.error.list$label.errors[six.survreg.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.changes <- six.changes[six.survreg.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.segs <- six.segs[six.survreg.selection, on=list(
profile.id, chromosome, n.segments)]
gg.supervised+
ggtitle("survreg 1 feature model changepoints and label errors")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
fill=NA,
data=six.survreg.labels)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))+
geom_vline(aes(
xintercept=chromStart/1e6),
data=six.survreg.changes,
color="green",
size=1,
linetype="dashed")+
theme(legend.box="horizontal")
## Warning: Removed 65 rows containing missing values (geom_point).
The false positive and false negative predictions are evident in the plot above.
Exercise 3 during class: the survreg model above assumed a Gaussian distribution for the log(penalty) values. Would using another distribution increase accuracy? Hint: try re-doing the analyses in this section with the survreg(dist="logistic")
model. For other choices, see help(survreg)
. Also try other input features from feature.mat
.
Exercise after class: for one of the neuroblastoma data sequences plotted above, verify that changepoint::cpt.mean(data.vec, penalty="Manual", pen.value=PREDICTED.PENALTY, method="PELT")
and fpop::Fpop(data.vec, lambda=PREDICTED.PENALTY)
yield the same changepoints. Fpop implements a “functional pruning optimal partitioning” algorithm which is guaranteed to prune more sub-optimal changepoints (i.e. be faster than) the PELT algorithm. Read On optimal multiple changepoint algorithms for large data for more info.
Exercise after class: try re-doing the analyses in this section using penaltyLearning::IntervalRegressionUnregularized
, which learns an un-regularized model by minimizing the squared hinge loss with respect to the target interval limits. Is there any difference in prediction accuracy?
In the previous sections, we assumed that the log(penalty) is an affine function of a given single feature. We chose that feature based on the theoretical arguments of the BIC/SIC penalty. In this section, we show how a multi-variate affine penalty function can be learned using L1 regularization. This technique is great when you don’t have a theoretical prior for the penalty. You can compute as many features as you like (365 using penaltyLearning::featureMatrix), and the L1 regularization will ignore any irrelevant features automatically (feature selection).
The penalty function we learn is \(\log\lambda_i = f(\mathbf x_i) = \beta + \mathbf w^\intercal \mathbf x_i\) which is equivalent to solving the following penalized changepoint detection problem, \[ \operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}} \sum_{j=1}^{d_i} \ell(z_{ij}, m_j) + \underbrace{\exp(\beta+\mathbf w^\intercal \mathbf x_i)}_{\lambda_i} \sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}), \] where the intercept \(\beta\) and the weight vector \(\mathbf w\) are defined as the solution to the following optimization problem, \[ \operatorname{minimize}_{ \mathbf w\in\mathbb R^p, \beta\in\mathbb R } \gamma ||\mathbf w||_1 + \frac{1}{n} \sum_{i=1}^n \phi(\mathbf w^\intercal \mathbf x_i+\beta-\underline y_i)+ \phi(\overline y_i-\mathbf w^\intercal \mathbf x_i-\beta) \] where \(\phi\) is a squared hinge loss (zero when its argument is large enough, i.e. when the prediction is on the correct side of the target limit, and quadratically increasing otherwise), \(\underline y_i, \overline y_i\) are the lower/upper limits of the target interval for sequence \(i\), and \(\gamma>0\) is a regularization parameter that controls the number of non-zero entries of \(\mathbf w\) (chosen below via cross-validation).
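A minimal base-R sketch of a squared hinge loss \(\phi\), assuming a margin of 1 (the exact margin used by penaltyLearning's squared.hinge, which appears later in the plotting code, is an assumption here):

```r
squared.hinge <- function(x, margin=1){
  ifelse(x < margin, (x - margin)^2, 0)  # zero loss past the margin
}
squared.hinge(c(-1, 0, 1, 2))
## [1] 4 1 0 0
```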
This is a machine learning technique that automatically selects features that result in good prediction accuracy. First we create a feature and target matrix for the data in the train set:
finite.feature.mat <- feature.mat[, apply(is.finite(feature.mat), 2, all)]
train.ids <- train.dt[, paste(profile.id, chromosome) ]
train.feature.mat <- finite.feature.mat[train.ids, ]
train.target.mat <- train.dt[, cbind(min.log.lambda, max.log.lambda)]
str(train.feature.mat)
## num [1:30, 1:256] 474 250 191 154 171 428 234 171 146 153 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:30] "1 1" "1 2" "1 3" "1 4" ...
## ..$ : chr [1:256] "n.identity" "n.sqrt" "n.log" "n.loglog" ...
The training set consists of 30 examples, each with 256 features. Before training, we register the multiprocess future plan (the internal cross-validation loop to choose the regularization parameter will be computed in parallel).
if(require(future)){
options(mc.cores=parallel::detectCores()/2)
plan(multiprocess)
}
We use set.seed
below in order to make the results reproducible (there is randomness in the fold assignment in the internal cross-validation loop). Then, we use IntervalRegressionCV
to fit the regularized linear model.
New function: the penaltyLearning::IntervalRegressionCV
function computes an L1-regularized interval regression model. Its two main inputs are a numeric input feature matrix (n observations x p features), and a numeric output target interval matrix (n observations x 2). Its return value is a model fit object of class IntervalRegression
, which has predict
, print
, plot
, and coef
methods.
set.seed(1)
any.finite <- apply(is.finite(train.target.mat), 1, any)
fit.cv <- penaltyLearning::IntervalRegressionCV(
train.feature.mat[any.finite,], train.target.mat[any.finite,])
## Loading required namespace: future.apply
## Loading required namespace: directlabels
coef(fit.cv)
## 0.001
## (Intercept) -6.171689e-01
## data orig.identity.sum 9.338593e-04
## data orig.identity.quantile.0% 2.832772e-02
## data orig.identity.quantile.100% 1.868335e-03
## data orig.square.sum 2.378681e-05
## data orig.square.quantile.0% -5.036387e-02
## data abs.identity.sum 2.088770e-04
## data abs.sqrt.sum 7.485475e-05
## data abs.sqrt.quantile.0% 1.199808e-02
## data abs.sqrt.quantile.50% 3.671291e-06
## data abs.log.sum 9.699457e-07
## data abs.log.quantile.25% 2.118603e-05
## data abs.log.quantile.50% 2.747159e-04
## data abs.square.sum 1.622015e-05
## data square.log.quantile.25% 1.059009e-05
## data square.log.quantile.50% 1.373699e-04
## residual orig.sqrt.quantile.100% 2.954781e-02
## residual orig.square.sum 1.472951e+26
## residual abs.identity.quantile.75% 2.959378e-02
## residual abs.square.quantile.75% -2.102896e-02
## residual abs.square.quantile.100% 1.492553e-03
## residual square.identity.quantile.75% -2.132852e-02
## residual square.identity.quantile.100% 1.492553e-03
## residual square.sqrt.quantile.75% 2.996788e-02
## diff orig.identity.sum -3.828749e-01
## diff orig.identity.mean -7.781112e-01
## diff orig.identity.quantile.0% -5.457965e-01
## diff orig.identity.quantile.50% -3.095173e+00
## diff orig.identity.quantile.100% -9.918066e-04
## diff orig.square.quantile.0% 6.242716e-01
## diff orig.square.quantile.25% 3.672376e-01
## diff orig.square.quantile.50% 3.306278e+00
## diff orig.square.quantile.100% -5.921755e-03
## diff abs.square.quantile.50% 2.513148e-02
## diff abs.square.quantile.100% 4.326546e-05
## diff square.identity.quantile.50% 3.748871e-01
## diff square.identity.quantile.100% 4.326546e-05
## diff square.square.mean 5.930345e+01
## diff square.square.sd 1.043750e+01
## diff square.square.quantile.25% 1.782259e+01
## diff square.square.quantile.50% 1.453184e+03
## diff square.square.quantile.75% 2.065200e+00
## diff square.square.quantile.100% 1.859673e-01
It is clear from the output above that even though the model was trained using 256 possible input features, it has selected only a few of them to use in the prediction function. We visualize the predicted penalty values as dots in the error curve plots below:
six.cv.pred <- data.table(six.survreg.pred)
six.cv.pred[, pred.log.penalty := {
pred.feat.mat <- finite.feature.mat[paste(profile.id, chromosome), ]
predict(fit.cv, pred.feat.mat)
}]
six.cv.pred[, model := "IntervalRegressionCV"]
six.cv.pred[, pred.log.lambda := pred.log.penalty]
six.cv.selection <- data.table(six.selection)[six.cv.pred, on=list(
profile.id, chromosome,
min.log.lambda < pred.log.lambda,
max.log.lambda > pred.log.lambda)]
six.cv.errors <- six.error.list$model.errors[six.cv.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.cv.surrogate <- six.survreg.res[, {
## IntervalRegressionCV minimizes the squared hinge loss.
data.table(
model="IntervalRegressionCV",
log.penalty=log.pen.vec,
surrogate.loss=penaltyLearning::squared.hinge(log.pen.vec-min.log.lambda)+
penaltyLearning::squared.hinge(max.log.lambda-log.pen.vec)
)[surrogate.loss < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg+
geom_point(aes(
pred.log.penalty, errors, color=model),
size=2,
data=six.cv.errors)+
geom_line(aes(
log.penalty, surrogate.loss, color=model),
data=six.cv.surrogate)
The plot above also shows the surrogate loss that IntervalRegressionCV
minimizes using a gradient descent algorithm. Note how it is very similar to the survreg
surrogate loss – both are smooth functions which increase quadratically as the predicted value goes past the target interval limit.
Finally, we can also visualize the IntervalRegressionCV
model in terms of the data sequences, labels, and predicted segmentation:
six.cv.labels <- six.error.list$label.errors[six.cv.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.cv.changes <- six.changes[six.cv.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.cv.segs <- six.segs[six.cv.selection, on=list(
profile.id, chromosome, n.segments)]
gg.supervised+
ggtitle("IntervalRegressionCV multi-feature model changepoints and label errors")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
fill=NA,
data=six.cv.labels)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))+
geom_vline(aes(
xintercept=chromStart/1e6),
data=six.cv.changes,
color="green",
size=1,
linetype="dashed")+
theme(legend.box="horizontal")
## Warning: Removed 65 rows containing missing values (geom_point).
The plot above clearly shows the advantage of the machine learning approach: all labels are correctly predicted, including all labels on the test set (chromosome 11).
Exercise after class: by default the IntervalRegressionCV
function chooses the regularization parameter with minimum error, as estimated using an internal cross-validation loop. Another option is to choose the least complex model within 1 standard deviation. Re-fit the model using IntervalRegressionCV(reg.type="1sd")
, view the error as a function of model complexity via plot(fit.cv)
, then re-do the model prediction error plots. Is this model more or less accurate? Why? Hint: use coef(fit.cv)
to see how many features are included in the model.
Since we did not use the chromosome 11 labels to train any of the models above, we can use those labels to evaluate the prediction accuracy of the learned models.
Below we create a list with an element for each of the three models we want to compare. Then we use a for
loop to compute ROC curves for each model, with respect to the test data on chromosome 11.
New function: the penaltyLearning::ROChange
function computes Receiver Operating Characteristic curves for penalty function learning problems. It inputs the model.errors
data.table from penaltyLearning::labelError
, and a data.table with predicted log(penalty) values in the pred.log.lambda
column. It outputs a list which describes the ROC curves. Below we use the roc
element which is a data.table with one row for each point on the ROC curve. We also use the thresholds
element, which is a data.table which has a row with the error rates for the predicted threshold. For more info read help(ROChange)
.
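As a rough sketch (toy counts, not the tutorial data), each point on such an ROC curve comes from adding a constant to all predicted log(penalty) values and recounting the resulting label errors:

```r
## Hypothetical false positive/negative counts as the added constant
## increases (larger penalties mean fewer changes: fp falls, fn rises).
fp <- c(3, 2, 1, 0); possible.fp <- 3
fn <- c(0, 0, 1, 3); possible.fn <- 3
FPR <- fp/possible.fp      # false positive rate
TPR <- 1 - fn/possible.fn  # true positive rate = tp/possible.tp
rbind(FPR, TPR)
```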
model.pred.list <- list(
IntervalRegressionCV=six.cv.pred,
"BIC/SIC"=six.BIC.dt,
survreg=six.survreg.pred)
roc.list <- list()
auc.list <- list()
for(model in names(model.pred.list)){
model.pred <- model.pred.list[[model]]
model.roc <- penaltyLearning::ROChange(
six.error.list$model.errors, model.pred[chromosome==11],
problem.vars=c("profile.id", "chromosome"))
roc.list[[model]] <- data.table(model, model.roc$roc)
auc.list[[model]] <- with(model.roc, data.table(
model, auc, thresholds[threshold=="predicted"]))
}
roc <- do.call(rbind, roc.list)
(auc <- do.call(rbind, auc.list))
## model auc threshold labels possible.fp possible.fn min.thresh max.thresh
## 1: IntervalRegressionCV 1.0000000 predicted 6 3 3 -2.3028026 0.3027037
## 2: BIC/SIC 0.8888889 predicted 6 3 3 -1.0752140 0.4493126
## 3: survreg 0.8888889 predicted 6 3 3 -0.1133423 1.4082383
## 8 variables not shown: [fp, fn, min.fp.fn, errors, FPR, tp, TPR, error.percent]
The table above shows the error/accuracy of the three models with respect to the labels in the test data (chromosome 11). It is clear that IntervalRegressionCV is the most accurate model (higher AUC and fewer errors). The ROC curves are shown below:
ggplot()+
scale_color_manual(values=model.colors)+
geom_path(aes(
FPR, TPR, color=model, linetype=model),
size=1,
data=roc)+
geom_point(aes(
FPR, TPR, color=model),
size=3,
data=auc)+
coord_equal()
However, with so few labels, and only one test set, it is difficult to tell whether there is any significant difference between models. Below we use K-fold cross-validation to evaluate the models more rigorously.
Since the neuroblastoma data set is relatively big (3418 labeled data sequences), we have pre-computed the Segmentor models, which we load using the code below. (To re-compute them on your computer, run Segmentor.models.R.)
if(!file.exists("Segmentor.models.RData")){
download.file(
"https://rcdata.nau.edu/genomic-ml/Segmentor.models.RData",
"Segmentor.models.RData")
}
load("Segmentor.models.RData")
Before performing cross-validation, we can compute the error functions, features and target intervals for all of the labeled data.
nb.selection <- Segmentor.models$loss[, {
penaltyLearning::modelSelection(
.SD, complexity="n.segments")
}, by=list(profile.id, chromosome)]
nb.error.list <- penaltyLearning::labelError(
nb.selection,
neuroblastoma$annotations,
Segmentor.models$segs[1 < start],
problem.vars=c("profile.id", "chromosome")
)
nb.target.dt <- penaltyLearning::targetIntervals(
nb.error.list$model.errors,
problem.vars=c("profile.id", "chromosome"))
labeled.profiles <- data.table(
neuroblastoma$profiles)[nb.target.dt[, list(
profile.id, chromosome)], on=list(
profile.id, chromosome)]
labeled.feature.mat <- penaltyLearning::featureMatrix(
labeled.profiles,
problem.vars=c("profile.id", "chromosome"),
data.var="logratio")
Rather than training on all features (which is possible but takes a bit too long for this tutorial), we will just use the following subset:
(some.feature.names <- c(
colnames(labeled.feature.mat)[1:5],
rownames(coef(fit.cv))[-1]))
## [1] "n.identity" "n.sqrt"
## [3] "n.log" "n.loglog"
## [5] "n.square" "data orig.identity.sum"
## [7] "data orig.identity.quantile.0%" "data orig.identity.quantile.100%"
## [9] "data orig.square.sum" "data orig.square.quantile.0%"
## [11] "data abs.identity.sum" "data abs.sqrt.sum"
## [13] "data abs.sqrt.quantile.0%" "data abs.sqrt.quantile.50%"
## [15] "data abs.log.sum" "data abs.log.quantile.25%"
## [17] "data abs.log.quantile.50%" "data abs.square.sum"
## [19] "data square.log.quantile.25%" "data square.log.quantile.50%"
## [21] "residual orig.sqrt.quantile.100%" "residual orig.square.sum"
## [23] "residual abs.identity.quantile.75%" "residual abs.square.quantile.75%"
## [25] "residual abs.square.quantile.100%" "residual square.identity.quantile.75%"
## [27] "residual square.identity.quantile.100%" "residual square.sqrt.quantile.75%"
## [29] "diff orig.identity.sum" "diff orig.identity.mean"
## [31] "diff orig.identity.quantile.0%" "diff orig.identity.quantile.50%"
## [33] "diff orig.identity.quantile.100%" "diff orig.square.quantile.0%"
## [35] "diff orig.square.quantile.25%" "diff orig.square.quantile.50%"
## [37] "diff orig.square.quantile.100%" "diff abs.square.quantile.50%"
## [39] "diff abs.square.quantile.100%" "diff square.identity.quantile.50%"
## [41] "diff square.identity.quantile.100%" "diff square.square.mean"
## [43] "diff square.square.sd" "diff square.square.quantile.25%"
## [45] "diff square.square.quantile.50%" "diff square.square.quantile.75%"
## [47] "diff square.square.quantile.100%"
Next we assign a fold to each target:
set.seed(1)
n.folds <- 5
nb.target.dt[, fold := sample(rep(1:n.folds, l=.N))]
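Since sample(rep(1:n.folds, l=.N)) permutes an approximately balanced fold vector, each fold should contain roughly the same number of labeled sequences. A quick optional sanity check:

```r
## count labeled sequences per fold; counts should differ by at most one
nb.target.dt[, .N, by=fold][order(fold)]
```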
Then we loop over test folds. In each iteration of the for loop, we create train.target.dt and train.features by excluding data in the test fold. We then train each model using these data, and compute prediction accuracy for each model using penaltyLearning::ROChange.
cv.roc.list <- list()
cv.auc.list <- list()
for(test.fold in 1:n.folds){
train.target.dt <- nb.target.dt[fold!=test.fold]
train.id.vec <- train.target.dt[, paste(profile.id, chromosome)]
train.features <- labeled.feature.mat[train.id.vec, ]
train.target.mat <- train.target.dt[, cbind(min.log.lambda, max.log.lambda)]
print(test.fold)
set.seed(1)
fit <- penaltyLearning::IntervalRegressionCV(
train.features[, some.feature.names], train.target.mat)
train.target.dt$feature <- train.features[, "n.loglog"]
fit.survreg <- survreg(
Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
train.target.dt, dist="gaussian")
## compute predictions
test.target.dt <- nb.target.dt[fold==test.fold]
test.id.vec <- test.target.dt[, paste(profile.id, chromosome)]
test.feature.vec <- labeled.feature.mat[test.id.vec, "n.loglog"]
  ## baseline: grid search for the constant penalty inside the most train target intervals
const.pen.vec <- seq(-4, 4, by=0.1)
const.acc.vec <- sapply(const.pen.vec, function(pred.penalty){
sum(train.target.mat[,1] < pred.penalty & pred.penalty < train.target.mat[,2])
})
best.constant <- const.pen.vec[which.max(const.acc.vec)]
pred.list <- list(
constant=rep(best.constant, length(test.feature.vec)),
"BIC/SIC"=test.feature.vec,
survreg=predict(fit.survreg, data.table(feature=test.feature.vec)),
IntervalRegressionCV=predict(fit, labeled.feature.mat[test.id.vec,]))
for(model in names(pred.list)){
pred.dt <- data.table(
test.target.dt,
pred.log.lambda=as.numeric(pred.list[[model]]))
pred.roc <- penaltyLearning::ROChange(
nb.error.list$model.errors,
pred.dt,
problem.vars=c("profile.id", "chromosome"))
cv.roc.list[[paste(test.fold, model)]] <- data.table(
test.fold, model, pred.roc$roc)
cv.auc.list[[paste(test.fold, model)]] <- with(pred.roc, data.table(
test.fold, model, auc, thresholds[threshold=="predicted"]))
}
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
cv.auc <- do.call(rbind, cv.auc.list)
cv.roc <- do.call(rbind, cv.roc.list)
Note that in the code above, we used cv.roc.list to save the ROC curve for each model and test fold, and cv.auc.list to save the AUC. We plot the ROC curves below in order to visualize what kinds of errors the models commit (false positives or false negatives).
ggplot()+
scale_color_manual(values=model.colors)+
geom_path(aes(
FPR, TPR, color=model, group=paste(model, test.fold)),
data=cv.roc)+
geom_point(aes(
FPR, TPR, color=model),
fill="white",
shape=21,
data=cv.auc)+
coord_equal(xlim=c(0, 0.5), ylim=c(0.5, 1))
It is clear from the ROC curves above that the BIC/SIC model has a relatively low true positive rate (high false negative rate: many breakpoint/positive labels with no predicted changepoints).
Below we compute and plot two accuracy metrics, AUC and percent incorrectly predicted labels.
cv.auc[, accuracy.percent := 100-error.percent]
cv.auc.tall <- melt(cv.auc, measure.vars=c("accuracy.percent", "auc"))
cv.means <- cv.auc.tall[, list(
mean=mean(value),
sd=sd(value)
), by=list(model, variable)]
levs <- cv.means[variable=="auc"][order(mean), model]
cv.auc.tall[, model.fac := factor(model, levs)]
ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(. ~ variable, scales="free")+
scale_color_manual(values=model.colors)+
guides(color="none")+
geom_point(aes(
value, model.fac, color=model),
data=cv.auc.tall)+
xlab("")+
ylab("model")
It is clear from the plot above that survreg is significantly more accurate than the BIC/SIC and constant penalties, and IntervalRegressionCV is slightly more accurate than survreg.
Remark 1: the accuracy rates above (BIC/SIC and constant penalties predict 92% correct labels whereas survreg and IntervalRegressionCV predict 97-98% correct labels) are specific to the neuroblastoma data set. However, in general you can expect that unsupervised penalty functions like BIC/SIC will be less accurate than supervised methods such as IntervalRegressionCV. This is the main motivation of supervised changepoint analysis: we can use labels to learn a penalty function with improved changepoint detection accuracy.
Remark 2: To learn the most accurate model, it is best to use all training data (instead of holding out some of the labels as a test set). The figure-neuroblastoma-predictions.R script fits an IntervalRegressionCV model to the entire neuroblastoma data set, uses the model to predict changepoints for all data sequences (even those without labels), then plots the predicted changepoints along with the train errors. Some overfitting/extrapolation errors are evident. For example some chrY sequences have too many changes — this could be fixed by adding some negative/normal labels on those sequences.
Conclusion: in labeled data sequences with a common signal/noise pattern, supervised learning algorithms can be used to predict penalty values and changepoints which are more accurate than unsupervised methods. An advantage of the supervised learning approach is that we can quantitatively evaluate the learned models by counting the number of incorrectly predicted labels in a held-out test set.
Exercise after class: is the logistic or Gaussian distribution a more accurate model for the log(penalty) values? In the code block above, add some code for computing the model predictions and error rates for survreg(dist="logistic"). Then re-do the plots in this section; there should be 4 models instead of 3 in the last plot.
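A hedged sketch of one way to do that exercise: inside the test-fold loop, fit a second survreg model with dist="logistic", then add its predictions to pred.list so the inner loop evaluates it alongside the other models. The model name survreg.logistic is an arbitrary choice.

```r
## Sketch only: runs inside the existing test-fold loop, after fit.survreg.
fit.logistic <- survreg(
  Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
  train.target.dt, dist="logistic")
## add a fourth model to pred.list before the inner for(model in ...) loop
pred.list$survreg.logistic <- predict(
  fit.logistic, data.table(feature=test.feature.vec))
```

Note that model.colors would also need an entry for the new model name, otherwise the scale_color_manual calls will drop it from the plots.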