See the GitHub repository for links to source code and exercises: https://github.com/tdhock/change-tutorial

Before executing the code in this tutorial Rmd, make sure to install the required packages:

packages.R <- if(file.exists("packages.R")){
  "packages.R"
}else{
  "https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R"
}
source(packages.R)
## Loading required package: survival
## Loading required package: data.table
## Loading required package: neuroblastoma
## Loading required package: future
## 
## Attaching package: 'future'
## The following object is masked from 'package:survival':
## 
##     cluster
## Loading required package: Segmentor3IsBack
## Segmentor3IsBack v2.0 Loaded
## Loading required package: changepoint
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Successfully loaded changepoint package version 2.3.1
##  NOTE: Predefined penalty values changed in version 2.2.  Previous penalty values with a postfix 1 i.e. SIC1 are now without i.e. SIC and previous penalties without a postfix i.e. SIC are now with a postfix 0 i.e. SIC0. See NEWS and help files for further details.
## Loading required package: directlabels
## Loading required package: ggplot2
## Loading required package: geometry
## Loading required package: penaltyLearning
options(width=100)

What is the difference between unsupervised and supervised changepoint detection?

There are two features of a data set that make it possible to do a supervised changepoint analysis:

The goal of the supervised analysis is to find a model that learns from the limited labeled data, and provides accurate changepoint predictions for all of the data (a different number of changepoints for each separate data sequence).

Neuroblastoma data set

We will use the neuroblastoma data set to explore supervised changepoint detection.

data(neuroblastoma, package="neuroblastoma")
str(neuroblastoma)
## List of 2
##  $ profiles   :'data.frame': 4616846 obs. of  4 variables:
##   ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 7 7 7 7 7 7 7 7 7 7 ...
##   ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ position  : int [1:4616846] 809681 928433 987423 1083595 1125548 1199359 1392490 1672814 1985786 2046695 ...
##   ..$ logratio  : num [1:4616846] -0.01742 0.06626 0.06764 0.04264 0.00576 ...
##  $ annotations:'data.frame': 3418 obs. of  5 variables:
##   ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 207 322 379 144 388 497 564 331 189 315 ...
##   ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 11 11 11 11 11 11 11 11 11 11 ...
##   ..$ min       : int [1:3418] 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 ...
##   ..$ max       : int [1:3418] 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 ...
##   ..$ annotation: Factor w/ 2 levels "breakpoint","normal": 2 2 2 2 2 2 1 2 2 2 ...
lapply(neuroblastoma, head, 10)
## $profiles
##    profile.id chromosome position     logratio
## 1           8          1   809681 -0.017417053
## 2           8          1   928433  0.066261442
## 3           8          1   987423  0.067638717
## 4           8          1  1083595  0.042644337
## 5           8          1  1125548  0.005759269
## 6           8          1  1199359  0.000000000
## 7           8          1  1392490  0.069014678
## 8           8          1  1672814  0.204140717
## 9           8          1  1985786  0.039840265
## 10          8          1  2046695  0.073134705
## 
## $annotations
##    profile.id chromosome      min       max annotation
## 1         213         11 53700000 135006516     normal
## 2         333         11 53700000 135006516     normal
## 3         393         11 53700000 135006516     normal
## 4         148         11 53700000 135006516     normal
## 5         402         11 53700000 135006516     normal
## 6         521         11 53700000 135006516     normal
## 7         590         11 53700000 135006516 breakpoint
## 8         342         11 53700000 135006516     normal
## 9         195         11 53700000 135006516     normal
## 10        326         11 53700000 135006516     normal

The neuroblastoma data set consists of two data.frames:

  • profiles contains the noisy data (logratio), which is approximate DNA copy number (measured at chromosome, position), for a particular patient (profile.id). Every unique combination of (profile.id, chromosome) defines a separate multiple changepoint detection problem. The logratio measures how many copies of each part of the human genome are present in the patient’s tumor. Normally there are two copies of each chromosome (logratio=0), but in cancer samples there can be gains (changes up) and losses (changes down) of specific regions or entire chromosomes.
  • annotations contains the labels, which indicate presence (breakpoint) or absence (normal) of changepoints in specific regions (profile.id, chromosome, min, max) of the data.

These data come from children who were treated at the Institut Curie (Paris, France). For six children we also have follow-up data on whether they recovered (ok) or had a relapse, several years after treatment:

followup <- data.frame(
  profile.id=paste(c(10, 8, 4, 6, 11, 1)),
  status=c("ok", "relapse", "relapse", "ok", "ok", "relapse"))
followup
##   profile.id  status
## 1         10      ok
## 2          8 relapse
## 3          4 relapse
## 4          6      ok
## 5         11      ok
## 6          1 relapse

Unsupervised: no labels, model selection via theory

When there are no labels, the input for a changepoint analysis just uses the noisy DNA copy number data in neuroblastoma$profiles$logratio. We begin by selecting the profiles of the six patients for which we have follow-up data.

rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
  some <- subset(all.profiles, paste(profile.id) %in% some.ids)
  status.profile <- followup[paste(some$profile.id), "status.profile"]
  some$status.profile <- ifelse(
    is.na(status.profile), paste(some$profile.id), status.profile)
  data.table(some)
}
neuroblastoma.dt <- data.table(neuroblastoma$profiles)
## neuroblastoma.dt[, logratio.norm := logratio/mad(logratio), by=list(
##   profile.id, chromosome)]
neuroblastoma.dt[, logratio.norm := logratio]
six.profiles <- someProfiles(neuroblastoma.dt)

For these six patients, the total number of separate changepoint detection problems is 24 chromosomes x 6 patients = 144 problems. We plot each problem below in a separate facet (panel).

library(ggplot2)
gg.unsupervised <- ggplot()+
  ggtitle("unsupervised changepoint detection = only noisy data sequences")+
  theme(
    panel.spacing=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
  geom_point(aes(position/1e6, logratio),
             data=six.profiles,
             shape=1)+
  scale_x_continuous(
    "position on chromosome (mega bases)",
    breaks=c(100, 200))+
  scale_y_continuous(
    "logratio (approximate DNA copy number)",
    limits=c(-1,1)*1.1)
print(gg.unsupervised)
## Warning: Removed 65 rows containing missing values (geom_point).

Note the facet titles on the right:

  • The top three profiles are “ok” because those children completely recovered.
  • The bottom three profiles are “relapse” because the neuroblastoma cancer came back, and required another treatment at the hospital.

Question: can you visually identify any differences between the “ok” and “relapse” profiles?

How to choose the number of changepoints? In the unsupervised setting, our only option is to use theoretical/statistical arguments, which give information criteria such as Aikaike Information Criterion (AIC, penalty=2) or Bayesian/Schwarz Information Criterion (BIC/SIC, penalty=log n). More on this later.

Supervised: learn a penalty function with minimal incorrect labels

In supervised changepoint detection, there are labels which indicate presence and absence of changepoints in particular data subsets. For the same set of 6 profiles, we superimpose the labels in the plot below.

six.labels <- someProfiles(neuroblastoma$annotations)
change.colors <- c(
  "1change"="#ff7d7d",
  breakpoint="#a445ee",
  normal="#f6c48f"
  )
gg.supervised <- gg.unsupervised+
  ggtitle(paste(
    "supervised changepoint detection = data + labels",
    "that indicate specific regions with or without changepoints"))+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=six.labels)+
  scale_fill_manual("label", values=change.colors)+
  theme(legend.position="bottom")
print(gg.supervised)
## Warning: Removed 65 rows containing missing values (geom_point).

New function: geom_tallrect is like geom_rect but the ymin is always the bottom of the panel and the ymax is always the top of the panel. These are useful for drawing labeled intervals of x values.

It is clear from this plot that the neuroblastoma data satisfy the two criteria which are necessary for a supervised changepoint analysis:

  • The data consist of several sequences with similar signal/noise patterns, which are treated as separate changepoint detection problems.
  • Some data sequences have labels which indicate specific regions where the model should predict presence or absence of changepoints.

As we will see later in the tutorial, the labels can be used for choosing the best model and parameters. The main idea is that the changepoint model should be consistent with the labels:

  • Every breakpoint/positive label is a region where the model should predict at least one changepoint. If the model predicts no changepoints in a region with a breakpoint label, that is considered a false negative.
  • Every normal/negative label is a region where the model should predict no changepoints. If the model predicts one or more changepoints in a region with a normal label, that is considered a false positive.
  • The overall goal is find a changepoint model that minimizes the number of incorrectly predicted labels, which is defined as the sum of false positives and false negatives.

Label error of an unsupervised changepoint model

In this section we show that typical unsupervised changepoint models result in prediction errors with respect to the labels in the neuroblastoma data. We begin by using the changepoint::cpt.mean function to compute an unsupervised changepoint model for each data sequence. To compute the number of incorrect labels later, make sure to save a description of the model in a “tidy” data.frame (observations on rows and variables on columns). In the code below we

  • fit the model inside by=list(status.profile, chromosome) which adds those two columns to the resulting data.table.
  • add the pen.name column to identify the model.
  • use the changes list column to save the changepoint positions (in the same units as the label positions: bases on the chromosome).
pen.name <- "SIC0"
(unsupervised.models <- six.profiles[, {
  fit.pelt <- changepoint::cpt.mean(
    logratio.norm, penalty=pen.name, method="PELT")
  end <- fit.pelt@cpts
  before.change <- end[-length(end)]
  after.change <- before.change+1L
  data.table(
    pen.name,
    pen.value=fit.pelt@pen.value,
    changes=list(
    as.integer((position[before.change]+position[after.change])/2)
    ))
}, by=list(profile.id, status.profile, chromosome)])
##      profile.id status.profile chromosome pen.name pen.value changes
##   1:          8      relapse 8          1     SIC0  6.013715        
##   2:          8      relapse 8          2     SIC0  5.375278        
##   3:          8      relapse 8          3     SIC0  5.081404        
##   4:          8      relapse 8          4     SIC0  4.882802        
##   5:          8      relapse 8          5     SIC0  5.164786        
##  ---                                                                
## 140:         10          ok 10         20     SIC0  4.828314        
## 141:         10          ok 10         21     SIC0  4.110874        
## 142:         10          ok 10         22     SIC0  4.543295        
## 143:         10          ok 10          X     SIC0  5.231109        
## 144:         10          ok 10          Y     SIC0  3.295837

Next, we convert the list column changes to tall format (one row for each changepoint).

(unsupervised.changes <- unsupervised.models[, data.table(
  change=changes[[1]]
), by=list(profile.id, status.profile, chromosome, pen.name)])
##     profile.id status.profile chromosome pen.name    change
##  1:          8      relapse 8          7     SIC0 141405948
##  2:          8      relapse 8         11     SIC0  70573642
##  3:          8      relapse 8         17     SIC0  29895917
##  4:          8      relapse 8          Y     SIC0   4459374
##  5:         11          ok 11          1     SIC0  32819999
##  6:         11          ok 11          Y     SIC0   5918094
##  7:         11          ok 11          Y     SIC0  11393013
##  8:         11          ok 11          Y     SIC0  23715526
##  9:          4      relapse 4          1     SIC0  59792500
## 10:          4      relapse 4          2     SIC0  45164625
## 11:          4      relapse 4          3     SIC0  69953902
## 12:          4      relapse 4         14     SIC0  76603452
## 13:          4      relapse 4         17     SIC0  41646489
## 14:          1      relapse 1          1     SIC0 212809180
## 15:          1      relapse 1          7     SIC0  60381530
## 16:          1      relapse 1         11     SIC0  80058339
## 17:          1      relapse 1          Y     SIC0   5918094
## 18:          1      relapse 1          Y     SIC0   9420652
## 19:          1      relapse 1          Y     SIC0  19070635

New function: we now introduce the penaltyLearning::labelError function, which computes the number of incorrect labels for a set of predicted changepoints. Its inputs are 3 data.frames:

  • models one row per changepoint model.
  • labels one row for each label, with columns for label location and type.
  • changes one row for each predicted changepoint, with a column for the predicted changepoint position.
unsupervised.error.list <- penaltyLearning::labelError(
  unsupervised.models, six.labels, unsupervised.changes,
  problem.vars=c("status.profile", "chromosome"),
  model.vars="pen.name",
  change.var="change")

The result of labelError is a list of two data tables:

  • model.errors has one row per (sequence,model), with the total number of incorrect labels. We use these totals in the ggtitle of the plot below.
  • label.errors has one row per (sequence,model,label), with the fp/fn status of each label. We use these to visualize false negative and false positive predictions using the geom_tallrect with aes(linetype=status) below.
gg.supervised+
  theme(legend.box="horizontal")+
  unsupervised.error.list$model.errors[, ggtitle(paste0(
    "Unsupervised ", pen.name, " penalty has ",
    sum(errors),
    " incorrect labels: ",
    sum(fp), " FP + ",
    sum(fn), " FN"))]+
  geom_vline(aes(
    xintercept=change/1e6),
    color="green",
    size=1,
    linetype="dashed",
    data=unsupervised.changes)+
  geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=unsupervised.error.list$label.errors)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))
## Warning: Removed 65 rows containing missing values (geom_point).