See the GitHub repository for links to source code and exercises: https://github.com/tdhock/change-tutorial
Before executing the code in this tutorial Rmd, make sure to install the required packages:
packages.R <- if(file.exists("packages.R")){
"packages.R"
}else{
"https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R"
}
source(packages.R)
## Loading required package: survival
## Loading required package: data.table
## Loading required package: neuroblastoma
## Loading required package: future
##
## Attaching package: 'future'
## The following object is masked from 'package:survival':
##
## cluster
## Loading required package: Segmentor3IsBack
## Segmentor3IsBack v2.0 Loaded
## Loading required package: changepoint
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Successfully loaded changepoint package version 2.3.1
## NOTE: Predefined penalty values changed in version 2.2. Previous penalty values with a postfix 1 i.e. SIC1 are now without i.e. SIC and previous penalties without a postfix i.e. SIC are now with a postfix 0 i.e. SIC0. See NEWS and help files for further details.
## Loading required package: directlabels
## Loading required package: ggplot2
## Loading required package: geometry
## Loading required package: penaltyLearning
options(width=100)
There are two features of a data set that make it possible to do a supervised changepoint analysis: several separate data sequences to segment, and labels that indicate the presence or absence of changepoints in particular subsets of those sequences.
The goal of the supervised analysis is to find a model that learns from the limited labeled data, and provides accurate changepoint predictions for all of the data (a different number of changepoints for each separate data sequence).
We will use the neuroblastoma data set to explore supervised changepoint detection.
data(neuroblastoma, package="neuroblastoma")
str(neuroblastoma)
## List of 2
## $ profiles :'data.frame': 4616846 obs. of 4 variables:
## ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 7 7 7 7 7 7 7 7 7 7 ...
## ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ position : int [1:4616846] 809681 928433 987423 1083595 1125548 1199359 1392490 1672814 1985786 2046695 ...
## ..$ logratio : num [1:4616846] -0.01742 0.06626 0.06764 0.04264 0.00576 ...
## $ annotations:'data.frame': 3418 obs. of 5 variables:
## ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 207 322 379 144 388 497 564 331 189 315 ...
## ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 11 11 11 11 11 11 11 11 11 11 ...
## ..$ min : int [1:3418] 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 ...
## ..$ max : int [1:3418] 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 ...
## ..$ annotation: Factor w/ 2 levels "breakpoint","normal": 2 2 2 2 2 2 1 2 2 2 ...
lapply(neuroblastoma, head, 10)
## $profiles
## profile.id chromosome position logratio
## 1 8 1 809681 -0.017417053
## 2 8 1 928433 0.066261442
## 3 8 1 987423 0.067638717
## 4 8 1 1083595 0.042644337
## 5 8 1 1125548 0.005759269
## 6 8 1 1199359 0.000000000
## 7 8 1 1392490 0.069014678
## 8 8 1 1672814 0.204140717
## 9 8 1 1985786 0.039840265
## 10 8 1 2046695 0.073134705
##
## $annotations
## profile.id chromosome min max annotation
## 1 213 11 53700000 135006516 normal
## 2 333 11 53700000 135006516 normal
## 3 393 11 53700000 135006516 normal
## 4 148 11 53700000 135006516 normal
## 5 402 11 53700000 135006516 normal
## 6 521 11 53700000 135006516 normal
## 7 590 11 53700000 135006516 breakpoint
## 8 342 11 53700000 135006516 normal
## 9 195 11 53700000 135006516 normal
## 10 326 11 53700000 135006516 normal
The neuroblastoma data set consists of two data.frames: profiles (the noisy data) and annotations (the labels).
These data come from children who were treated at the Institut Curie (Paris, France). For six children we also have follow-up data on whether they recovered (ok) or had a relapse, several years after treatment:
followup <- data.frame(
profile.id=paste(c(10, 8, 4, 6, 11, 1)),
status=c("ok", "relapse", "relapse", "ok", "ok", "relapse"))
followup
## profile.id status
## 1 10 ok
## 2 8 relapse
## 3 4 relapse
## 4 6 ok
## 5 11 ok
## 6 1 relapse
When there are no labels, the input for a changepoint analysis is just the noisy DNA copy number data in neuroblastoma$profiles$logratio. We begin by selecting the profiles of the six patients for which we have follow-up data.
rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
some <- subset(all.profiles, paste(profile.id) %in% some.ids)
status.profile <- followup[paste(some$profile.id), "status.profile"]
some$status.profile <- ifelse(
is.na(status.profile), paste(some$profile.id), status.profile)
data.table(some)
}
neuroblastoma.dt <- data.table(neuroblastoma$profiles)
## neuroblastoma.dt[, logratio.norm := logratio/mad(logratio), by=list(
## profile.id, chromosome)]
neuroblastoma.dt[, logratio.norm := logratio]
six.profiles <- someProfiles(neuroblastoma.dt)
For these six patients, the total number of separate changepoint detection problems is 24 chromosomes x 6 patients = 144 problems. We plot each problem below in a separate facet (panel).
library(ggplot2)
gg.unsupervised <- ggplot()+
ggtitle("unsupervised changepoint detection = only noisy data sequences")+
theme(
panel.spacing=grid::unit(0, "lines"),
panel.border=element_rect(fill=NA, color="grey50")
)+
facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
geom_point(aes(position/1e6, logratio),
data=six.profiles,
shape=1)+
scale_x_continuous(
"position on chromosome (mega bases)",
breaks=c(100, 200))+
scale_y_continuous(
"logratio (approximate DNA copy number)",
limits=c(-1,1)*1.1)
print(gg.unsupervised)
## Warning: Removed 65 rows containing missing values (geom_point).
Note the facet titles on the right, which show the status (ok/relapse) and profile ID of each of the six patients.
Question: can you visually identify any differences between the “ok” and “relapse” profiles?
How to choose the number of changepoints? In the unsupervised setting, our only option is to use theoretical/statistical arguments, which give information criteria such as the Akaike Information Criterion (AIC, penalty=2) or the Bayesian/Schwarz Information Criterion (BIC/SIC, penalty=log n). More on this later.
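As a sketch of how such an information criterion chooses the number of changepoints, the penalized criterion picks the number of segments s minimizing loss plus penalty times s. The loss values below are hypothetical, for illustration only.

```r
# Sketch of information-criterion model selection (hypothetical loss values).
# loss[s] is the cost (e.g. sum of squared residuals) of the model with s segments.
loss <- c(50, 20, 18, 17.5, 17.2)
n.data <- 100 # number of data points in the sequence

# Penalized criterion: choose s minimizing loss[s] + penalty * s.
select.segments <- function(loss, penalty){
  which.min(loss + penalty * seq_along(loss))
}

(aic.segments <- select.segments(loss, 2))           # AIC: penalty = 2
(bic.segments <- select.segments(loss, log(n.data))) # BIC/SIC: penalty = log(n)
```

For these hypothetical losses, both penalties select the same model size; with real data the two criteria often disagree.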
In supervised changepoint detection, there are labels which indicate presence and absence of changepoints in particular data subsets. For the same set of 6 profiles, we superimpose the labels in the plot below.
six.labels <- someProfiles(neuroblastoma$annotations)
change.colors <- c(
"1change"="#ff7d7d",
breakpoint="#a445ee",
normal="#f6c48f"
)
gg.supervised <- gg.unsupervised+
ggtitle(paste(
"supervised changepoint detection = data + labels",
"that indicate specific regions with or without changepoints"))+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, fill=annotation),
alpha=0.5,
color=NA,
data=six.labels)+
scale_fill_manual("label", values=change.colors)+
theme(legend.position="bottom")
print(gg.supervised)
## Warning: Removed 65 rows containing missing values (geom_point).
New function: geom_tallrect is like geom_rect, but ymin is always the bottom of the panel and ymax is always the top of the panel. These rectangles are useful for drawing labeled intervals of x values.
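If geom_tallrect is not available, a similar effect can be obtained from plain geom_rect by fixing ymin=-Inf and ymax=Inf, so the rectangle spans the full panel height. This is a sketch with made-up toy data, not the penaltyLearning implementation.

```r
# Emulating geom_tallrect with geom_rect: ymin=-Inf and ymax=Inf make the
# rectangle span the full panel height (toy data for illustration).
library(ggplot2)
toy.labels <- data.frame(xmin=2, xmax=4, annotation="normal")
toy.data <- data.frame(x=1:5, y=c(0.1, -0.2, 0.3, 0.0, 0.2))
p <- ggplot()+
  geom_rect(aes(xmin=xmin, xmax=xmax, fill=annotation),
            ymin=-Inf, ymax=Inf, alpha=0.5, color=NA,
            data=toy.labels)+
  geom_point(aes(x, y), shape=1, data=toy.data)
```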
It is clear from this plot that the neuroblastoma data satisfy the two criteria which are necessary for a supervised changepoint analysis: many separate data sequences, some of which have labels indicating regions with and without changepoints.
As we will see later in the tutorial, the labels can be used for choosing the best model and parameters. The main idea is that the changepoint model should be consistent with the labels: changepoints should be predicted inside breakpoint labels, and no changepoints should be predicted inside normal labels.
In this section we show that typical unsupervised changepoint models result in prediction errors with respect to the labels in the neuroblastoma data. We begin by using the changepoint::cpt.mean function to compute an unsupervised changepoint model for each data sequence. To compute the number of incorrect labels later, make sure to save a description of the model in a “tidy” data.frame (observations on rows and variables on columns). In the code below we use:
- by=list(status.profile, chromosome), which adds those two columns to the resulting data.table.
- a pen.name column to identify the model.
- a changes list column to save the changepoint positions (in the same units as the label positions: bases on the chromosome).
pen.name <- "SIC0"
(unsupervised.models <- six.profiles[, {
fit.pelt <- changepoint::cpt.mean(
logratio.norm, penalty=pen.name, method="PELT")
end <- fit.pelt@cpts
before.change <- end[-length(end)]
after.change <- before.change+1L
data.table(
pen.name,
pen.value=fit.pelt@pen.value,
changes=list(
as.integer((position[before.change]+position[after.change])/2)
))
}, by=list(profile.id, status.profile, chromosome)])
## profile.id status.profile chromosome pen.name pen.value changes
## 1: 8 relapse 8 1 SIC0 6.013715
## 2: 8 relapse 8 2 SIC0 5.375278
## 3: 8 relapse 8 3 SIC0 5.081404
## 4: 8 relapse 8 4 SIC0 4.882802
## 5: 8 relapse 8 5 SIC0 5.164786
## ---
## 140: 10 ok 10 20 SIC0 4.828314
## 141: 10 ok 10 21 SIC0 4.110874
## 142: 10 ok 10 22 SIC0 4.543295
## 143: 10 ok 10 X SIC0 5.231109
## 144: 10 ok 10 Y SIC0 3.295837
Next, we convert the list column changes to tall format (one row for each changepoint).
(unsupervised.changes <- unsupervised.models[, data.table(
change=changes[[1]]
), by=list(profile.id, status.profile, chromosome, pen.name)])
## profile.id status.profile chromosome pen.name change
## 1: 8 relapse 8 7 SIC0 141405948
## 2: 8 relapse 8 11 SIC0 70573642
## 3: 8 relapse 8 17 SIC0 29895917
## 4: 8 relapse 8 Y SIC0 4459374
## 5: 11 ok 11 1 SIC0 32819999
## 6: 11 ok 11 Y SIC0 5918094
## 7: 11 ok 11 Y SIC0 11393013
## 8: 11 ok 11 Y SIC0 23715526
## 9: 4 relapse 4 1 SIC0 59792500
## 10: 4 relapse 4 2 SIC0 45164625
## 11: 4 relapse 4 3 SIC0 69953902
## 12: 4 relapse 4 14 SIC0 76603452
## 13: 4 relapse 4 17 SIC0 41646489
## 14: 1 relapse 1 1 SIC0 212809180
## 15: 1 relapse 1 7 SIC0 60381530
## 16: 1 relapse 1 11 SIC0 80058339
## 17: 1 relapse 1 Y SIC0 5918094
## 18: 1 relapse 1 Y SIC0 9420652
## 19: 1 relapse 1 Y SIC0 19070635
New function: we now introduce the penaltyLearning::labelError function, which computes the number of incorrect labels for a set of predicted changepoints. Its inputs are 3 data.frames:
- models: one row per changepoint model.
- labels: one row for each label, with columns for label location and type.
- changes: one row for each predicted changepoint, with a column for the predicted changepoint position.
unsupervised.error.list <- penaltyLearning::labelError(
unsupervised.models, six.labels, unsupervised.changes,
problem.vars=c("status.profile", "chromosome"),
model.vars="pen.name",
change.var="change")
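The rule that labelError applies can be sketched in base R: a breakpoint label with no predicted changepoint inside it is a false negative, and a normal label with any predicted changepoint inside it is a false positive. The helper below is a simplified illustration (a hypothetical toy, not the penaltyLearning implementation).

```r
# Toy sketch of the label-error rule (not the penaltyLearning implementation).
# labels: data.frame with columns min, max, annotation ("breakpoint"/"normal").
# changes: numeric vector of predicted changepoint positions.
count.label.errors <- function(labels, changes){
  n.inside <- sapply(seq_len(nrow(labels)), function(i){
    sum(labels$min[i] < changes & changes < labels$max[i])
  })
  fn <- labels$annotation == "breakpoint" & n.inside == 0
  fp <- labels$annotation == "normal" & n.inside > 0
  data.frame(labels, n.inside, status = ifelse(
    fn, "false negative", ifelse(fp, "false positive", "correct")))
}

toy.labels <- data.frame(
  min = c(10, 50), max = c(30, 90),
  annotation = c("breakpoint", "normal"))
# One predicted change at 60: outside the breakpoint label (false negative),
# inside the normal label (false positive).
err <- count.label.errors(toy.labels, changes = c(60))
```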
The result of labelError is a list of two data tables:
- model.errors has one row per (sequence,model), with the total number of incorrect labels. We use these totals in the ggtitle of the plot below.
- label.errors has one row per (sequence,model,label), with the fp/fn status of each label. We use these to visualize false negative and false positive predictions using the geom_tallrect with aes(linetype=status) below.
gg.supervised+
theme(legend.box="horizontal")+
unsupervised.error.list$model.errors[, ggtitle(paste0(
"Unsupervised ", pen.name, " penalty has ",
sum(errors),
" incorrect labels: ",
sum(fp), " FP + ",
sum(fn), " FN"))]+
geom_vline(aes(
xintercept=change/1e6),
color="green",
size=1,
linetype="dashed",
data=unsupervised.changes)+
geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, linetype=status),
fill=NA,
data=unsupervised.error.list$label.errors)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))
## Warning: Removed 65 rows containing missing values (geom_point).
Note that in the plot above we also plotted the predicted changepoints using a green dashed geom_vline. Labels with inconsistent changepoints are shown with a black border in the geom_tallrect, using scale_linetype_manual to show the type of prediction error: a false negative label contains too few predicted changes, and a false positive label contains too many.
Remark: the result above (BIC/SIC predicts too few changepoints, resulting in false negatives) is specific to the neuroblastoma data set. However, in general you can expect that unsupervised penalty functions will commit changepoint detection errors which are visually obvious. This is the main motivation of supervised changepoint analysis: we can use labels to learn a penalty function with improved changepoint detection accuracy.
Exercise 1 during class: try changing the pen.name variable to another theoretically-motivated unsupervised penalty. Other options from help(cpt.mean) are SIC (without a zero), MBIC, AIC, and Hannan-Quinn. Can you find a penalty with improved changepoint detection accuracy? Also try changing the some.ids variable (to some profile IDs from 1 to 603) to see the changepoints and label errors for another set of profiles.
To quickly create labels for many data sequences, I recommend using a GUI in which you can use the mouse to draw label rectangles on the scatterplots. I wrote annotate_regions.py for labeling the neuroblastoma data set, and SegAnnDB for labeling general DNA copy number profile data. You may want to write another GUI for your particular data – make sure it supports visualizing the data sets and interactive labeling.
Even if you don’t have GUI software, you can always just plot the data in R, and then write the labels in the R code. Begin by plotting one data set by itself to see the details of the signal and noise.
zoom.profile <- 4 #Exercise: change this value!
zoom.chromosome <- 14 #Exercise: change this value!
zoom.pro <- neuroblastoma.dt[
profile.id==zoom.profile & chromosome==zoom.chromosome]
zoom.gg <- ggplot()+
geom_point(aes(position/1e6, logratio),
data=zoom.pro,
shape=1)+
scale_y_continuous(
"logratio (approximate DNA copy number)")
print(zoom.gg)
Based on your visual interpretation of the plot above, you can create a set of labels.
The region for each label can be as small or as large as you want. Because the goal of learning is to minimize the number of incorrectly predicted labels, it is important to make sure that each label is a correct representation of the changepoint model you want. Avoid labeling any regions for which you are not sure about the label. Save your labels to a data.frame with the label type and positions, as below.
label <- function(min, max, annotation){
data.frame(
profile.id=paste(zoom.profile),
chromosome=paste(zoom.chromosome),
min, max, annotation)
}
zoom.labels <- rbind(# Exercise: change these values!
label(70e6, 90e6, "1change"),
label(20e6, 60e6, "normal"))
zoom.gg.lab <- zoom.gg+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, fill=annotation),
alpha=0.5,
color=NA,
data=zoom.labels)+
scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
The plot above shows the noisy data along with the labels. Note that it is possible to have multiple labels per data sequence, even though that is not the case for the neuroblastoma data set.
In this section we explain how to perform a supervised changepoint analysis on a labeled data set.
A typical supervised changepoint analysis consists of the following computations:
- penaltyLearning::labelError to compute the number of incorrect labels for each labeled segmentation problem and each changepoint model. The goal of learning is to minimize the number of incorrectly predicted labels.
- penaltyLearning::modelSelection to compute the exact path of models that will be selected for every possible non-negative penalty value.
- penaltyLearning::targetIntervals to compute a target interval of log(penalty) values that predicts the minimum number of incorrect labels for each segmentation problem. Create a target interval matrix (one row for each segmentation problem, 2 columns) which can be used as the output in the supervised machine learning problem.
- a feature matrix, computed via penaltyLearning::featureMatrix, or do it yourself using simple statistics of each segmentation problem (quantiles, mean, number of data points, estimated variance, etc).
- survival::survreg or penaltyLearning::IntervalRegressionCV to learn a penalty function.
- penaltyLearning::ROChange to compute ROC curves using labels in a test set (that were not used for training). The AUC (area under the ROC curve) and the percent incorrect labels in the test set can be used to evaluate the prediction accuracy of your model.
The mathematical details are described in our ICML’13 paper, Learning Sparse Penalties for Change-point Detection using Max Margin Interval Regression.
To begin we will perform the supervised changepoint computations on just one of the labeled neuroblastoma data sequences.
zoom.gg.lab <- ggplot()+
geom_point(aes(position/1e6, logratio),
data=zoom.pro,
shape=1)+
scale_y_continuous(
"logratio (approximate DNA copy number)")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, fill=annotation),
alpha=0.5,
color=NA,
data=zoom.labels)+
scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
We use the code below to compute a sequence of maximum likelihood Gaussian models with \(s\in\{1,\dots,7\}\) segments. In contrast, a typical unsupervised analysis computes only one changepoint model. In a supervised analysis we need to compute a sequence of changepoint models of increasing complexity, in order to determine which are consistent and inconsistent with the labels. Remember that the goal of machine learning is to predict a model with complexity consistent with the labels (minimizing the number of incorrectly predicted labels).
New function: Segmentor3IsBack::Segmentor
computes maximum likelihood segmentations from 1 to Kmax segments. Inputs are the data sequence to segment, the likelihood (model=2 means normal/Gaussian), and the maximum number of segments (Kmax). For a supervised changepoint analysis, Kmax should be rather large, but how do you know if it is large enough? You should check to make sure there is at least one model with a false positive (a labeled region with too many predicted changepoints), if that is possible (not possible for data sequences with only breakpoint labels). These false positives in the training labels/models are used to learn a penalty function that avoids predicting too many changepoints. More details below in the Outputs section.
max.segments <- 7
(fit <- Segmentor3IsBack::Segmentor(
zoom.pro$logratio.norm, model=2, Kmax=max.segments))
## Object of class Segmentor
##
## Model used for the segmentation:
## [1] "Normal"
##
## Maximum number of segments:
## [1] 7
##
## Compression factor used:
## [1] 1
##
## Matrix of breakpoints:
## 1th break 2th break 3th break 4th break 5th break 6th break 7th break
## 1 segment 76 0 0 0 0 0 0
## 2 segments 50 76 0 0 0 0 0
## 3 segments 1 50 76 0 0 0 0
## 4 segments 50 66 68 76 0 0 0
## 5 segments 1 50 66 68 76 0 0
## 6 segments 1 9 50 66 68 76 0
## 7 segments 1 50 54 58 66 68 76
##
## Parameter of each segment:
## num [1:7, 1:7] -0.1348 0.0391 -0.2653 0.0391 -0.2653 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:7] "1 segment" "2 segments" "3 segments" "4 segments" ...
## ..$ : chr [1:7] "1th parameter" "2th parameter" "3th parameter" "4th parameter" ...
##
## Likelihood of the segmentation
## [,1]
## 1 segment 74.59080
## 2 segments -51.78161
## 3 segments -54.48624
## 4 segments -61.02661
## 5 segments -63.73125
## 6 segments -65.10483
## 7 segments -67.54654
The Segmentor fit object is an S4 class with slots:
- breaks, an integer matrix of the end position of each segment.
- parameters, a numeric matrix of the mean of each segment.
Next, we convert the model to tidy format (a data.frame with one row per segment). The code below is specific to the Segmentor function; if you use a different changepoint detection package/function, you will have to write some other code to tidy the data.
zoom.segs.list <- list()
zoom.loss.vec <- rep(NA, max.segments)
for(n.segments in 1:max.segments){
end <- fit@breaks[n.segments, 1:n.segments]
data.before.change <- end[-n.segments]
data.after.change <- data.before.change+1
pos.before.change <- as.integer(
(zoom.pro$position[data.before.change]+
zoom.pro$position[data.after.change])/2)
start <- c(1, data.after.change)
chromStart <- c(zoom.pro$position[1], pos.before.change)
chromEnd <- c(pos.before.change, max(zoom.pro$position))
seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
zoom.segs.list[[n.segments]] <- data.table(
profile.id=paste(zoom.profile),
chromosome=paste(zoom.chromosome),
n.segments, # model complexity.
start, # in data points.
end,
chromStart, # in bases on chromosome.
chromEnd,
mean=seg.mean.vec)
data.mean.vec <- rep(seg.mean.vec, end-start+1)
stopifnot(length(data.mean.vec)==nrow(zoom.pro))
zoom.loss.vec[n.segments] <- sum((zoom.pro$logratio.norm-data.mean.vec)^2)
}
(zoom.segs <- do.call(rbind, zoom.segs.list))
## profile.id chromosome n.segments start end chromStart chromEnd mean
## 1: 4 14 1 1 76 20079729 106051083 -0.13477759
## 2: 4 14 2 1 50 20079729 76603452 0.03911700
## 3: 4 14 2 51 76 76603452 106051083 -0.46919028
## 4: 4 14 3 1 1 20079729 20448228 -0.26534457
## 5: 4 14 3 2 50 20448228 76603452 0.04533050
## 6: 4 14 3 51 76 76603452 106051083 -0.46919028
## 7: 4 14 4 1 50 20079729 76603452 0.03911700
## 8: 4 14 4 51 66 76603452 95013692 -0.48732984
## 9: 4 14 4 67 68 95013692 96042442 -0.08823781
## 10: 4 14 4 69 76 96042442 106051083 -0.52814926
## 11: 4 14 5 1 1 20079729 20448228 -0.26534457
## 12: 4 14 5 2 50 20448228 76603452 0.04533050
## 13: 4 14 5 51 66 76603452 95013692 -0.48732984
## 14: 4 14 5 67 68 95013692 96042442 -0.08823781
## 15: 4 14 5 69 76 96042442 106051083 -0.52814926
## 16: 4 14 6 1 1 20079729 20448228 -0.26534457
## 17: 4 14 6 2 9 20448228 25599644 -0.02555248
## 18: 4 14 6 10 50 25599644 76603452 0.05916133
## 19: 4 14 6 51 66 76603452 95013692 -0.48732984
## 20: 4 14 6 67 68 95013692 96042442 -0.08823781
## 21: 4 14 6 69 76 96042442 106051083 -0.52814926
## 22: 4 14 7 1 1 20079729 20448228 -0.26534457
## 23: 4 14 7 2 50 20448228 76603452 0.04533050
## 24: 4 14 7 51 54 76603452 80470604 -0.57770426
## 25: 4 14 7 55 58 80470604 85708989 -0.33481831
## 26: 4 14 7 59 66 85708989 95013692 -0.51839840
## 27: 4 14 7 67 68 95013692 96042442 -0.08823781
## 28: 4 14 7 69 76 96042442 106051083 -0.52814926
## profile.id chromosome n.segments start end chromStart chromEnd mean
Above we show the data.table of optimal segment means. Below we compute the data.table of predicted changepoint positions, then plot the models.
(zoom.changes <- zoom.segs[1 < start, data.table(
profile.id, chromosome, n.segments,
changepoint=chromStart)])
## profile.id chromosome n.segments changepoint
## 1: 4 14 2 76603452
## 2: 4 14 3 20448228
## 3: 4 14 3 76603452
## 4: 4 14 4 76603452
## 5: 4 14 4 95013692
## 6: 4 14 4 96042442
## 7: 4 14 5 20448228
## 8: 4 14 5 76603452
## 9: 4 14 5 95013692
## 10: 4 14 5 96042442
## 11: 4 14 6 20448228
## 12: 4 14 6 25599644
## 13: 4 14 6 76603452
## 14: 4 14 6 95013692
## 15: 4 14 6 96042442
## 16: 4 14 7 20448228
## 17: 4 14 7 76603452
## 18: 4 14 7 80470604
## 19: 4 14 7 85708989
## 20: 4 14 7 95013692
## 21: 4 14 7 96042442
## profile.id chromosome n.segments changepoint
zoom.gg.models <- zoom.gg.lab+
theme(
panel.spacing=grid::unit(0, "lines"),
panel.border=element_rect(fill=NA, color="grey50")
)+
facet_grid(n.segments ~ .)+
geom_vline(aes(
xintercept=changepoint/1e6),
data=zoom.changes,
color="green",
size=1,
linetype="dashed")
print(zoom.gg.models)
The plot above shows a panel for each model from 1 to 7 segments (0 to 6 changepoints). Each changepoint is drawn using a green dashed geom_vline.
For more details about the Segmentor function, read the references below. To save time during class we will skip to the next section, “Label error.”
For the Normal homoscedastic model, the loss \(L_{i,s}\) is the sum of squared residuals of the model with \(s\) segments.
A few references about the algorithm implemented in the Segmentor function:
- changepoint::cpt.mean(method="SegNeigh") computes the same optimal segmentations, but it is very slow for large \(d\), due to the quadratic time complexity.
- cghseg:::segmeanCO only works for the Gaussian/normal likelihood.
- Segmentor3IsBack::Segmentor implements the \(O(S_{\text{max}}d\log d)\) PDPA for Poisson, Normal homoscedastic (uniform variance – change in mean), Negative Binomial, Normal Heteroscedastic (uniform mean – change in variance), and Exponential likelihood models.
Exercise after class: instead of using Segmentor to compute a sequence of models, try using changepoint::cpt.mean(penalty="CROPS"), arXiv:1412.3617. Then re-do the rest of the analyses below. Another option is cghseg:::segmeanCO. Using any of these three packages you should be able to get the same results.
Next, we compute the number of incorrect labels for each model, using the penaltyLearning::labelError function that we saw above. We then plot the label errors.
zoom.models <- data.table(
profile.id=paste(zoom.profile),
chromosome=paste(zoom.chromosome),
loss=zoom.loss.vec,
n.segments=as.numeric(1:max.segments))
zoom.error.list <- penaltyLearning::labelError(
zoom.models,
zoom.labels,
zoom.changes,
change.var="changepoint",
problem.vars=c("profile.id", "chromosome"))
zoom.gg.models+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
data=zoom.error.list$label.errors,
fill=NA)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))
It is clear from the plot above that each model has a different number of labels which are correctly predicted. Models with too few changepoints cause false negatives, and models with too many changepoints cause false positives. However there is a subset of models with minimal incorrect labels. The goal of a supervised changepoint detection algorithm is to predict one of those models.
Rather than using a regression model to directly predict the integer-valued number of segments/changepoints, we will instead learn a regression model that predicts the real-valued log(penalty). The model selection function is a mapping from penalty values to model size (in segments/changepoints).
Notice that zoom.models contains a loss column, which is the sum of squared residuals of each model. Below we use that to compute the models that will be selected for every possible penalty value.
New function: in the code below we use the penaltyLearning::modelSelection function, which takes as input a data.frame with one row per changepoint model. There should be at least the following columns (but there can be others):
- the problem ID columns that identify each data sequence (here profile.id and chromosome).
- the model complexity, specified via the complexity argument, as below.
- the model loss, specified via the loss argument (the default is "loss", which is present in zoom.models, so there is no need to specify it in the code below).
The output is a data.frame with one row per model that is selected for at least one penalty value. There are columns for the min/max penalty that will select each model, as shown below.
print(zoom.models)
## profile.id chromosome loss n.segments
## 1: 4 14 5.5436090 1
## 2: 4 14 1.1240146 2
## 3: 4 14 1.0294260 3
## 4: 4 14 0.8006910 4
## 5: 4 14 0.7061024 5
## 6: 4 14 0.6580643 6
## 7: 4 14 0.5726711 7
(zoom.selection <- penaltyLearning::modelSelection(
zoom.models, complexity="n.segments"))
## min.lambda max.lambda min.log.lambda max.log.lambda cum.iterations profile.id chromosome
## 7 0.00000000 0.06671563 -Inf -2.707316 8 4 14
## 5 0.06671563 0.09458862 -2.707316 -2.358218 5 4 14
## 4 0.09458862 0.16166178 -2.358218 -1.822249 4 4 14
## 2 0.16166178 4.41959443 -1.822249 1.486048 1 4 14
## 1 4.41959443 Inf 1.486048 Inf 0 4 14
## loss n.segments
## 7 0.5726711 7
## 5 0.7061024 5
## 4 0.8006910 4
## 2 1.1240146 2
## 1 5.5436090 1
ggplot()+
geom_segment(aes(
min.log.lambda, n.segments,
xend=max.log.lambda, yend=n.segments),
size=1,
data=zoom.selection)+
xlab("log(penalty) = log(lambda)")+
scale_y_continuous(breaks=zoom.selection$n.segments)
We plot the model selection function in the figure above, which shows that it is a piecewise constant non-increasing function. Intuitively, the larger the penalty, the fewer the number of changepoints/segments (and vice versa). More concretely, let \(L_{i,s}\) be the loss (sum of squared residuals) of the model with \(s\in\{1,\dots,s_{\text{max}}\}\) segments for data sequence \(i\). The model selection function is defined for every non-negative penalty \(\lambda\geq 0\) as
\[ s_i^*(\lambda) = \text{arg min}_s L_{i,s} + \lambda s \]
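Using the loss values printed in zoom.models above, this selection function can be evaluated directly in base R. This is only a sketch of the idea: penaltyLearning::modelSelection computes the exact path of selected models, rather than evaluating one penalty at a time.

```r
# Evaluate s*(lambda) = argmin_s loss[s] + lambda * s,
# using the loss values from zoom.models shown above.
loss <- c(5.5436090, 1.1240146, 1.0294260, 0.8006910,
          0.7061024, 0.6580643, 0.5726711)
selected.segments <- function(lambda){
  which.min(loss + lambda * seq_along(loss))
}
selected.segments(0.01) # small penalty => the most complex model (7 segments).
selected.segments(10)   # large penalty => one segment (no changepoints).
```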
The output in the machine learning problem is an interval of log(penalty) values that achieves minimum incorrect labels. Remember that we have computed zoom.error.list$model.errors, which contains the number of incorrect labels \(e_i(s)\in\{0,1,2,\dots\}\) for this data sequence \(i\) and every number of segments \(s\). We can thus compute the label error as a function of the penalty, \[
E_i(\lambda) = e_i[s^*_i(\lambda)].
\]
In R this computation amounts to a join with zoom.selection (which came from penaltyLearning::modelSelection).
zoom.error.join <- zoom.error.list$model.errors[J(zoom.selection), on=list(
profile.id, chromosome, n.segments, loss)]
zoom.errors.tall <- data.table::melt(
zoom.error.join,
measure.vars=c("n.segments", "errors"))
## Warning in melt.data.table(zoom.error.join, measure.vars = c("n.segments", :
## 'measure.vars' [n.segments, errors, ...] are not all of the same type. By order of hierarchy, the
## molten data value column will be of type 'double'. All measure variables not of type 'double' will
## be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
zoom.gg.errors <- ggplot()+
geom_segment(aes(
min.log.lambda, value,
xend=max.log.lambda, yend=value),
size=1,
data=zoom.errors.tall)+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(variable ~ ., scales="free")+
scale_y_continuous("", breaks=0:max.segments)+
xlab("log(penalty) = log(lambda)")
print(zoom.gg.errors)
The code above plots the number of incorrect labels \(E_i(\lambda)\), clearly showing that it is a piecewise constant function that takes integer values (like the 01 loss in binary classification). The goal of the machine learning algorithm is to provide predictions that minimize this function (on test data).
Below we compute the output for the machine learning problem, a target interval of penalty values that yield changepoint models with minimal incorrect labels.
New function: the penaltyLearning::targetIntervals function computes the upper and lower limits of the target interval, which is the largest range of log(penalty) values that results in minimum incorrectly predicted labels. Its input parameter is a data.frame with one row per model (from penaltyLearning::modelSelection), with an additional column errors for the number of incorrect labels. The problem.vars argument indicates the data sequence ID columns. The function returns a data.table with one row per labeled data sequence, with columns for the lower and upper bounds of the target interval (min.log.lambda, max.log.lambda).
print(zoom.error.join)
## profile.id chromosome loss n.segments possible.fp fp possible.fn fn labels errors
## 1: 4 14 0.5726711 7 2 2 1 0 2 2
## 2: 4 14 0.7061024 5 2 1 1 0 2 1
## 3: 4 14 0.8006910 4 2 0 1 0 2 0
## 4: 4 14 1.1240146 2 2 0 1 0 2 0
## 5: 4 14 5.5436090 1 2 0 1 1 2 1
## 5 variables not shown: [min.lambda, max.lambda, min.log.lambda, max.log.lambda, cum.iterations]
(zoom.target <- penaltyLearning::targetIntervals(
zoom.error.join,
problem.vars=c("profile.id", "chromosome")))
## profile.id chromosome min.log.lambda max.log.lambda errors
## 1: 4 14 -2.358218 1.486048 0
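The computation can be sketched in base R using the penalty intervals and label errors shown above. In this example the minimal-error intervals happen to be contiguous, so the target interval is simply their union; the general penaltyLearning implementation handles more cases.

```r
# Sketch of the target interval computation (numbers from zoom.selection and
# zoom.error.join above; not the penaltyLearning implementation).
min.log.lambda <- c(-Inf, -2.707316, -2.358218, -1.822249, 1.486048)
max.log.lambda <- c(-2.707316, -2.358218, -1.822249, 1.486048, Inf)
errors <- c(2, 1, 0, 0, 1) # incorrect labels of the model selected on each interval

best.idx <- which(errors == min(errors))
# The minimal-error intervals here are adjacent, so their union is the target.
(target.min <- min.log.lambda[min(best.idx)]) # lower limit of target interval
(target.max <- max.log.lambda[max(best.idx)]) # upper limit of target interval
```

These limits match the zoom.target output above.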
zoom.target.tall <- data.table::melt(
zoom.target,
measure.vars=c("min.log.lambda", "max.log.lambda"),
variable.name="limit")[is.finite(value)]
zoom.gg.errors+
geom_text(aes(
ifelse(limit=="min.log.lambda", value-1, value+1),
errors+1,
label=paste(
"false",
ifelse(limit=="min.log.lambda", "positives", "negatives"),
"\ntoo",
ifelse(limit=="min.log.lambda", "many", "few"),
"changes"),
hjust=ifelse(
limit=="min.log.lambda", 0, 1)),
data=data.frame(zoom.target.tall, variable="errors"),
vjust=-1)+
geom_point(aes(
value,
errors,
fill=limit),
shape=21,
size=4,
data=data.frame(zoom.target.tall, variable="errors"))+
scale_fill_manual("limit", values=c(
min.log.lambda="black",
max.log.lambda="white"))+
theme(
legend.position="bottom",
legend.box="horizontal")
The plot above emphasizes the target interval, which is used as the output in the machine learning problem: predicting a log(penalty) below the lower limit yields too many changes (false positives), and predicting above the upper limit yields too few changes (false negatives).
The target interval is different for each labeled data sequence, as we show below. We will compute the target interval for each of the labeled data sequences in the six profiles from the beginning of the tutorial. The first step is to compute changepoints and model selection functions, which we do via a for loop over each data sequence below.
six.problems.dt <- unique(six.profiles[, list(
profile.id, chromosome, status.profile)])
setkey(six.profiles, profile.id, chromosome)
six.segs.list <- list()
six.selection.list <- list()
for(problem.i in 1:nrow(six.problems.dt)){
meta <- six.problems.dt[problem.i,]
pro <- six.profiles[meta]
max.segments <- min(nrow(pro), 10)
fit <- Segmentor3IsBack::Segmentor(
pro$logratio.norm, model=2, Kmax=max.segments)
rss.vec <- rep(NA, max.segments)
for(n.segments in 1:max.segments){
end <- fit@breaks[n.segments, 1:n.segments]
seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
data.before.change <- end[-n.segments]
data.after.change <- data.before.change+1
pos.before.change <- as.integer(
(pro$position[data.before.change]+pro$position[data.after.change])/2)
start <- c(1, data.after.change)
chromStart <- c(pro$position[1], pos.before.change)
chromEnd <- c(pos.before.change, max(pro$position))
data.mean.vec <- rep(seg.mean.vec, end-start+1)
rss.vec[n.segments] <- sum((pro$logratio.norm-data.mean.vec)^2)
six.segs.list[[paste(problem.i, n.segments)]] <- data.table(
meta,
n.segments,
start,
end,
chromStart,
chromEnd,
mean=seg.mean.vec)
}
loss.dt <- data.table(
meta,
n.segments=1:max.segments,
loss=rss.vec)
six.selection.list[[problem.i]] <- penaltyLearning::modelSelection(
loss.dt, complexity="n.segments")
}
six.selection <- do.call(rbind, six.selection.list)
six.segs <- do.call(rbind, six.segs.list)
six.changes <- six.segs[1 < start]
Inside of each iteration of the loop above, we used
- Segmentor3IsBack::Segmentor to compute the optimal changepoint models, saving the tidy result data.frame of segments to an element of six.segs.list.
- penaltyLearning::modelSelection to compute the model selection functions, saving the result to an element of six.selection.list.

Below, we use
- penaltyLearning::labelError to compute the number of incorrect labels for each segmentation model.
- penaltyLearning::targetIntervals to compute the target interval of penalty values for each labeled data sequence.

six.error.list <- penaltyLearning::labelError(
six.selection, six.labels, six.changes,
problem.vars=c("profile.id", "chromosome"))
six.targets <- penaltyLearning::targetIntervals(
six.error.list$model.errors,
problem.vars=c("profile.id", "chromosome"))
ymax.err <- 1.2
six.targets.tall <- data.table::melt(
six.targets,
measure.vars=c("min.log.lambda", "max.log.lambda"),
variable.name="limit",
value.name="log.lambda")[is.finite(log.lambda)]
six.gg.errors <- ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(profile.id ~ chromosome, labeller=function(df){
if("chromosome" %in% names(df))
df$chromosome <- paste0("chr", df$chromosome)
if("profile.id" %in% names(df))
df$profile.id <- paste("profile", df$profile.id)
df
})+
geom_segment(aes(
min.log.lambda, errors,
xend=max.log.lambda, yend=errors),
size=1,
data=six.error.list$model.errors)+
geom_point(aes(
log.lambda,
errors,
fill=limit),
shape=21,
size=3,
alpha=0.5,
data=six.targets.tall)+
scale_fill_manual("limit", values=c(
min.log.lambda="black",
max.log.lambda="white"))+
theme(
legend.position="bottom",
legend.box="horizontal")+
scale_y_continuous(
"incorrectly predicted labels",
limits=c(0,ymax.err),
breaks=c(0,1))+
scale_x_continuous(
"predicted log(penalty) = log(lambda)")
print(six.gg.errors)
The plot above shows the error function and target intervals for each labeled data sequence. Each column of panels/facets represents a different chromosome, and each row represents a different profile/patient. It is clear from the plot above that a model with minimal incorrect labels will predict a different penalty for each data sequence.
Question: discuss with your neighbor for 1 minute, why is there only one limit per data sequence? (only one target limit point per panel in the plot above)
Exercise 2 during class: create a set of labels via visual inspection, and visualize how changing the labels affects the computation of the target interval.
To train a machine learning model, we first need to compute a vector \(\mathbf x_i\in\mathbb R^p\) of exactly \(p\) features for each data sequence \(i\). We then learn a function \(f(\mathbf x_i)=\log\lambda_i\in\mathbb R\). To compute a feature matrix, use the code below.
New function: the penaltyLearning::featureMatrix
function computes a feature matrix for some data sequences. Inputs are a data.frame of data sequences (six.profiles
below), along with column names of IDs (problem.vars
) and data (data.var
). Output is a numeric matrix with one row per data sequence and one column per feature.
print(six.profiles)
## profile.id chromosome position logratio logratio.norm status.profile
## 1: 1 1 809681 0.4478436 0.4478436 relapse 1
## 2: 1 1 928433 0.4625759 0.4625759 relapse 1
## 3: 1 1 987423 0.5008021 0.5008021 relapse 1
## 4: 1 1 1083595 0.5762803 0.5762803 relapse 1
## 5: 1 1 1392490 0.4730076 0.4730076 relapse 1
## ---
## 21448: 11 Y 20840077 -2.6711635 -2.6711635 ok 11
## 21449: 11 Y 21743552 -1.1424170 -1.1424170 ok 11
## 21450: 11 Y 23148804 -1.3255393 -1.3255393 ok 11
## 21451: 11 Y 24282249 -3.8365013 -3.8365013 ok 11
## 21452: 11 Y 25746303 -1.9105018 -1.9105018 ok 11
feature.mat <- penaltyLearning::featureMatrix(
six.profiles,
problem.vars=c("profile.id", "chromosome"),
data.var="logratio")
str(feature.mat)
## num [1:144, 1:365] 474 250 191 154 184 163 129 126 118 137 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:144] "1 1" "1 2" "1 3" "1 4" ...
## ..$ : chr [1:365] "n.identity" "n.sqrt" "n.log" "n.loglog" ...
The feature matrix is 144 rows (one for each data sequence) by 365 columns/features. Which features should we use to predict log(penalty) values? If you have a lot of labels, you can use penaltyLearning::IntervalRegressionCV
which uses L1-regularization to automatically select relevant features – more details in the next section.
For now we will discuss the simpler unsupervised BIC/SIC model, which corresponds to solving the following changepoint optimization problem. \[
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i}
\ell(z_{ij}, m_j) +
\underbrace{(\log d_i)}_{\lambda_i}
\sum_{j=1}^{d_i-1}
I(m_{j} \neq m_{j+1}).
\] The BIC/SIC penalty is \(\lambda_i = \log d_i\), where \(d_i\) is the number of data points to segment in sequence \(i\). In the changepoint package version 2.2 this is penalty="SIC0"
.
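To see how a penalty value acts, consider a base-R sketch (toy loss values, not from the tutorial data) of how \(\lambda = \log d\) selects among models with different numbers of segments:

```r
d <- 200                        # number of data points in the sequence
loss <- c(20.0, 8.0, 7.5, 7.4)  # hypothetical loss for 1..4 segments
changes <- 0:3                  # changepoints = segments - 1
lambda <- log(d)                # the BIC/SIC penalty
crit <- loss + lambda * changes # penalized criterion
which.min(crit)                 # number of segments the penalty selects
## [1] 2
```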
In our framework, we can write the BIC/SIC criterion in terms of \(p=1\) feature, \(x_i = \log\log d_i\) – this is the column n.loglog
of the feature matrix. The unsupervised BIC/SIC penalty function is thus a linear model \(\log\lambda_i = f(x_i)=w x_i + \beta\) with intercept \(\beta=0\) and slope \(w=1\).
slope <- 1
intercept <- 0
bic.line.dt <- data.table(
slope, intercept, model="BIC/SIC")
six.BIC.dt <- six.profiles[, list(
feature=log(log(.N)),
model="BIC/SIC"
), by=list(profile.id, chromosome)]
six.BIC.dt[, pred.log.lambda := feature*slope + intercept]
The plot below visualizes the target interval limits in this feature space.
six.features.tall <- six.BIC.dt[six.targets.tall, on=list(
profile.id, chromosome)]
six.gg.limits <- ggplot()+
geom_point(aes(
feature, log.lambda, fill=limit),
shape=21,
data=six.features.tall)+
scale_fill_manual("limit", values=c(
min.log.lambda="black",
max.log.lambda="white"),
breaks=c("max.log.lambda", "min.log.lambda"))+
scale_x_continuous(
"feature")+
scale_y_continuous(
"log(penalty)=log(lambda)")
print(six.gg.limits)
In the plot above, each point represents a finite limit of a target interval. The goal of learning is to find a regression function which is below the white dots (upper limits) and above the black dots (lower limits). This is a machine learning problem called “regression with left, right, and interval censored outputs” or “interval regression” for short. Note that this is not the same problem as binary classification, for which the vertical axis is a second feature.
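To make the geometry concrete, here is a base-R sketch (hypothetical intervals and feature values) of checking whether a candidate affine penalty function is consistent with each target interval:

```r
f <- function(x, w=1, b=0) w*x + b   # candidate penalty function
lower <- c(-2.4, 0.5)                # black dots (lower limits)
upper <- c( 1.5, 3.0)                # white dots (upper limits)
x <- c(1.0, 2.0)                     # one feature per sequence
pred <- f(x)                         # predicted log(penalty)
## zero errors when every prediction is inside its target interval
all(lower < pred & pred < upper)
## [1] TRUE
```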
In this context we can plot the BIC/SIC model as a line with slope 1 and intercept 0, and we can visualize its errors using the vertical line segments below.
six.features.targets <- six.BIC.dt[six.targets, on=list(profile.id, chromosome)]
six.features.targets[, residual := 0]
six.features.targets[{
is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
penaltyLearning::targetIntervalResidual(
cbind(min.log.lambda, max.log.lambda), pred.log.lambda)
}]
model.colors <- c(
constant="black",
"BIC/SIC"="#1B9E77",#green
survreg="#D95F02",#orange
IntervalRegressionCV="#7570B3")#dark purple
six.gg.bic <- six.gg.limits+
scale_color_manual(values=model.colors)+
geom_abline(aes(
slope=slope, intercept=intercept, color=model),
size=1,
data=bic.line.dt)+
geom_segment(aes(
feature, pred.log.lambda,
xend=feature, yend=pred.log.lambda-residual,
color=model),
data=six.features.targets)
print(six.gg.bic)
New function: the penaltyLearning::targetIntervalResidual
function computes the residual of a predicted log(penalty) with respect to a target interval (zero if prediction is inside the interval, positive for penalty too large, negative for penalty too small). The first argument is a 2-column numeric matrix of target intervals, and the second argument is a numeric vector of predicted log(penalty) values. The output is a numeric vector of residuals, useful for showing the errors in regression plots such as the one above.
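The residual rule described above can be sketched in base R (the sign convention matches the description: positive when the predicted penalty is too large):

```r
interval.residual <- function(lower, upper, pred){
  ifelse(pred > upper, pred - upper,      # penalty too large
  ifelse(pred < lower, pred - lower, 0))  # penalty too small / inside
}
interval.residual(
  lower=c(-1, -Inf, 2),
  upper=c( 1,    0, Inf),
  pred =c( 2,   -1, 1))
## [1]  1  0 -1
```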
We can see from the plot above that there are 5 data sequences for which the BIC/SIC predicts a penalty value outside of the target interval (penalty too large = too few changepoints). It is clear that the number of errors could be reduced by simply decreasing the intercept (moving the line down). However there is no line in this feature space that perfectly separates the white and black points.
We can also visualize the number of incorrect labels for the BIC/SIC model by plotting the data, segmentation and labels below. Note that this BIC/SIC model was computed using Segmentor3IsBack::Segmentor
, but it yields exactly the same model as changepoint::cpt.mean
(compare with previous figure above).
six.BIC.dt[, pred := pred.log.lambda]
six.BIC.selection <- data.table(six.selection)[six.BIC.dt, on=list(
profile.id, chromosome,
min.log.lambda < pred,
max.log.lambda > pred)]
six.BIC.labels <- six.error.list$label.errors[six.BIC.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.changes <- six.changes[six.BIC.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.segs <- six.segs[six.BIC.selection, on=list(
profile.id, chromosome, n.segments)]
gg.supervised+
ggtitle("BIC/SIC model changepoints and label errors")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
fill=NA,
data=six.BIC.labels)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))+
geom_vline(aes(
xintercept=chromStart/1e6),
data=six.BIC.changes,
color="green",
size=1,
linetype="dashed")+
theme(legend.box="horizontal")
## Warning: Removed 65 rows containing missing values (geom_point).
Question: discuss with your neighbor for a minute, why are there only four false negatives in this plot, but there are five predictions outside the target interval in the previous plot with the regression line? Hint: look at the plot of the error as a function of penalty.
Another way to visualize the predicted penalty values is with the dots in the error function plot below.
six.BIC.errors <- six.error.list$model.errors[six.BIC.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.gg.errors.bic <- six.gg.errors+
geom_point(aes(
pred.log.lambda, errors, color=model),
size=2,
data=six.BIC.errors)+
scale_color_manual(values=model.colors)
print(six.gg.errors.bic)
It is clear from the plot above that for several of these labeled data sequences, the BIC/SIC predicted penalty is too large (too few changepoints), resulting in one or more incorrect labels.
Exercise after class: can you think of any other features which would be useful for predicting log(penalty) values? In other words, what features of a data sequence should be related to the number of changepoints we should predict? Some ideas are quantiles/mean/sd of the observed data points. Change the definition of feature
in six.BIC.dt
above, and re-do the plots in this section. Using your feature, is it possible to learn an affine function with fewer errors – are the white and black dots linearly separable?
Exercise after class: In the changepoint package version 2.2 penalty="SIC0"
means \(\lambda_i = \log d_i\) (same as the BIC/SIC we explored in this section). Verify that the predicted penalty values in this section are the same, by adding a geom_point
for predicted log(penalty) values to the plot with the regression line above. Hint: when we computed changepoint::cpt.mean
in the first part of the tutorial, we saved the predicted penalty in the pen.value
column of unsupervised.models
. Join that data.table with six.features.targets
to get the predicted penalty for the labeled data sequences.
Exercise after class: In the changepoint package version 2.2 penalty="SIC"
means \(\lambda_i = 2\log d_i\) (twice as much penalty as the BIC/SIC in this section). What is the slope/intercept of this model in the same feature space? Re-do the plots in this section. Is the model more or less accurate? To check your work, use a geom_point
with predictions from changepoint::cpt.mean
, as in the previous exercise.
In the previous section we explored the unsupervised BIC/SIC penalty, which uses 1 feature, and does not learn any parameters using the labels. In this section, we will explain two supervised penalty learning methods that can be used to increase changepoint detection accuracy.
We begin by using the same feature as the BIC/SIC penalty, but relaxing the assumption that the intercept \(\beta=0\) and slope \(w=1\). Instead of taking these parameter values for granted based on the theoretical arguments of the BIC/SIC penalty, we can use the survival::survreg
function to optimize these parameters based on the labels. The penalty function is \(\log\lambda_i = f( x_i) = \beta + w\log\log d_i\) which is equivalent to solving the following penalized changepoint detection problem, \[
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i}
\ell(z_{ij}, m_j)
+\underbrace{e^\beta(\log d_i)^w }_{\lambda_i}
\sum_{j=1}^{d_i-1}
I(m_{j} \neq m_{j+1}),
\] where the intercept \(\beta\) and the slope \(w\) are defined as the solution to the following optimization problem, \[
\operatorname{minimize}_{
\mathbf w\in\mathbb R^p,
\beta\in\mathbb R,
\sigma\in\mathbb R_+
}
\sum_{i=1}^n
-\log[
F(\overline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)-
F(\underline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)
],
\] where \(F\) is the cumulative distribution function of the normal distribution and \(\underline y_i, \overline y_i\) are the lower/upper limits of the target interval for sequence \(i\). Some remarks:
- here we use a single input feature \(x_i = \log\log d_i\), but survreg can be used with more features.
- the survreg Normal model is \(\log \lambda_i \sim N(\mathbf w^\intercal \mathbf x_i+\beta, \sigma^2)\); other distributions are available (dist argument).
- the iterative optimization algorithms that survreg uses are only guaranteed to reach a local minimum (not the globally most likely model).

To perform a fairer comparison, we will hold out chromosome 11 as a test set, and only train the model using the other labeled chromosomes.
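The censored Gaussian likelihood in the objective above can be sketched in base R, with pnorm playing the role of \(F\) (toy values, not the tutorial data):

```r
## Negative log-likelihood of one interval-censored output under a
## Gaussian model with the given mean and scale.
neg.log.lik <- function(lower, upper, mean, scale){
  -log(pnorm(upper, mean, scale) - pnorm(lower, mean, scale))
}
## A prediction centered in the target interval (-1, 1) has lower
## loss than one far outside it:
neg.log.lik(-1, 1, mean=0, scale=1) < neg.log.lik(-1, 1, mean=3, scale=1)
## [1] TRUE
```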
library(survival)
train.dt <- six.features.targets[chromosome!=11]
New function: the survival::survreg
function is used below with three arguments:
- a formula output ~ inputs as in other R modeling functions such as lm – the difference is that the left/output side must be a Surv object. When learning a penalty function for changepoint detection, we always use type="interval2" because all outputs (target intervals of log penalty values) are either left, right, or interval censored.
- a data.frame of training data (train.dt below).
- a distribution (dist="gaussian" below).

(fit.survreg <- survreg(
Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
train.dt, dist="gaussian"))
## Call:
## survreg(formula = Surv(min.log.lambda, max.log.lambda, type = "interval2") ~
## feature, data = train.dt, dist = "gaussian")
##
## Coefficients:
## (Intercept) feature
## -0.4788175 0.7032863
##
## Scale= 0.4312716
##
## Loglik(model)= -2.9 Loglik(intercept only)= -2.9
## Chisq= 0.01 on 1 degrees of freedom, p= 0.908
## n= 30
It is clear that the coefficients of the learned model are different from the BIC/SIC model (slope=1, intercept=0). We can visualize the regression line in the feature space via the code below:
survreg.line.dt <- data.table(t(coef(fit.survreg)), model="survreg")
six.survreg.pred <- data.table(six.BIC.dt)
six.survreg.pred[, model := "survreg"]
six.survreg.pred[, pred.log.lambda := predict(fit.survreg, six.survreg.pred)]
six.survreg.res <- six.survreg.pred[six.features.targets, on=list(
profile.id, chromosome)]
six.survreg.res[, residual := 0]
six.survreg.res[{
is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
penaltyLearning::targetIntervalResidual(
cbind(min.log.lambda, max.log.lambda), pred.log.lambda)
}]
six.gg.survreg <- six.gg.bic+
geom_abline(aes(
slope=feature, intercept=`(Intercept)`, color=model),
size=1,
data=survreg.line.dt)+
geom_segment(aes(
feature, pred.log.lambda,
xend=feature, yend=pred.log.lambda-residual,
color=model),
data=six.survreg.res)
print(six.gg.survreg)
It is clear from the plot above that the learned survreg penalty reduces the number of incorrectly predicted target intervals from 5 to 2.
The plot below shows the predicted errors and the smooth surrogate loss minimized by the gradient descent algorithm implemented by the survreg function (gradient descent cannot minimize the non-convex error function directly).
six.survreg.pred[, pred.log.penalty := pred.log.lambda]
six.survreg.selection <- data.table(six.selection)[six.survreg.pred, on=list(
profile.id, chromosome,
min.log.lambda < pred.log.lambda,
max.log.lambda > pred.log.lambda)]
six.survreg.errors <- six.error.list$model.errors[six.survreg.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
log.pen.vec <- six.error.list$model.errors[, seq(
min(max.log.lambda), max(min.log.lambda), l=100)]
six.survreg.surrogate <- six.survreg.res[, {
## For censored observations, survreg finds a linear function that
## predicts mean values that maximize the normal CDF (pnorm) --
## equivalent to minimizing the negative log likelihood below.
neg.log.lik <- -log(
pnorm(max.log.lambda, log.pen.vec, fit.survreg$scale)-
pnorm(min.log.lambda, log.pen.vec, fit.survreg$scale))
data.table(
model="survreg",
log.penalty=log.pen.vec,
surrogate.loss=neg.log.lik
)[neg.log.lik < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg <- six.gg.errors.bic+
geom_point(aes(
pred.log.penalty, errors, color=model),
size=2,
data=six.survreg.errors)+
geom_line(aes(
log.penalty, surrogate.loss, color=model),
data=six.survreg.surrogate)
print(six.gg.errors.survreg)
Note in the plot above:
- for several data sequences the survreg predicted penalties achieve 0 errors where the BIC/SIC has 1 error.
- for at least one data sequence the reverse holds: the BIC/SIC achieves 0 errors where survreg has 1 error.

We can also visualize the survreg model predictions by plotting the data sequences via the code below.
six.survreg.labels <- six.error.list$label.errors[six.survreg.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.changes <- six.changes[six.survreg.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.segs <- six.segs[six.survreg.selection, on=list(
profile.id, chromosome, n.segments)]
gg.supervised+
ggtitle("survreg 1 feature model changepoints and label errors")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
fill=NA,
data=six.survreg.labels)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))+
geom_vline(aes(
xintercept=chromStart/1e6),
data=six.survreg.changes,
color="green",
size=1,
linetype="dashed")+
theme(legend.box="horizontal")
## Warning: Removed 65 rows containing missing values (geom_point).
The false positive and false negative predictions are evident in the plot above.
Exercise 3 during class: the survreg model above assumed a Gaussian distribution for the log(penalty) values. Would using another distribution increase accuracy? Hint: try re-doing the analyses in this section with the survreg(dist="logistic")
model. For other choices, see help(survreg)
. Also try other input features from feature.mat
.
Exercise after class: for one of the neuroblastoma data sequences plotted above, verify that changepoint::cpt.mean(data.vec, penalty="Manual", pen.value=PREDICTED.PENALTY, method="PELT")
and fpop::Fpop(data.vec, lambda=PREDICTED.PENALTY)
yield the same changepoints. Fpop implements a “functional pruning optimal partitioning” algorithm which is guaranteed to prune more sub-optimal changepoints (i.e. be faster than) the PELT algorithm. Read On optimal multiple changepoint algorithms for large data for more info.
Exercise after class: try re-doing the analyses in this section using penaltyLearning::IntervalRegressionUnregularized
, which learns an un-regularized model by minimizing the squared hinge loss with respect to the target interval limits. Is there any difference in prediction accuracy?
In the previous sections, we assumed that the log(penalty) is an affine function of a given single feature. We chose that feature based on the theoretical arguments of the BIC/SIC penalty. In this section, we show how a multi-variate affine penalty function can be learned using L1 regularization. This technique is great when you don’t have a theoretical prior for the penalty. You can compute as many features as you like (365 using penaltyLearning::featureMatrix), and the L1 regularization will ignore any irrelevant features automatically (feature selection).
The penalty function we learn is \(\log\lambda_i = f(\mathbf x_i) = \beta + \mathbf w^\intercal \mathbf x_i\) which is equivalent to solving the following penalized changepoint detection problem, \[ \operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}} \sum_{j=1}^{d_i} \ell(z_{ij}, m_j) + \underbrace{\exp(\beta+\mathbf w^\intercal \mathbf x_i)}_{\lambda_i} \sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}), \] where the intercept \(\beta\) and the weight vector \(\mathbf w\) are defined as the solution to the following optimization problem, \[ \operatorname{minimize}_{ \mathbf w\in\mathbb R^p, \beta\in\mathbb R } \gamma ||\mathbf w||_1 + \frac{1}{n} \sum_{i=1}^n \phi(\mathbf w^\intercal \mathbf x_i+\beta-\underline y_i)+ \phi(\overline y_i-\mathbf w^\intercal \mathbf x_i-\beta) \] where \(\phi\) is a squared hinge loss (zero when its argument is large enough, i.e. when the prediction is on the correct side of the target limit, and quadratically increasing otherwise), \(\underline y_i, \overline y_i\) are the lower/upper limits of the target interval for sequence \(i\), and \(\gamma>0\) is a regularization parameter that controls the number of non-zero entries of \(\mathbf w\) (chosen below via cross-validation).
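A minimal base-R sketch of a squared hinge loss \(\phi\), assuming a margin of 1 (the exact margin used by penaltyLearning's squared.hinge, which appears later in the plotting code, is an assumption here):

```r
squared.hinge <- function(x, margin=1){
  ifelse(x < margin, (x - margin)^2, 0)  # zero loss past the margin
}
squared.hinge(c(-1, 0, 1, 2))
## [1] 4 1 0 0
```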
This is a machine learning technique that automatically selects features that result in good prediction accuracy. First we create a feature and target matrix for the data in the train set:
finite.feature.mat <- feature.mat[, apply(is.finite(feature.mat), 2, all)]
train.ids <- train.dt[, paste(profile.id, chromosome) ]
train.feature.mat <- finite.feature.mat[train.ids, ]
train.target.mat <- train.dt[, cbind(min.log.lambda, max.log.lambda)]
str(train.feature.mat)
## num [1:30, 1:256] 474 250 191 154 171 428 234 171 146 153 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:30] "1 1" "1 2" "1 3" "1 4" ...
## ..$ : chr [1:256] "n.identity" "n.sqrt" "n.log" "n.loglog" ...
The training set consists of 30 examples, each with 256 features. Before training, we register the multiprocess future plan (the internal cross-validation loop to choose the regularization parameter will be computed in parallel).
if(require(future)){
options(mc.cores=parallel::detectCores()/2)
plan(multiprocess)
}
We use set.seed
below in order to make the results reproducible (there is randomness in the fold assignment in the internal cross-validation loop). Then, we use IntervalRegressionCV
to fit the regularized linear model.
New function: the penaltyLearning::IntervalRegressionCV
function computes an L1-regularized interval regression model. Its two main inputs are a numeric input feature matrix (n observations x p features), and a numeric output target interval matrix (n observations x 2). Its return value is a model fit object of class IntervalRegression
, which has predict
, print
, plot
, and coef
methods.
set.seed(1)
any.finite <- apply(is.finite(train.target.mat), 1, any)
fit.cv <- penaltyLearning::IntervalRegressionCV(
train.feature.mat[any.finite,], train.target.mat[any.finite,])
## Loading required namespace: future.apply
## Loading required namespace: directlabels
coef(fit.cv)
## 0.001
## (Intercept) -6.171689e-01
## data orig.identity.sum 9.338593e-04
## data orig.identity.quantile.0% 2.832772e-02
## data orig.identity.quantile.100% 1.868335e-03
## data orig.square.sum 2.378681e-05
## data orig.square.quantile.0% -5.036387e-02
## data abs.identity.sum 2.088770e-04
## data abs.sqrt.sum 7.485475e-05
## data abs.sqrt.quantile.0% 1.199808e-02
## data abs.sqrt.quantile.50% 3.671291e-06
## data abs.log.sum 9.699457e-07
## data abs.log.quantile.25% 2.118603e-05
## data abs.log.quantile.50% 2.747159e-04
## data abs.square.sum 1.622015e-05
## data square.log.quantile.25% 1.059009e-05
## data square.log.quantile.50% 1.373699e-04
## residual orig.sqrt.quantile.100% 2.954781e-02
## residual orig.square.sum 1.472951e+26
## residual abs.identity.quantile.75% 2.959378e-02
## residual abs.square.quantile.75% -2.102896e-02
## residual abs.square.quantile.100% 1.492553e-03
## residual square.identity.quantile.75% -2.132852e-02
## residual square.identity.quantile.100% 1.492553e-03
## residual square.sqrt.quantile.75% 2.996788e-02
## diff orig.identity.sum -3.828749e-01
## diff orig.identity.mean -7.781112e-01
## diff orig.identity.quantile.0% -5.457965e-01
## diff orig.identity.quantile.50% -3.095173e+00
## diff orig.identity.quantile.100% -9.918066e-04
## diff orig.square.quantile.0% 6.242716e-01
## diff orig.square.quantile.25% 3.672376e-01
## diff orig.square.quantile.50% 3.306278e+00
## diff orig.square.quantile.100% -5.921755e-03
## diff abs.square.quantile.50% 2.513148e-02
## diff abs.square.quantile.100% 4.326546e-05
## diff square.identity.quantile.50% 3.748871e-01
## diff square.identity.quantile.100% 4.326546e-05
## diff square.square.mean 5.930345e+01
## diff square.square.sd 1.043750e+01
## diff square.square.quantile.25% 1.782259e+01
## diff square.square.quantile.50% 1.453184e+03
## diff square.square.quantile.75% 2.065200e+00
## diff square.square.quantile.100% 1.859673e-01
It is clear from the output above that even though the model was trained using 256 possible input features, it has selected only a few of them to use in the prediction function. We visualize the predicted penalty values as dots in the error curve plots below:
six.cv.pred <- data.table(six.survreg.pred)
six.cv.pred[, pred.log.penalty := {
pred.feat.mat <- finite.feature.mat[paste(profile.id, chromosome), ]
predict(fit.cv, pred.feat.mat)
}]
six.cv.pred[, model := "IntervalRegressionCV"]
six.cv.pred[, pred.log.lambda := pred.log.penalty]
six.cv.selection <- data.table(six.selection)[six.cv.pred, on=list(
profile.id, chromosome,
min.log.lambda < pred.log.lambda,
max.log.lambda > pred.log.lambda)]
six.cv.errors <- six.error.list$model.errors[six.cv.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.cv.surrogate <- six.survreg.res[, {
## IntervalRegressionCV minimizes the squared hinge loss.
data.table(
model="IntervalRegressionCV",
log.penalty=log.pen.vec,
surrogate.loss=penaltyLearning::squared.hinge(log.pen.vec-min.log.lambda)+
penaltyLearning::squared.hinge(max.log.lambda-log.pen.vec)
)[surrogate.loss < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg+
geom_point(aes(
pred.log.penalty, errors, color=model),
size=2,
data=six.cv.errors)+
geom_line(aes(
log.penalty, surrogate.loss, color=model),
data=six.cv.surrogate)
The plot above also shows the surrogate loss that IntervalRegressionCV
minimizes using a gradient descent algorithm. Note how it is very similar to the survreg
surrogate loss – both are smooth functions which increase quadratically as the predicted value goes past the target interval limit.
Finally, we can also visualize the IntervalRegressionCV
model in terms of the data sequences, labels, and predicted segmentation:
six.cv.labels <- six.error.list$label.errors[six.cv.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.cv.changes <- six.changes[six.cv.selection, on=list(
profile.id, chromosome, n.segments), nomatch=0L]
six.cv.segs <- six.segs[six.cv.selection, on=list(
profile.id, chromosome, n.segments)]
gg.supervised+
ggtitle("IntervalRegressionCV multi-feature model changepoints and label errors")+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6,
xmax=max/1e6,
linetype=status),
fill=NA,
data=six.cv.labels)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))+
geom_vline(aes(
xintercept=chromStart/1e6),
data=six.cv.changes,
color="green",
size=1,
linetype="dashed")+
theme(legend.box="horizontal")
## Warning: Removed 65 rows containing missing values (geom_point).
The plot above clearly shows the advantage of the machine learning approach: all labels are correctly predicted, including all labels on the test set (chromosome 11).
Exercise after class: by default the IntervalRegressionCV
function chooses the regularization parameter with minimum error, as estimated using an internal cross-validation loop. Another option is to choose the least complex model within 1 standard deviation. Re-fit the model using IntervalRegressionCV(reg.type="1sd")
, view the error as a function of model complexity via plot(fit.cv)
, then re-do the model prediction error plots. Is this model more or less accurate? Why? Hint: use coef(fit.cv)
to see how many features are included in the model.
Since we did not use the chromosome 11 labels to train any of the models above, we can use those labels to evaluate the prediction accuracy of the learned models.
Below we create a list with an element for each of the three models we want to compare. Then we use a for
loop to compute ROC curves for each model, with respect to the test data on chromosome 11.
New function: the penaltyLearning::ROChange
function computes Receiver Operating Characteristic curves for penalty function learning problems. It inputs the model.errors
data.table from penaltyLearning::labelError
, and a data.table with predicted log(penalty) values in the pred.log.lambda
column. It outputs a list which describes the ROC curves. Below we use the roc
element which is a data.table with one row for each point on the ROC curve. We also use the thresholds
element, which is a data.table which has a row with the error rates for the predicted threshold. For more info read help(ROChange)
.
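As a rough sketch (toy counts, not the tutorial data), each point on such an ROC curve comes from adding a constant to all predicted log(penalty) values and recounting the resulting label errors:

```r
## Hypothetical false positive/negative counts as the added constant
## increases (larger penalties mean fewer changes: fp falls, fn rises).
fp <- c(3, 2, 1, 0); possible.fp <- 3
fn <- c(0, 0, 1, 3); possible.fn <- 3
FPR <- fp/possible.fp      # false positive rate
TPR <- 1 - fn/possible.fn  # true positive rate = tp/possible.tp
rbind(FPR, TPR)
```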
model.pred.list <- list(
IntervalRegressionCV=six.cv.pred,
"BIC/SIC"=six.BIC.dt,
survreg=six.survreg.pred)
roc.list <- list()
auc.list <- list()
for(model in names(model.pred.list)){
model.pred <- model.pred.list[[model]]
model.roc <- penaltyLearning::ROChange(
six.error.list$model.errors, model.pred[chromosome==11],
problem.vars=c("profile.id", "chromosome"))
roc.list[[model]] <- data.table(model, model.roc$roc)
auc.list[[model]] <- with(model.roc, data.table(
model, auc, thresholds[threshold=="predicted"]))
}
roc <- do.call(rbind, roc.list)
(auc <- do.call(rbind, auc.list))
## model auc threshold labels possible.fp possible.fn min.thresh max.thresh
## 1: IntervalRegressionCV 1.0000000 predicted 6 3 3 -2.3028026 0.3027037
## 2: BIC/SIC 0.8888889 predicted 6 3 3 -1.0752140 0.4493126
## 3: survreg 0.8888889 predicted 6 3 3 -0.1133423 1.4082383
## 8 variables not shown: [fp, fn, min.fp.fn, errors, FPR, tp, TPR, error.percent]
The table above shows the error/accuracy of the three models with respect to the labels in the test data (chromosome 11). It is clear that IntervalRegressionCV is the most accurate model (higher AUC and fewer errors). The ROC curves are shown below:
ggplot()+
scale_color_manual(values=model.colors)+
geom_path(aes(
FPR, TPR, color=model, linetype=model),
size=1,
data=roc)+
geom_point(aes(
FPR, TPR, color=model),
size=3,
data=auc)+
coord_equal()
However, with so few labels, and only one test set, it is difficult to tell whether there is any significant difference between models. Below we use K-fold cross-validation to evaluate the models more rigorously.
Since the neuroblastoma data set is relatively big (3418 labeled data sequences), we have pre-computed the Segmentor models, which we load using the code below. (To re-compute them on your computer, run Segmentor.models.R.)
if(!file.exists("Segmentor.models.RData")){
download.file(
"https://rcdata.nau.edu/genomic-ml/Segmentor.models.RData",
"Segmentor.models.RData")
}
load("Segmentor.models.RData")
Before performing cross-validation, we can compute the error functions, features and target intervals for all of the labeled data.
nb.selection <- Segmentor.models$loss[, {
penaltyLearning::modelSelection(
.SD, complexity="n.segments")
}, by=list(profile.id, chromosome)]
nb.error.list <- penaltyLearning::labelError(
nb.selection,
neuroblastoma$annotations,
Segmentor.models$segs[1 < start],
problem.vars=c("profile.id", "chromosome")
)
nb.target.dt <- penaltyLearning::targetIntervals(
nb.error.list$model.errors,
problem.vars=c("profile.id", "chromosome"))
labeled.profiles <- data.table(
neuroblastoma$profiles)[nb.target.dt[, list(
profile.id, chromosome)], on=list(
profile.id, chromosome)]
labeled.feature.mat <- penaltyLearning::featureMatrix(
labeled.profiles,
problem.vars=c("profile.id", "chromosome"),
data.var="logratio")
Rather than training on all features (which is possible but takes a bit too long for this tutorial), we will just use the following subset:
(some.feature.names <- c(
colnames(labeled.feature.mat)[1:5],
rownames(coef(fit.cv))[-1]))
## [1] "n.identity" "n.sqrt"
## [3] "n.log" "n.loglog"
## [5] "n.square" "data orig.identity.sum"
## [7] "data orig.identity.quantile.0%" "data orig.identity.quantile.100%"
## [9] "data orig.square.sum" "data orig.square.quantile.0%"
## [11] "data abs.identity.sum" "data abs.sqrt.sum"
## [13] "data abs.sqrt.quantile.0%" "data abs.sqrt.quantile.50%"
## [15] "data abs.log.sum" "data abs.log.quantile.25%"
## [17] "data abs.log.quantile.50%" "data abs.square.sum"
## [19] "data square.log.quantile.25%" "data square.log.quantile.50%"
## [21] "residual orig.sqrt.quantile.100%" "residual orig.square.sum"
## [23] "residual abs.identity.quantile.75%" "residual abs.square.quantile.75%"
## [25] "residual abs.square.quantile.100%" "residual square.identity.quantile.75%"
## [27] "residual square.identity.quantile.100%" "residual square.sqrt.quantile.75%"
## [29] "diff orig.identity.sum" "diff orig.identity.mean"
## [31] "diff orig.identity.quantile.0%" "diff orig.identity.quantile.50%"
## [33] "diff orig.identity.quantile.100%" "diff orig.square.quantile.0%"
## [35] "diff orig.square.quantile.25%" "diff orig.square.quantile.50%"
## [37] "diff orig.square.quantile.100%" "diff abs.square.quantile.50%"
## [39] "diff abs.square.quantile.100%" "diff square.identity.quantile.50%"
## [41] "diff square.identity.quantile.100%" "diff square.square.mean"
## [43] "diff square.square.sd" "diff square.square.quantile.25%"
## [45] "diff square.square.quantile.50%" "diff square.square.quantile.75%"
## [47] "diff square.square.quantile.100%"
Next we assign a fold to each target:
set.seed(1)
n.folds <- 5
nb.target.dt[, fold := sample(rep(1:n.folds, l=.N))]
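Since sample(rep(1:n.folds, l=.N)) permutes an approximately balanced fold vector, each fold should contain roughly the same number of labeled sequences. A quick optional sanity check:

```r
## count labeled sequences per fold; counts should differ by at most one
nb.target.dt[, .N, by=fold][order(fold)]
```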
Then we loop over test folds. In each iteration of the for loop, we create train.target.dt and train.features by excluding data in the test fold. We then train each model using these data, and compute prediction accuracy for each model using penaltyLearning::ROChange.
cv.roc.list <- list()
cv.auc.list <- list()
for(test.fold in 1:n.folds){
train.target.dt <- nb.target.dt[fold!=test.fold]
train.id.vec <- train.target.dt[, paste(profile.id, chromosome)]
train.features <- labeled.feature.mat[train.id.vec, ]
train.target.mat <- train.target.dt[, cbind(min.log.lambda, max.log.lambda)]
print(test.fold)
set.seed(1)
fit <- penaltyLearning::IntervalRegressionCV(
train.features[, some.feature.names], train.target.mat)
train.target.dt$feature <- train.features[, "n.loglog"]
fit.survreg <- survreg(
Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
train.target.dt, dist="gaussian")
## compute predictions
test.target.dt <- nb.target.dt[fold==test.fold]
test.id.vec <- test.target.dt[, paste(profile.id, chromosome)]
test.feature.vec <- labeled.feature.mat[test.id.vec, "n.loglog"]
  ## baseline: grid search for the constant penalty inside the most train target intervals
const.pen.vec <- seq(-4, 4, by=0.1)
const.acc.vec <- sapply(const.pen.vec, function(pred.penalty){
sum(train.target.mat[,1] < pred.penalty & pred.penalty < train.target.mat[,2])
})
best.constant <- const.pen.vec[which.max(const.acc.vec)]
pred.list <- list(
constant=rep(best.constant, length(test.feature.vec)),
"BIC/SIC"=test.feature.vec,
survreg=predict(fit.survreg, data.table(feature=test.feature.vec)),
IntervalRegressionCV=predict(fit, labeled.feature.mat[test.id.vec,]))
for(model in names(pred.list)){
pred.dt <- data.table(
test.target.dt,
pred.log.lambda=as.numeric(pred.list[[model]]))
pred.roc <- penaltyLearning::ROChange(
nb.error.list$model.errors,
pred.dt,
problem.vars=c("profile.id", "chromosome"))
cv.roc.list[[paste(test.fold, model)]] <- data.table(
test.fold, model, pred.roc$roc)
cv.auc.list[[paste(test.fold, model)]] <- with(pred.roc, data.table(
test.fold, model, auc, thresholds[threshold=="predicted"]))
}
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
cv.auc <- do.call(rbind, cv.auc.list)
cv.roc <- do.call(rbind, cv.roc.list)
Note that in the code above, we used cv.roc.list to save the ROC curve for each model and test fold, and cv.auc.list to save the AUC. We plot the ROC curves below in order to visualize what kinds of errors the models commit (false positives or false negatives).
ggplot()+
scale_color_manual(values=model.colors)+
geom_path(aes(
FPR, TPR, color=model, group=paste(model, test.fold)),
data=cv.roc)+
geom_point(aes(
FPR, TPR, color=model),
fill="white",
shape=21,
data=cv.auc)+
coord_equal(xlim=c(0, 0.5), ylim=c(0.5, 1))
It is clear from the ROC curves above that the BIC/SIC model has a relatively low true positive rate (high false negative rate: many breakpoint/positive labels with no predicted changepoints).
Below we compute and plot two accuracy metrics, AUC and percent incorrectly predicted labels.
cv.auc[, accuracy.percent := 100-error.percent]
cv.auc.tall <- melt(cv.auc, measure.vars=c("accuracy.percent", "auc"))
cv.means <- cv.auc.tall[, list(
mean=mean(value),
sd=sd(value)
), by=list(model, variable)]
levs <- cv.means[variable=="auc"][order(mean), model]
cv.auc.tall[, model.fac := factor(model, levs)]
ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(. ~ variable, scales="free")+
scale_color_manual(values=model.colors)+
guides(color="none")+
geom_point(aes(
value, model.fac, color=model),
data=cv.auc.tall)+
xlab("")+
ylab("model")
It is clear from the plot above that survreg is significantly more accurate than the BIC/SIC and constant penalties, and IntervalRegressionCV is slightly more accurate than survreg.
Remark 1: the accuracy rates above (BIC/SIC and constant penalties predict 92% correct labels whereas survreg and IntervalRegressionCV predict 97-98% correct labels) are specific to the neuroblastoma data set. However, in general you can expect that unsupervised penalty functions like BIC/SIC will be less accurate than supervised methods such as IntervalRegressionCV. This is the main motivation of supervised changepoint analysis: we can use labels to learn a penalty function with improved changepoint detection accuracy.
Remark 2: To learn the most accurate model, it is best to use all training data (instead of holding out some of the labels as a test set). The figure-neuroblastoma-predictions.R script fits an IntervalRegressionCV model to the entire neuroblastoma data set, uses the model to predict changepoints for all data sequences (even those without labels), then plots the predicted changepoints along with the train errors. Some overfitting/extrapolation errors are evident. For example some chrY sequences have too many changes — this could be fixed by adding some negative/normal labels on those sequences.
Conclusion: in labeled data sequences with a common signal/noise pattern, supervised learning algorithms can be used to predict penalty values and changepoints which are more accurate than unsupervised methods. An advantage of the supervised learning approach is that we can quantitatively evaluate the learned models by counting the number of incorrectly predicted labels in a held-out test set.
Exercise after class: is the logistic or Gaussian distribution a more accurate model for the log(penalty) values? In the code block above, add some code for computing the model predictions and error rates for survreg(dist="logistic"). Then re-do the plots in this section; there should be 4 models instead of 3 in the last plot.
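A hedged sketch of one way to do that exercise: inside the test-fold loop, fit a second survreg model with dist="logistic", then add its predictions to pred.list so the inner loop evaluates it alongside the other models. The model name survreg.logistic is an arbitrary choice.

```r
## Sketch only: runs inside the existing test-fold loop, after fit.survreg.
fit.logistic <- survreg(
  Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
  train.target.dt, dist="logistic")
## add a fourth model to pred.list before the inner for(model in ...) loop
pred.list$survreg.logistic <- predict(
  fit.logistic, data.table(feature=test.feature.vec))
```

Note that model.colors would also need an entry for the new model name, otherwise the scale_color_manual calls will drop it from the plots.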