See the GitHub repository for links to source code and exercises: https://github.com/tdhock/change-tutorial
Before executing the code in this tutorial Rmd, make sure to install the required packages:
packages.R <- if(file.exists("packages.R")){
"packages.R"
}else{
"https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R"
}
source(packages.R)
## Loading required package: survival
## Loading required package: data.table
## Loading required package: neuroblastoma
## Loading required package: future
##
## Attaching package: 'future'
## The following object is masked from 'package:survival':
##
## cluster
## Loading required package: Segmentor3IsBack
## Segmentor3IsBack v2.0 Loaded
## Loading required package: changepoint
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Successfully loaded changepoint package version 2.3.1
## NOTE: Predefined penalty values changed in version 2.2. Previous penalty values with a postfix 1 i.e. SIC1 are now without i.e. SIC and previous penalties without a postfix i.e. SIC are now with a postfix 0 i.e. SIC0. See NEWS and help files for further details.
## Loading required package: directlabels
## Loading required package: ggplot2
## Loading required package: geometry
## Loading required package: penaltyLearning
options(width=100)
There are two features of a data set that make it possible to do a supervised changepoint analysis:
The goal of the supervised analysis is to find a model that learns from the limited labeled data, and provides accurate changepoint predictions for all of the data (a different number of changepoints for each separate data sequence).
We will use the neuroblastoma data set to explore supervised changepoint detection.
data(neuroblastoma, package="neuroblastoma")
str(neuroblastoma)
## List of 2
## $ profiles :'data.frame': 4616846 obs. of 4 variables:
## ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 7 7 7 7 7 7 7 7 7 7 ...
## ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ position : int [1:4616846] 809681 928433 987423 1083595 1125548 1199359 1392490 1672814 1985786 2046695 ...
## ..$ logratio : num [1:4616846] -0.01742 0.06626 0.06764 0.04264 0.00576 ...
## $ annotations:'data.frame': 3418 obs. of 5 variables:
## ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 207 322 379 144 388 497 564 331 189 315 ...
## ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 11 11 11 11 11 11 11 11 11 11 ...
## ..$ min : int [1:3418] 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 ...
## ..$ max : int [1:3418] 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 ...
## ..$ annotation: Factor w/ 2 levels "breakpoint","normal": 2 2 2 2 2 2 1 2 2 2 ...
lapply(neuroblastoma, head, 10)
## $profiles
## profile.id chromosome position logratio
## 1 8 1 809681 -0.017417053
## 2 8 1 928433 0.066261442
## 3 8 1 987423 0.067638717
## 4 8 1 1083595 0.042644337
## 5 8 1 1125548 0.005759269
## 6 8 1 1199359 0.000000000
## 7 8 1 1392490 0.069014678
## 8 8 1 1672814 0.204140717
## 9 8 1 1985786 0.039840265
## 10 8 1 2046695 0.073134705
##
## $annotations
## profile.id chromosome min max annotation
## 1 213 11 53700000 135006516 normal
## 2 333 11 53700000 135006516 normal
## 3 393 11 53700000 135006516 normal
## 4 148 11 53700000 135006516 normal
## 5 402 11 53700000 135006516 normal
## 6 521 11 53700000 135006516 normal
## 7 590 11 53700000 135006516 breakpoint
## 8 342 11 53700000 135006516 normal
## 9 195 11 53700000 135006516 normal
## 10 326 11 53700000 135006516 normal
The neuroblastoma data set consists of two data.frames:
These data come from children who were treated at the Institut Curie (Paris, France). For six children we also have follow-up data on whether they recovered (ok) or had a relapse, several years after treatment:
followup <- data.frame(
profile.id=paste(c(10, 8, 4, 6, 11, 1)),
status=c("ok", "relapse", "relapse", "ok", "ok", "relapse"))
followup
## profile.id status
## 1 10 ok
## 2 8 relapse
## 3 4 relapse
## 4 6 ok
## 5 11 ok
## 6 1 relapse
When there are no labels, the input for a changepoint analysis just uses the noisy DNA copy number data in neuroblastoma$profiles$logratio
. We begin by selecting the profiles of the six patients for which we have follow-up data.
rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
some <- subset(all.profiles, paste(profile.id) %in% some.ids)
status.profile <- followup[paste(some$profile.id), "status.profile"]
some$status.profile <- ifelse(
is.na(status.profile), paste(some$profile.id), status.profile)
data.table(some)
}
neuroblastoma.dt <- data.table(neuroblastoma$profiles)
## neuroblastoma.dt[, logratio.norm := logratio/mad(logratio), by=list(
## profile.id, chromosome)]
neuroblastoma.dt[, logratio.norm := logratio]
six.profiles <- someProfiles(neuroblastoma.dt)
For these six patients, the total number of separate changepoint detection problems is 24 chromosomes x 6 patients = 144 problems. We plot each problem below in a separate facet (panel).
library(ggplot2)
gg.unsupervised <- ggplot()+
ggtitle("unsupervised changepoint detection = only noisy data sequences")+
theme(
panel.spacing=grid::unit(0, "lines"),
panel.border=element_rect(fill=NA, color="grey50")
)+
facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
geom_point(aes(position/1e6, logratio),
data=six.profiles,
shape=1)+
scale_x_continuous(
"position on chromosome (mega bases)",
breaks=c(100, 200))+
scale_y_continuous(
"logratio (approximate DNA copy number)",
limits=c(-1,1)*1.1)
print(gg.unsupervised)
## Warning: Removed 65 rows containing missing values (geom_point).
Note the facet titles on the right:
Question: can you visually identify any differences between the “ok” and “relapse” profiles?
How to choose the number of changepoints? In the unsupervised setting, our only option is to use theoretical/statistical arguments, which give information criteria such as Aikaike Information Criterion (AIC, penalty=2) or Bayesian/Schwarz Information Criterion (BIC/SIC, penalty=log n). More on this later.
In supervised changepoint detection, there are labels which indicate presence and absence of changepoints in particular data subsets. For the same set of 6 profiles, we superimpose the labels in the plot below.
six.labels <- someProfiles(neuroblastoma$annotations)
change.colors <- c(
"1change"="#ff7d7d",
breakpoint="#a445ee",
normal="#f6c48f"
)
gg.supervised <- gg.unsupervised+
ggtitle(paste(
"supervised changepoint detection = data + labels",
"that indicate specific regions with or without changepoints"))+
penaltyLearning::geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, fill=annotation),
alpha=0.5,
color=NA,
data=six.labels)+
scale_fill_manual("label", values=change.colors)+
theme(legend.position="bottom")
print(gg.supervised)
## Warning: Removed 65 rows containing missing values (geom_point).
New function: geom_tallrect
is like geom_rect
but the ymin is always the bottom of the panel and the ymax is always the top of the panel. These are useful for drawing labeled intervals of x values.
It is clear from this plot that the neuroblastoma data satisfy the two criteria which are necessary for a supervised changepoint analysis:
As we will see later in the tutorial, the labels can be used for choosing the best model and parameters. The main idea is that the changepoint model should be consistent with the labels:
In this section we show that typical unsupervised changepoint models result in prediction errors with respect to the labels in the neuroblastoma data. We begin by using the changepoint::cpt.mean
function to compute an unsupervised changepoint model for each data sequence. To compute the number of incorrect labels later, make sure to save a description of the model in a “tidy” data.frame (observations on rows and variables on columns). In the code below we
by=list(status.profile, chromosome)
which adds those two columns to the resulting data.table
.pen.name
column to identify the model.changes
list column to save the changepoint positions (in the same units as the label positions: bases on the chromosome).pen.name <- "SIC0"
(unsupervised.models <- six.profiles[, {
fit.pelt <- changepoint::cpt.mean(
logratio.norm, penalty=pen.name, method="PELT")
end <- fit.pelt@cpts
before.change <- end[-length(end)]
after.change <- before.change+1L
data.table(
pen.name,
pen.value=fit.pelt@pen.value,
changes=list(
as.integer((position[before.change]+position[after.change])/2)
))
}, by=list(profile.id, status.profile, chromosome)])
## profile.id status.profile chromosome pen.name pen.value changes
## 1: 8 relapse 8 1 SIC0 6.013715
## 2: 8 relapse 8 2 SIC0 5.375278
## 3: 8 relapse 8 3 SIC0 5.081404
## 4: 8 relapse 8 4 SIC0 4.882802
## 5: 8 relapse 8 5 SIC0 5.164786
## ---
## 140: 10 ok 10 20 SIC0 4.828314
## 141: 10 ok 10 21 SIC0 4.110874
## 142: 10 ok 10 22 SIC0 4.543295
## 143: 10 ok 10 X SIC0 5.231109
## 144: 10 ok 10 Y SIC0 3.295837
Next, we convert the list column changes
to tall format (one row for each changepoint).
(unsupervised.changes <- unsupervised.models[, data.table(
change=changes[[1]]
), by=list(profile.id, status.profile, chromosome, pen.name)])
## profile.id status.profile chromosome pen.name change
## 1: 8 relapse 8 7 SIC0 141405948
## 2: 8 relapse 8 11 SIC0 70573642
## 3: 8 relapse 8 17 SIC0 29895917
## 4: 8 relapse 8 Y SIC0 4459374
## 5: 11 ok 11 1 SIC0 32819999
## 6: 11 ok 11 Y SIC0 5918094
## 7: 11 ok 11 Y SIC0 11393013
## 8: 11 ok 11 Y SIC0 23715526
## 9: 4 relapse 4 1 SIC0 59792500
## 10: 4 relapse 4 2 SIC0 45164625
## 11: 4 relapse 4 3 SIC0 69953902
## 12: 4 relapse 4 14 SIC0 76603452
## 13: 4 relapse 4 17 SIC0 41646489
## 14: 1 relapse 1 1 SIC0 212809180
## 15: 1 relapse 1 7 SIC0 60381530
## 16: 1 relapse 1 11 SIC0 80058339
## 17: 1 relapse 1 Y SIC0 5918094
## 18: 1 relapse 1 Y SIC0 9420652
## 19: 1 relapse 1 Y SIC0 19070635
New function: we now introduce the penaltyLearning::labelError
function, which computes the number of incorrect labels for a set of predicted changepoints. Its inputs are 3 data.frames:
models
one row per changepoint model.labels
one row for each label, with columns for label location and type.changes
one row for each predicted changepoint, with a column for the predicted changepoint position.unsupervised.error.list <- penaltyLearning::labelError(
unsupervised.models, six.labels, unsupervised.changes,
problem.vars=c("status.profile", "chromosome"),
model.vars="pen.name",
change.var="change")
The result of labelError
is a list of two data tables:
model.errors
has one row per (sequence,model), with the total number of incorrect labels. We use these totals in the ggtitle
of the plot below.label.errors
has one row per (sequence,model,label), with the fp/fn status of each label. We use these to visualize false negative and false positive predictions using the geom_tallrect
with aes(linetype=status)
below.gg.supervised+
theme(legend.box="horizontal")+
unsupervised.error.list$model.errors[, ggtitle(paste0(
"Unsupervised ", pen.name, " penalty has ",
sum(errors),
" incorrect labels: ",
sum(fp), " FP + ",
sum(fn), " FN"))]+
geom_vline(aes(
xintercept=change/1e6),
color="green",
size=1,
linetype="dashed",
data=unsupervised.changes)+
geom_tallrect(aes(
xmin=min/1e6, xmax=max/1e6, linetype=status),
fill=NA,
data=unsupervised.error.list$label.errors)+
scale_linetype_manual("error type", values=c(
correct=0,
"false negative"=3,
"false positive"=1))
## Warning: Removed 65 rows containing missing values (geom_point).