My collaborator Joe Barr of Acronis is working on an application of machine learning to predicting security vulnerabilities, based on source code analysis and a database of exploits. The data they are using for training/testing their models are highly imbalanced, so I suggested trying my new Area Under the Min(FP,FN) loss function (AUM). After talking with them about the easiest way to try this new loss function on their data, I coded it in torch and verified that the gradients computed by torch are reasonable with respect to the expected directional derivatives. But does it work for learning? Yes, it does, as I show below.

Reading and scaling spam data

As an example data set, consider the spam data, one of the examples from the Elements of Statistical Learning textbook.

import pandas as pd
spam_df = pd.read_csv(
    "~/teaching/cs570-spring-2022/data/spam.data",
    header=None,
    sep=" ")
spam_df
##         0     1     2    3     4     5   ...     52     53     54   55    56  57
## 0     0.00  0.64  0.64  0.0  0.32  0.00  ...  0.000  0.000  3.756   61   278   1
## 1     0.21  0.28  0.50  0.0  0.14  0.28  ...  0.180  0.048  5.114  101  1028   1
## 2     0.06  0.00  0.71  0.0  1.23  0.19  ...  0.184  0.010  9.821  485  2259   1
## 3     0.00  0.00  0.00  0.0  0.63  0.00  ...  0.000  0.000  3.537   40   191   1
## 4     0.00  0.00  0.00  0.0  0.63  0.00  ...  0.000  0.000  3.537   40   191   1
## ...    ...   ...   ...  ...   ...   ...  ...    ...    ...    ...  ...   ...  ..
## 4596  0.31  0.00  0.62  0.0  0.00  0.31  ...  0.000  0.000  1.142    3    88   0
## 4597  0.00  0.00  0.00  0.0  0.00  0.00  ...  0.000  0.000  1.555    4    14   0
## 4598  0.30  0.00  0.30  0.0  0.00  0.00  ...  0.000  0.000  1.404    6   118   0
## 4599  0.96  0.00  0.00  0.0  0.32  0.00  ...  0.000  0.000  1.147    5    78   0
## 4600  0.00  0.00  0.65  0.0  0.00  0.00  ...  0.000  0.000  1.250    5    40   0
## 
## [4601 rows x 58 columns]

After reading the data into Python above, we center and scale the input features below,

import numpy as np
spam_features = spam_df.iloc[:,:-1].to_numpy()
spam_labels = spam_df.iloc[:,-1].to_numpy()
# feature scaling: center each column, then divide by its standard deviation
spam_mean = spam_features.mean(axis=0)
spam_sd = np.sqrt(spam_features.var(axis=0))
scaled_features = (spam_features-spam_mean)/spam_sd
scaled_features.mean(axis=0)
## array([ 1.85318688e-17,  2.77978032e-17,  2.47091584e-17,  0.00000000e+00,
##         4.94183168e-17,  3.70637376e-17, -2.47091584e-17,  0.00000000e+00,
##         2.47091584e-17,  1.23545792e-17, -2.47091584e-17,  6.94945080e-18,
##         0.00000000e+00, -3.39750928e-17,  0.00000000e+00,  7.41274751e-17,
##        -1.23545792e-17, -2.47091584e-17,  8.64820543e-17, -1.23545792e-17,
##        -4.94183168e-17, -1.85318688e-17,  0.00000000e+00, -2.47091584e-17,
##         0.00000000e+00,  1.23545792e-17, -1.23545792e-17,  0.00000000e+00,
##        -1.23545792e-17, -3.70637376e-17,  0.00000000e+00, -3.70637376e-17,
##         6.17728960e-18,  1.85318688e-17, -1.85318688e-17,  1.23545792e-17,
##        -3.70637376e-17,  0.00000000e+00,  0.00000000e+00,  2.47091584e-17,
##         0.00000000e+00, -5.55956064e-17, -4.32410272e-17,  1.85318688e-17,
##        -2.47091584e-17, -2.47091584e-17, -1.85318688e-17, -6.17728960e-18,
##         3.08864480e-17, -7.41274751e-17,  1.54432240e-17,  0.00000000e+00,
##        -2.47091584e-17,  0.00000000e+00,  4.32410272e-17,  1.23545792e-17,
##         2.47091584e-17])
scaled_features.var(axis=0)
## array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
##        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
##        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
##        1., 1., 1., 1., 1., 1.])
scaled_features
## array([[-3.42433707e-01,  3.30884903e-01,  7.12858774e-01, ...,
##         -4.52472762e-02,  4.52979198e-02, -8.72413388e-03],
##        [ 3.45359395e-01,  5.19091945e-02,  4.35129540e-01, ...,
##         -2.44326749e-03,  2.50562832e-01,  1.22832407e+00],
##        [-1.45921392e-01, -1.65071912e-01,  8.51723390e-01, ...,
##          1.45920848e-01,  2.22110599e+00,  3.25873251e+00],
##        ...,
##        [ 6.40127868e-01, -1.65071912e-01,  3.83734930e-02, ...,
##         -1.19382054e-01, -2.36941335e-01, -2.72627750e-01],
##        [ 2.80176333e+00, -1.65071912e-01, -5.56760578e-01, ...,
##         -1.27482666e-01, -2.42072958e-01, -3.38603654e-01],
##        [-3.42433707e-01, -1.65071912e-01,  7.32696576e-01, ...,
##         -1.24236117e-01, -2.42072958e-01, -4.01280763e-01]])

The above output indicates that each feature column is indeed scaled to mean 0 and variance 1.
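For readers who prefer a library function, the same centering/scaling can be computed via scikit-learn; below is a minimal sketch assuming scikit-learn is installed (not part of the original code),

from sklearn.preprocessing import StandardScaler
# StandardScaler also centers each column and divides by its (population)
# standard deviation, so it should agree with the manual computation
# above up to numerical error.
sk_scaled = StandardScaler().fit_transform(spam_features)
np.abs(sk_scaled - scaled_features).max()  # expect a value near zero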

Dividing the data into subtrain and validation sets

Next, we randomly divide the data into a subtrain and validation set. The gradient descent parameter updates will be computed using the subtrain set, and the validation set will be used to check for overfitting.

np.random.seed(1)
n_folds = 5
fold_vec = np.random.randint(low=0, high=n_folds, size=spam_labels.size)
validation_fold = 0
is_set_dict = {
    "validation":fold_vec == validation_fold,
    "subtrain":fold_vec != validation_fold,
}
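As a quick check, not in the original code, we can count how many observations were randomly assigned to each fold,

# number of observations assigned to each of the 5 folds.
np.bincount(fold_vec)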

Having defined boolean vectors above that indicate each set, we use them below to construct the corresponding feature and label tensors,

import torch
set_features = {}
set_labels = {}
for set_name, is_set in is_set_dict.items():
    set_features[set_name] = torch.from_numpy(
        scaled_features[is_set,:]).float()
    set_labels[set_name] = torch.from_numpy(
        spam_labels[is_set]).float()
{set_name:array.shape for set_name, array in set_features.items()}
## {'validation': torch.Size([921, 57]), 'subtrain': torch.Size([3680, 57])}

The output above shows the number of rows and columns in each data set.
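Since the motivation for AUM is class imbalance, it is also worth checking the label proportions in each set; here is a quick check, not part of the original code,

# proportion of positive (spam) labels in each set.
{set_name:labels.mean().item() for set_name, labels in set_labels.items()}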

Defining Dataset and DataLoader, neural network, loss function

To run stochastic gradient descent with a given batch size, it is easiest in torch to create a Dataset sub-class and a DataLoader instance, as we do below:

class CSV(torch.utils.data.Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
    def __getitem__(self, item):
        return self.features[item,:], self.labels[item]
    def __len__(self):
        return len(self.labels)
subtrain_dataset = CSV(set_features["subtrain"], set_labels["subtrain"])
subtrain_dataloader = torch.utils.data.DataLoader(
    subtrain_dataset, batch_size=20, shuffle=True)
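To see what the DataLoader yields, we can inspect one batch; a quick sketch, with variable names chosen here for illustration,

# each iteration yields a (features, labels) tuple for one random batch.
one_features, one_labels = next(iter(subtrain_dataloader))
one_features.shape, one_labels.shape  # expect [20, 57] and [20]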

Next we define a neural network below with a single hidden layer of 200 hidden units,

nrow, ncol = set_features["subtrain"].shape
class NNet(torch.nn.Module):
    def __init__(self, n_hidden=200):
        super(NNet, self).__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(ncol, n_hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(n_hidden, 1)
            )
    def forward(self, feature_mat):
        return self.seq(feature_mat)
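As a sanity check, not in the original post, an instance of this network should map a batch of feature rows to one real-valued score per row,

sanity_model = NNet()  # illustrative instance, separate from training below
sanity_model(set_features["subtrain"][:5]).shape  # expect torch.Size([5, 1])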

Below we define the AUM loss function and a helper function which we use to compute the loss during each gradient descent batch, and at the end of every epoch during the subtrain/validation loss computation,

def AUM(pred_tensor, label_tensor):
    """Area Under Min(FP,FN)

    Loss function for imbalanced binary classification
    problems. Minimizing AUM empirically results in maximizing Area
    Under the ROC Curve (AUC). Arguments: pred_tensor and label_tensor
    should both be 1d tensors (vectors of real-valued predictions and
    labels for each observation in the set/batch).

    """
    fn_diff = torch.where(label_tensor == 1, -1, 0)
    fp_diff = torch.where(label_tensor == 1, 0, 1)
    thresh_tensor = -pred_tensor.flatten()
    sorted_indices = torch.argsort(thresh_tensor)
    sorted_fp_cum = fp_diff[sorted_indices].cumsum(axis=0)
    sorted_fn_cum = -fn_diff[sorted_indices].flip(0).cumsum(axis=0).flip(0)
    sorted_thresh = thresh_tensor[sorted_indices]
    sorted_is_diff = sorted_thresh.diff() != 0
    sorted_fp_end = torch.cat([sorted_is_diff, torch.tensor([True])])
    sorted_fn_end = torch.cat([torch.tensor([True]), sorted_is_diff])
    uniq_thresh = sorted_thresh[sorted_fp_end]
    uniq_fp_after = sorted_fp_cum[sorted_fp_end]
    uniq_fn_before = sorted_fn_cum[sorted_fn_end]
    uniq_min = torch.minimum(uniq_fn_before[1:], uniq_fp_after[:-1])
    return torch.mean(uniq_min * uniq_thresh.diff())
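For intuition, here is a tiny worked example, not from the original post: predictions which rank the positive label above the negative one should attain a smaller AUM than the reverse ranking,

toy_labels = torch.tensor([1., 0.])
good_pred = torch.tensor([2.0, -2.0])  # positive scored above negative
bad_pred = torch.tensor([-2.0, 2.0])   # reverse ranking
AUM(good_pred, toy_labels), AUM(bad_pred, toy_labels)  # expect 0, then positive

Next, a helper function computes the model predictions and their AUM for a given feature matrix and label vector,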

def compute_loss_pred(features, labels):
    # model is the NNet instance, instantiated below before training begins.
    pred_mat = model(features)
    pred_vec = pred_mat.reshape(len(pred_mat))
    return AUM(pred_vec, labels), pred_vec

# note: in recent releases this metric class has moved to the
# torchmetrics package (torchmetrics.classification).
from pytorch_lightning.metrics.classification import AUROC
compute_auc = AUROC()

We additionally use AUROC to compute the Area Under the ROC Curve.
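As mentioned in the introduction, the gradients which torch computes for AUM can be checked against finite-difference directional derivatives; below is a minimal sketch of such a check on a toy batch (not the verification code used with the Acronis data). AUM is piecewise linear in the predictions, so away from ties the two quantities should agree closely,

toy_pred = torch.tensor([0.2, -0.5, 1.1, 0.3], requires_grad=True)
toy_label = torch.tensor([1., 0., 1., 0.])
AUM(toy_pred, toy_label).backward()  # gradient via automatic differentiation
direction = torch.tensor([1., -1., 0.5, 2.])
eps = 1e-4
with torch.no_grad():  # centered finite-difference directional derivative
    fd = (AUM(toy_pred + eps*direction, toy_label) -
          AUM(toy_pred - eps*direction, toy_label))/(2*eps)
torch.dot(toy_pred.grad, direction), fd  # these two values should be close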

Learning iterations

Next we use a for loop over epochs below to do the gradient descent learning. Note that the step size of the Adam optimizer was chosen to be small enough that the loss decreases smoothly (no abrupt jumps up), but large enough that the loss is minimized within a reasonable number of epochs.

loss_df_list = []
max_epochs=100
torch.manual_seed(1)
## <torch._C.Generator object at 0x0000000060DDF490>
model = NNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00004)
for epoch in range(max_epochs):
    # first update weights.
    for batch_features, batch_labels in subtrain_dataloader:
        optimizer.zero_grad()
        batch_loss, pred_vec = compute_loss_pred(batch_features, batch_labels)
        batch_loss.backward()
        optimizer.step()
    # then compute subtrain/validation loss.
    for set_name in set_features:
        feature_mat = set_features[set_name]
        label_vec = set_labels[set_name]
        set_loss, pred_vec = compute_loss_pred(feature_mat, label_vec)
        loss_df_list.append(pd.DataFrame({
            "epoch":[epoch],
            "set_name":[set_name],
            "variable":["AUM"],
            "value":set_loss.detach().numpy(),
            }))
        loss_df_list.append(pd.DataFrame({
            "epoch":[epoch],
            "set_name":[set_name],
            "variable":["AUC"],
            "value":compute_auc(pred_vec, label_vec).detach().numpy(),
            }))
loss_df = pd.concat(loss_df_list)

Finally we visualize the subtrain/validation loss and AUC using plotnine,

import plotnine as p9
gg = p9.ggplot()+\
    p9.facet_grid("variable ~ .", labeller="label_both", scales="free")+\
    p9.geom_line(
        p9.aes(
            x="epoch",
            y="value",
            color="set_name"
        ),
        data=loss_df)
show(gg, "AUM-AUC")  # show: plotting helper defined elsewhere on this blog

plot of AUM-AUC

The plot above shows that the algorithm is clearly learning: as the AUM decreases, the AUC increases.
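To complement the plot, we can also query loss_df for the epoch which maximized the validation AUC; a quick sketch, not in the original post,

valid_auc = loss_df[
    (loss_df.set_name == "validation") & (loss_df.variable == "AUC")]
valid_auc.sort_values("value").tail(1)  # epoch with largest validation AUC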