This post is similar to a previous post, but here we explain how to do the same computation on the cluster.

A number of students have asked me how to properly code cross-validation, in order to compare generalization / accuracy of different machine learning algorithms, when trained on different data sets. For example in one project, a student is trying to build an algorithm for satellite image segmentation. For each pixel in each image, there should be a predicted class, which represents the intensity of forest fire burn in that area of the image. The student has about 10000 rows (pixels with labels), where each row comes from one of 9 different images. I suggested the following cross-validation experiment to determine the extent to which prediction error increases when generalizing to new images.

Let’s say there are 1000 rows,

import pandas as pd
import numpy as np
import math

N = 1000

full_df = pd.DataFrame({
    "label":np.tile(["burned","no burn"],int(N/2)),
    "image":np.concatenate([
        np.repeat(1, 0.4*N), np.repeat(2, 0.4*N),
        np.repeat(3, 0.1*N), np.repeat(4, 0.1*N)
    ])
})

len_series = full_df.groupby("image").apply(len)

uniq_images = full_df.image.unique()

n_images = len(uniq_images)

full_df["signal"] = np.where(
    full_df.label=="no burn", 0, 1
)*full_df.image

np.random.seed(1)

for n_noise in range(n_images-1):
    full_df["feature_easy_%s"%n_noise] = np.random.randn(N) 

full_df["feature_easy_%s"%n_images] = np.random.randn(N)+full_df.signal 

for img in uniq_images:
    noise = np.random.randn(N)
    full_df["feature_impossible_%s"%img] = np.where(
        full_df.image == img, full_df.signal+noise, noise)

n_folds=3
def get_fold_ids(some_array):
    n_out = len(some_array)
    arange_vec = np.arange(n_folds)
    tile_vec = np.tile(arange_vec, math.ceil(n_out/n_folds))[:n_out]
    return pd.DataFrame({
        "fold":np.random.permutation(tile_vec),
        })

new_fold_df = full_df.groupby("image").apply(get_fold_ids)
full_df["fold"] = new_fold_df.reset_index().fold
pd.crosstab(full_df.fold, full_df.image)
## image    1    2   3   4
## fold                   
## 0      134  134  34  34
## 1      133  133  33  33
## 2      133  133  33  33
full_df.columns 
## Index(['label', 'image', 'signal', 'feature_easy_0', 'feature_easy_1',
##        'feature_easy_2', 'feature_easy_4', 'feature_impossible_1',
##        'feature_impossible_2', 'feature_impossible_3', 'feature_impossible_4',
##        'fold'],
##       dtype='object')

The output above shows that the fold IDs have been assigned such that, for each image, there is an equal amount of data across folds. It also shows the columns of the full data, with four “easy” feature columns and four “impossible” feature columns, which were simulated using this model:

  • In the easy features, there is one signal feature, with the same pattern across images (larger is more likely to be burned). There are three other features which are just noise.
  • In the impossible features, each image has a corresponding signal feature, so learning from one image does not help when making predictions on another image.

We would like to compute the test error, when predicting on any given image, and training on either that same image or other images. In the code below we create a DataFrame of parameters that we can run in parallel to do that computation,

extract_df = full_df.columns.str.extract("feature_(?P<type>[^_]+)")
feature_name_series = extract_df.value_counts().reset_index().type
params_dict = {
    "feature_name":feature_name_series,
    "test_fold":range(n_folds),
    "target_img":full_df.image.unique(),
}
params_df = pd.MultiIndex.from_product(
    params_dict.values(),
    names=params_dict.keys()
).to_frame().reset_index(drop=True)
params_df
##    feature_name  test_fold  target_img
## 0          easy          0           1
## 1          easy          0           2
## 2          easy          0           3
## 3          easy          0           4
## 4          easy          1           1
## 5          easy          1           2
## 6          easy          1           3
## 7          easy          1           4
## 8          easy          2           1
## 9          easy          2           2
## 10         easy          2           3
## 11         easy          2           4
## 12   impossible          0           1
## 13   impossible          0           2
## 14   impossible          0           3
## 15   impossible          0           4
## 16   impossible          1           1
## 17   impossible          1           2
## 18   impossible          1           3
## 19   impossible          1           4
## 20   impossible          2           1
## 21   impossible          2           2
## 22   impossible          2           3
## 23   impossible          2           4

The table above has three columns, for feature_name, test_fold, and target_img. Each row represents a particular set of test data and features that we can compute in parallel on the cluster. Each row can be used for the arguments of the function below, which computes the test error:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
def compute_one(feature_name, test_fold, target_img):
    is_feature = np.array([feature_name in col for col in full_df.columns])
    is_all_test = full_df.fold == test_fold
    is_dict = {"all_train":False,"all_test":True}
    df_dict = {
        set_name:full_df.loc[is_all_test==is_value]
        for set_name, is_value in is_dict.items()}
    is_same_dict = {
        set_name:set_df.image == target_img
        for set_name, set_df in df_dict.items()}
    df_dict["test"] = df_dict["all_test"].loc[ is_same_dict["all_test"], :]
    is_same_values_dict = {
        "all":[True,False],
        "same":[True],
        "other":[False],
        }
    train_name_list = []
    for data_name, is_same_values in is_same_values_dict.items():
        train_name_list.append(data_name)
        train_is_same = is_same_dict["all_train"].isin(is_same_values)
        df_dict[data_name] = df_dict["all_train"].loc[train_is_same,:]
    same_nrow, same_ncol = df_dict["same"].shape
    for data_name in "all", "other":
        small_name = data_name+"_small"
        train_name_list.append(small_name)
        train_df = df_dict[data_name]
        all_indices = np.arange(len(train_df))
        small_indices = np.random.permutation(all_indices)[:same_nrow]
        df_dict[small_name] = train_df.iloc[small_indices,:]
    xy_dict = {
        data_name:(df.loc[:,is_feature], df.label)
        for data_name, df in df_dict.items()}
    test_X, test_y = xy_dict["test"]
    error_df_list = []
    for train_name in train_name_list:
        train_X, train_y = xy_dict[train_name]
        param_dicts = [{'n_neighbors':[x]} for x in range(1, 21)]
        learner_dict = {
            "linear_model":LogisticRegressionCV(),
            "nearest_neighbors":GridSearchCV(
                KNeighborsClassifier(), param_dicts)
            }
        featureless_pred = train_y.value_counts().index[0]
        pred_dict = {
            "featureless":np.repeat(featureless_pred, len(test_y)),
        }
        for algorithm, learner in learner_dict.items():
            learner.fit(train_X, train_y)
            pred_dict[algorithm] = learner.predict(test_X)
        for algorithm, pred_vec in pred_dict.items():
            error_vec = pred_vec != test_y
            out_df=pd.DataFrame({
                "feature_name":[feature_name],
                "test_fold":[test_fold],
                "target_img":[target_img],
                "data_for_img":[len_series[target_img]],
                "test_nrow":[len(test_y)],
                "train_nrow":[len(train_y)],
                "train_name":[train_name],
                "algorithm":[algorithm],
                "error_percent":[error_vec.mean()*100],
                })
            error_df_list.append(out_df)
    return pd.concat(error_df_list)

one_param_num=0
one_param_series = params_df.iloc[one_param_num,:]
one_param_dict = dict(one_param_series)
compute_one(**one_param_dict)
##   feature_name  test_fold  ...          algorithm  error_percent
## 0         easy          0  ...        featureless      58.208955
## 0         easy          0  ...       linear_model      29.104478
## 0         easy          0  ...  nearest_neighbors      29.104478
## 0         easy          0  ...        featureless      58.208955
## 0         easy          0  ...       linear_model      33.582090
## 0         easy          0  ...  nearest_neighbors      44.776119
## 0         easy          0  ...        featureless      58.208955
## 0         easy          0  ...       linear_model      32.835821
## 0         easy          0  ...  nearest_neighbors      30.597015
## 0         easy          0  ...        featureless      58.208955
## 0         easy          0  ...       linear_model      26.865672
## 0         easy          0  ...  nearest_neighbors      29.104478
## 0         easy          0  ...        featureless      41.791045
## 0         easy          0  ...       linear_model      32.089552
## 0         easy          0  ...  nearest_neighbors      30.597015
## 
## [15 rows x 9 columns]

To run this experiment on the monsoon cluster, as previously described, we need to create three python scripts.

  • params.py creates parameter combination CSV files, which is param_df above.
  • run_one.py computes the result for one parameter combination, which is the compute_one function above.
  • analyze.py combines the results for each parameter combination into a single result for analysis/plotting.

Params script

The params.py script contents are below. It creates params.csv and full.csv in both the original/current directory, and in the scratch sub-directory for this job.

from datetime import datetime
import pandas as pd
import numpy as np
import os
import shutil
import math
N = 1000
full_df = pd.DataFrame({
    "label":np.tile(["burned","no burn"],int(N/2)),
    "image":np.concatenate([
        np.repeat(1, 0.4*N), np.repeat(2, 0.4*N),
        np.repeat(3, 0.1*N), np.repeat(4, 0.1*N)
    ])
})
uniq_images = full_df.image.unique()
n_images = len(uniq_images)
full_df["signal"] = np.where(
    full_df.label=="no burn", 0, 1
)*full_df.image
np.random.seed(1)
for n_noise in range(n_images-1):
    full_df["feature_easy_%s"%n_noise] = np.random.randn(N) 
full_df["feature_easy_%s"%n_images] = np.random.randn(N)+full_df.signal 
for img in uniq_images:
    noise = np.random.randn(N)
    full_df["feature_impossible_%s"%img] = np.where(
        full_df.image == img, full_df.signal+noise, noise)
n_folds=3
def get_fold_ids(some_array):
    n_out = len(some_array)
    arange_vec = np.arange(n_folds)
    tile_vec = np.tile(arange_vec, math.ceil(n_out/n_folds))[:n_out]
    return pd.DataFrame({
        "fold":np.random.permutation(tile_vec),
        })
new_fold_df = full_df.groupby("image").apply(get_fold_ids)
full_df["fold"] = new_fold_df.reset_index().fold
extract_df = full_df.columns.str.extract("feature_(?P<type>[^_]+)")
feature_name_series = extract_df.value_counts().reset_index().type
params_dict = {
    "feature_name":feature_name_series,
    "test_fold":range(n_folds),
    "target_img":full_df.image.unique(),
}
params_df = pd.MultiIndex.from_product(
    params_dict.values(),
    names=params_dict.keys()
).to_frame().reset_index(drop=True)
n_tasks, ncol = params_df.shape
job_name = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
job_name = "gabi_demo"
job_dir = "/scratch/th798/"+job_name
results_dir = os.path.join(job_dir, "results")
os.system("mkdir -p "+results_dir)
params_csv = os.path.join(job_dir, "params.csv")
params_df.to_csv(params_csv,index=False)
full_csv = os.path.join(job_dir, "full.csv")
full_df.to_csv(full_csv,index=False)
run_one_contents = f"""#!/bin/bash
#SBATCH --array=0-{n_tasks-1}
#SBATCH --time=24:00:00
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=1
#SBATCH --output={job_dir}/slurm-%A_%a.out
#SBATCH --error={job_dir}/slurm-%A_%a.out
#SBATCH --job-name={job_name}
cd {job_dir}
python run_one.py $SLURM_ARRAY_TASK_ID
"""
run_one_sh = os.path.join(job_dir, "run_one.sh")
with open(run_one_sh, "w") as run_one_f:
    run_one_f.write(run_one_contents)
run_one_py = os.path.join(job_dir, "run_one.py")
run_orig_py = "run_one.py"
shutil.copyfile(run_orig_py, run_one_py)
orig_dir = os.path.dirname(run_orig_py)
orig_results = os.path.join(orig_dir, "results")
os.system("mkdir -p "+orig_results)
orig_csv = os.path.join(orig_dir, "params.csv")
params_df.to_csv(orig_csv,index=False)
orig_full_csv = os.path.join(orig_dir, "full.csv")
full_df.to_csv(orig_full_csv,index=False)
msg=f"""created params CSV files and job scripts, test with
python {run_orig_py} 0
SLURM_ARRAY_TASK_ID=0 bash {run_one_sh}"""
print(msg)

Run one script

The run_one.py script below computes the test error for one parameter combination, and saves the result in a CSV file under the results sub-directory.

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import sys
full_df = pd.read_csv("full.csv")
len_series = full_df.groupby("image").apply(len)
params_df = pd.read_csv("params.csv")
if len(sys.argv)==2:
    prog_name, task_str = sys.argv
    one_param_num = int(task_str)
else:
    print("len(sys.argv)=%d so trying first param"%len(sys.argv))
    one_param_num = 0
def compute_one(feature_name, test_fold, target_img):
    is_feature = np.array([feature_name in col for col in full_df.columns])
    is_all_test = full_df.fold == test_fold
    is_dict = {"all_train":False,"all_test":True}
    df_dict = {
        set_name:full_df.loc[is_all_test==is_value]
        for set_name, is_value in is_dict.items()}
    is_same_dict = {
        set_name:set_df.image == target_img
        for set_name, set_df in df_dict.items()}
    df_dict["test"] = df_dict["all_test"].loc[ is_same_dict["all_test"], :]
    is_same_values_dict = {
        "all":[True,False],
        "same":[True],
        "other":[False],
        }
    train_name_list = []
    for data_name, is_same_values in is_same_values_dict.items():
        train_name_list.append(data_name)
        train_is_same = is_same_dict["all_train"].isin(is_same_values)
        df_dict[data_name] = df_dict["all_train"].loc[train_is_same,:]
    same_nrow, same_ncol = df_dict["same"].shape
    for data_name in "all", "other":
        small_name = data_name+"_small"
        train_name_list.append(small_name)
        train_df = df_dict[data_name]
        all_indices = np.arange(len(train_df))
        small_indices = np.random.permutation(all_indices)[:same_nrow]
        df_dict[small_name] = train_df.iloc[small_indices,:]
    xy_dict = {
        data_name:(df.loc[:,is_feature], df.label)
        for data_name, df in df_dict.items()}
    test_X, test_y = xy_dict["test"]
    error_df_list = []
    for train_name in train_name_list:
        train_X, train_y = xy_dict[train_name]
        param_dicts = [{'n_neighbors':[x]} for x in range(1, 21)]
        learner_dict = {
            "linear_model":LogisticRegressionCV(),
            "nearest_neighbors":GridSearchCV(
                KNeighborsClassifier(), param_dicts)
            }
        featureless_pred = train_y.value_counts().index[0]
        pred_dict = {
            "featureless":np.repeat(featureless_pred, len(test_y)),
        }
        for algorithm, learner in learner_dict.items():
            learner.fit(train_X, train_y)
            pred_dict[algorithm] = learner.predict(test_X)
        for algorithm, pred_vec in pred_dict.items():
            error_vec = pred_vec != test_y
            out_df=pd.DataFrame({
                "feature_name":[feature_name],
                "test_fold":[test_fold],
                "target_img":[target_img],
                "data_for_img":[len_series[target_img]],
                "test_nrow":[len(test_y)],
                "train_nrow":[len(train_y)],
                "train_name":[train_name],
                "algorithm":[algorithm],
                "error_percent":[error_vec.mean()*100],
                })
            error_df_list.append(out_df)
    return pd.concat(error_df_list)
one_param_series = params_df.iloc[one_param_num,:]
one_param_dict = dict(one_param_series)
error_df = compute_one(**one_param_dict)
out_file = f"results/{one_param_num}.csv"
error_df.to_csv(out_file,index=False)

Running the scripts on the cluster

I put these scripts on the cluster and first ran the params script on a compute node,

(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python params.py
created params CSV files and job scripts, test with
python run_one.py 0
SLURM_ARRAY_TASK_ID=0 bash /scratch/th798/gabi_demo/run_one.sh
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ ls
full.csv  params.csv  params.py  params.py~  results  run_one.py  run_one.py~
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ head *.csv
==> full.csv <==
label,image,signal,feature_easy_0,feature_easy_1,feature_easy_2,feature_easy_4,feature_impossible_1,feature_impossible_2,feature_impossible_3,feature_impossible_4,fold
burned,1,1,1.6243453636632417,-0.15323616176709168,0.4895166181417871,0.9228930431325937,0.8596290051283277,-0.9247552836219404,-2.8510695426019397,0.9200785993027285,0
no burn,1,0,-0.6117564136500754,-2.432508512647113,0.2387958575229506,0.2078252673129942,0.1416416728308121,1.1288898954126778,0.8231234224707914,0.018099210462107714,2
burned,1,1,-0.5281717522634557,0.5079843366153792,-0.4481118060935155,1.9861959282732953,1.311968612695531,-1.1287912685853583,-0.27474362331135227,1.4758691196562372,0
no burn,1,0,-1.0729686221561705,-0.3240323290200066,-0.6107950028484128,1.432756426622143,0.7690851809940938,-0.7247376220830969,-0.6881014060987432,0.3731985591697631,1
burned,1,1,0.8654076293246785,-1.5110766079102027,-2.0299450691852883,1.528258499515287,1.5842857679634836,0.6235712087829979,1.120080836099801,1.2951831782590502,0
no burn,1,0,-2.3015386968802827,-0.8714220655949576,0.607946586089135,-0.3677319748239843,1.7885926519290443,1.4379664640750502,0.2520928412415682,0.3277513475414965,0
burned,1,1,1.74481176421648,-0.8648299414655872,-0.35410888355969,1.691720859369497,0.975368956749922,0.2741418428159658,0.7824721743051218,-0.08139687871858095,2
no burn,1,0,-0.7612069008951028,0.6087490823417022,0.15258149012149536,-0.798346748264865,1.4902897985704333,-0.5044690146598694,-0.6030809780235539,0.1343882401809338,2
burned,1,1,0.31903909605709857,0.5616380965234054,0.5012748490408392,1.2138177272479647,0.6789228521496135,1.1066709960295764,1.248752273820923,1.7385846511025012,2

==> params.csv <==
feature_name,test_fold,target_img
easy,0,1
easy,0,2
easy,0,3
easy,0,4
easy,1,1
easy,1,2
easy,1,3
easy,1,4
easy,2,1

The output above indicates that the params script successfully created the full.csv and params.csv files. Now we test that the computation works for one parameter combination,

(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python run_one.py 0
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python run_one.py 5
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ head results/*.csv
==> results/0.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
easy,0,1,400,134,664,all,featureless,58.2089552238806
easy,0,1,400,134,664,all,linear_model,29.1044776119403
easy,0,1,400,134,664,all,nearest_neighbors,29.1044776119403
easy,0,1,400,134,266,same,featureless,58.2089552238806
easy,0,1,400,134,266,same,linear_model,33.582089552238806
easy,0,1,400,134,266,same,nearest_neighbors,44.776119402985074
easy,0,1,400,134,398,other,featureless,58.2089552238806
easy,0,1,400,134,398,other,linear_model,32.83582089552239
easy,0,1,400,134,398,other,nearest_neighbors,30.597014925373134

==> results/5.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
easy,1,2,400,133,668,all,featureless,55.639097744360896
easy,1,2,400,133,668,all,linear_model,9.022556390977442
easy,1,2,400,133,668,all,nearest_neighbors,17.293233082706767
easy,1,2,400,133,267,same,featureless,55.639097744360896
easy,1,2,400,133,267,same,linear_model,8.270676691729323
easy,1,2,400,133,267,same,nearest_neighbors,19.548872180451127
easy,1,2,400,133,401,other,featureless,55.639097744360896
easy,1,2,400,133,401,other,linear_model,10.526315789473683
easy,1,2,400,133,401,other,nearest_neighbors,16.541353383458645

The output above indicates that the computation worked for subsets 0 and 5. Now we try a similar computation under scratch,

(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ SLURM_ARRAY_TASK_ID=1 bash /scratch/th798/gabi_demo/run_one.sh
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ SLURM_ARRAY_TASK_ID=20 bash /scratch/th798/gabi_demo/run_one.sh
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ head /scratch/th798/gabi_demo/results/*.csv
==> /scratch/th798/gabi_demo/results/1.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
easy,0,2,400,134,664,all,featureless,52.98507462686567
easy,0,2,400,134,664,all,linear_model,17.91044776119403
easy,0,2,400,134,664,all,nearest_neighbors,19.402985074626866
easy,0,2,400,134,266,same,featureless,52.98507462686567
easy,0,2,400,134,266,same,linear_model,18.65671641791045
easy,0,2,400,134,266,same,nearest_neighbors,18.65671641791045
easy,0,2,400,134,398,other,featureless,52.98507462686567
easy,0,2,400,134,398,other,linear_model,18.65671641791045
easy,0,2,400,134,398,other,nearest_neighbors,23.134328358208954

==> /scratch/th798/gabi_demo/results/20.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
impossible,2,1,400,133,668,all,featureless,43.609022556390975
impossible,2,1,400,133,668,all,linear_model,42.857142857142854
impossible,2,1,400,133,668,all,nearest_neighbors,48.1203007518797
impossible,2,1,400,133,267,same,featureless,56.390977443609025
impossible,2,1,400,133,267,same,linear_model,39.849624060150376
impossible,2,1,400,133,267,same,nearest_neighbors,42.10526315789473
impossible,2,1,400,133,401,other,featureless,43.609022556390975
impossible,2,1,400,133,401,other,linear_model,51.127819548872175
impossible,2,1,400,133,401,other,nearest_neighbors,51.8796992481203

The output above indicates the result was computed successfully. Next we send the job to the compute cluster,

(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ sbatch /scratch/th798/gabi_demo/run_one.sh
Submitted batch job 55610497
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ squeue -j55610497
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   55610497_[0-23]      core gabi_dem    th798 PD       0:00      1 (Priority)
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ squeue -j55610497
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       55610497_23      core gabi_dem    th798  R       0:01      1 cn95
       55610497_22      core gabi_dem    th798  R       0:01      1 cn100
       55610497_21      core gabi_dem    th798  R       0:01      1 cn101
       55610497_20      core gabi_dem    th798  R       0:01      1 cn101
       55610497_19      core gabi_dem    th798  R       0:01      1 cn101
       55610497_18      core gabi_dem    th798  R       0:01      1 cn102
       55610497_17      core gabi_dem    th798  R       0:01      1 cn102
       55610497_16      core gabi_dem    th798  R       0:01      1 cn102
       55610497_15      core gabi_dem    th798  R       0:01      1 cn103
       55610497_14      core gabi_dem    th798  R       0:01      1 cn103
       55610497_13      core gabi_dem    th798  R       0:01      1 cn103
       55610497_12      core gabi_dem    th798  R       0:01      1 cn104
       55610497_11      core gabi_dem    th798  R       0:01      1 cn104
       55610497_10      core gabi_dem    th798  R       0:01      1 cn104
        55610497_9      core gabi_dem    th798  R       0:01      1 cn105
        55610497_8      core gabi_dem    th798  R       0:01      1 cn105
        55610497_7      core gabi_dem    th798  R       0:01      1 cn105
        55610497_6      core gabi_dem    th798  R       0:01      1 cn6
        55610497_5      core gabi_dem    th798  R       0:01      1 cn6
        55610497_4      core gabi_dem    th798  R       0:01      1 cn6
        55610497_3      core gabi_dem    th798  R       0:01      1 cn8
        55610497_2      core gabi_dem    th798  R       0:01      1 cn20
        55610497_1      core gabi_dem    th798  R       0:01      1 cn55
        55610497_0      core gabi_dem    th798  R       0:01      1 cn55
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ squeue -j55610497
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ wc -l /scratch/th798/gabi_demo/results/*.csv
   16 /scratch/th798/gabi_demo/results/0.csv
   16 /scratch/th798/gabi_demo/results/10.csv
   16 /scratch/th798/gabi_demo/results/11.csv
   16 /scratch/th798/gabi_demo/results/12.csv
   16 /scratch/th798/gabi_demo/results/13.csv
   16 /scratch/th798/gabi_demo/results/14.csv
   16 /scratch/th798/gabi_demo/results/15.csv
   16 /scratch/th798/gabi_demo/results/16.csv
   16 /scratch/th798/gabi_demo/results/17.csv
   16 /scratch/th798/gabi_demo/results/18.csv
   16 /scratch/th798/gabi_demo/results/19.csv
   16 /scratch/th798/gabi_demo/results/1.csv
   16 /scratch/th798/gabi_demo/results/20.csv
   16 /scratch/th798/gabi_demo/results/21.csv
   16 /scratch/th798/gabi_demo/results/22.csv
   16 /scratch/th798/gabi_demo/results/23.csv
   16 /scratch/th798/gabi_demo/results/2.csv
   16 /scratch/th798/gabi_demo/results/3.csv
   16 /scratch/th798/gabi_demo/results/4.csv
   16 /scratch/th798/gabi_demo/results/5.csv
   16 /scratch/th798/gabi_demo/results/6.csv
   16 /scratch/th798/gabi_demo/results/7.csv
   16 /scratch/th798/gabi_demo/results/8.csv
   16 /scratch/th798/gabi_demo/results/9.csv
  384 total

The output above indicates the jobs were running and then finished successfully. We then copy the results to the project directory,

(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ rsync -rvz /scratch/th798/gabi_demo/ ./
sending incremental file list
run_one.py
run_one.sh
slurm-55610497_0.out
slurm-55610497_1.out
slurm-55610497_10.out
slurm-55610497_11.out
slurm-55610497_12.out
slurm-55610497_13.out
slurm-55610497_14.out
slurm-55610497_15.out
slurm-55610497_16.out
slurm-55610497_17.out
slurm-55610497_18.out
slurm-55610497_19.out
slurm-55610497_2.out
slurm-55610497_20.out
slurm-55610497_21.out
slurm-55610497_22.out
slurm-55610497_23.out
slurm-55610497_3.out
slurm-55610497_4.out
slurm-55610497_5.out
slurm-55610497_6.out
slurm-55610497_7.out
slurm-55610497_8.out
slurm-55610497_9.out
results/0.csv
results/1.csv
results/10.csv
results/11.csv
results/12.csv
results/13.csv
results/14.csv
results/15.csv
results/16.csv
results/17.csv
results/18.csv
results/19.csv
results/2.csv
results/20.csv
results/21.csv
results/22.csv
results/23.csv
results/3.csv
results/4.csv
results/5.csv
results/6.csv
results/7.csv
results/8.csv
results/9.csv

sent 13,167 bytes  received 975 bytes  9,428.00 bytes/sec
total size is 202,392  speedup is 14.31

We combine the results with the analyze.py script below,

import pandas as pd
import numpy as np
from glob import glob
out_df_list = []
for out_csv in glob("results/*.csv"):
    out_df_list.append(pd.read_csv(out_csv))
error_df = pd.concat(out_df_list)
error_df.to_csv("results.csv")
print(error_df)
import plotnine as p9
train_name_order=["same", "all", "all_small", "other", "other_small"]
def gg_error(feature_name):
    feature_df = error_df.query("feature_name == '%s'"%feature_name).copy()
    featureless = feature_df.query("algorithm=='featureless'")
    grouped = featureless.groupby(
        ['target_img', 'data_for_img']
    )["error_percent"]
    min_max_df = grouped.apply(lambda df: pd.DataFrame({
        "xmin":[df.min()],
        "xmax":[df.max()],
        "ymin":[-np.inf],
        "ymax":[np.inf],
    })).reset_index()
    feature_df["Algorithm"] = "\n"+feature_df.algorithm
    feature_df["Train data"] = pd.Categorical(
        feature_df.train_name, train_name_order)
    gg = p9.ggplot()+\
        p9.theme_bw()+\
        p9.theme(panel_spacing=0.1)+\
        p9.theme(figure_size=(10,5))+\
        p9.geom_rect(
            p9.aes(
                xmin="xmin",
                xmax="xmax",
                ymin="ymin",
                ymax="ymax"
                ),
            fill="grey",
            data=min_max_df)+\
        p9.geom_point(
            p9.aes(
                x="error_percent",
                y="Train data"
            ),
        data=feature_df)+\
        p9.scale_x_continuous(limits=[0,62])+\
        p9.facet_grid("Algorithm ~ target_img + data_for_img", labeller="label_both")+\
        p9.ggtitle("Features: "+feature_name)
    gg.save(feature_name+".png")
gg_error("easy")
gg_error("impossible")

Running that script yields

(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python analyze.py
   feature_name  test_fold  ...          algorithm  error_percent
0    impossible          0  ...        featureless      58.208955
1    impossible          0  ...       linear_model      38.059701
2    impossible          0  ...  nearest_neighbors      44.776119
3    impossible          0  ...        featureless      58.208955
4    impossible          0  ...       linear_model      34.328358
..          ...        ...  ...                ...            ...
10   impossible          2  ...       linear_model      36.363636
11   impossible          2  ...  nearest_neighbors      36.363636
12   impossible          2  ...        featureless      42.424242
13   impossible          2  ...       linear_model      45.454545
14   impossible          2  ...  nearest_neighbors      45.454545

[360 rows x 9 columns]
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:721: PlotnineWarning: Saving 10 x 5 in image.
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:722: PlotnineWarning: Filename: easy.png
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:721: PlotnineWarning: Saving 10 x 5 in image.
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:722: PlotnineWarning: Filename: impossible.png
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ ls *.png
easy.png  impossible.png
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ cd ..
(emacs1) th798@cn29:~/genomic-ml$ publish_data gabi_demo/
gabi_demo/ has been published
Any published files are now accessible at https://rcdata.nau.edu/th798

Actually, the files are published at https://rcdata.nau.edu/genomic-ml/gabi_demo/ including:

  • results.csv combined results CSV file, from all parallel jobs.
  • easy.png plot of error rates when learning using easy features.
  • impossible.png plot of error rates when learning using impossible features.
  • other files include slurm out/log files, individual result CSV files, and the input python script and CSV data files.

In conclusion, we have shown how to use the cluster to compute a machine learning cross-validation experiment in parallel. In this example there was a relatively small data set, and only 24 parameter combinations that were run in parallel on the cluster nodes, so there are not a lot of time savings for doing this in parallel on the cluster. For larger machine learning experiments, involving more different data sets and algorithms, which results in hundreds or thousands of parameter combinations, this parallel computing setup can save a lot of time.