Cross-validation experiments on the cluster
This post is similar to a previous post, but here we explain how to do the same computation on the cluster.
A number of students have asked me how to properly code cross-validation, in order to compare the generalization/accuracy of different machine learning algorithms trained on different data sets. For example, in one project a student is trying to build an algorithm for satellite image segmentation. For each pixel in each image, there should be a predicted class, which represents the intensity of forest fire burn in that area of the image. The student has about 10,000 rows (pixels with labels), where each row comes from one of 9 different images. I suggested the following cross-validation experiment to determine the extent to which prediction error increases when generalizing to new images.
Let’s say there are 1000 rows,
import pandas as pd
import numpy as np
import math
N = 1000
full_df = pd.DataFrame({
    "label":np.tile(["burned","no burn"],int(N/2)),
    "image":np.concatenate([
        np.repeat(1, 0.4*N), np.repeat(2, 0.4*N),
        np.repeat(3, 0.1*N), np.repeat(4, 0.1*N)
    ])
})
len_series = full_df.groupby("image").apply(len)
uniq_images = full_df.image.unique()
n_images = len(uniq_images)
# signal is 0 for no burn, and the image number for burned.
full_df["signal"] = np.where(
    full_df.label=="no burn", 0, 1
)*full_df.image
np.random.seed(1)
# pure noise features, unrelated to the label.
for n_noise in range(n_images-1):
    full_df["feature_easy_%s"%n_noise] = np.random.randn(N)
# easy signal feature, related to the label in every image.
full_df["feature_easy_%s"%n_images] = np.random.randn(N)+full_df.signal
# impossible features, each related to the label in only one image.
for img in uniq_images:
    noise = np.random.randn(N)
    full_df["feature_impossible_%s"%img] = np.where(
        full_df.image == img, full_df.signal+noise, noise)
n_folds=3
def get_fold_ids(some_array):
    n_out = len(some_array)
    arange_vec = np.arange(n_folds)
    tile_vec = np.tile(arange_vec, math.ceil(n_out/n_folds))[:n_out]
    # random fold IDs, balanced within each group.
    return pd.DataFrame({
        "fold":np.random.permutation(tile_vec),
    })
new_fold_df = full_df.groupby("image").apply(get_fold_ids)
full_df["fold"] = new_fold_df.reset_index().fold
pd.crosstab(full_df.fold, full_df.image)
pd.crosstab(full_df.fold, full_df.image)
## image 1 2 3 4
## fold
## 0 134 134 34 34
## 1 133 133 33 33
## 2 133 133 33 33
full_df.columns
## Index(['label', 'image', 'signal', 'feature_easy_0', 'feature_easy_1',
## 'feature_easy_2', 'feature_easy_4', 'feature_impossible_1',
## 'feature_impossible_2', 'feature_impossible_3', 'feature_impossible_4',
## 'fold'],
## dtype='object')
The output above shows that the fold IDs have been assigned such that, for each image, there is an approximately equal amount of data in each fold. It also shows the columns of the full data, with four “easy” feature columns and four “impossible” feature columns, which were simulated using this model (see the quick check after this list):
- In the easy features, there is one signal feature, with the same pattern across images (larger values mean burned is more likely). The other three features are just noise.
- In the impossible features, each image has its own signal feature, so learning from one image does not help when making predictions on another image.
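To sanity-check the simulation, we can look at group means of one easy and one impossible feature. This check is my addition, not part of the original analysis: the easy signal feature should separate the classes in every image, whereas feature_impossible_1 should only separate them in image 1,
# Sanity check of the simulated features (not part of the original scripts).
# feature_easy_4 should differ between labels in every image;
# feature_impossible_1 should differ between labels only in image 1.
print(full_df.groupby(["image","label"])[
    ["feature_easy_4","feature_impossible_1"]
].mean())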
We would like to compute the test error when predicting on any given image, after training on either that same image or on other images. In the code below we create a DataFrame of parameters, one row per computation that we can run in parallel,
extract_df = full_df.columns.str.extract("feature_(?P<type>[^_]+)")
feature_name_series = extract_df.value_counts().reset_index().type
params_dict = {
    "feature_name":feature_name_series,
    "test_fold":range(n_folds),
    "target_img":full_df.image.unique(),
}
# one row per combination of feature type, test fold, and target image.
params_df = pd.MultiIndex.from_product(
    params_dict.values(),
    names=params_dict.keys()
).to_frame().reset_index(drop=True)
params_df
## feature_name test_fold target_img
## 0 easy 0 1
## 1 easy 0 2
## 2 easy 0 3
## 3 easy 0 4
## 4 easy 1 1
## 5 easy 1 2
## 6 easy 1 3
## 7 easy 1 4
## 8 easy 2 1
## 9 easy 2 2
## 10 easy 2 3
## 11 easy 2 4
## 12 impossible 0 1
## 13 impossible 0 2
## 14 impossible 0 3
## 15 impossible 0 4
## 16 impossible 1 1
## 17 impossible 1 2
## 18 impossible 1 3
## 19 impossible 1 4
## 20 impossible 2 1
## 21 impossible 2 2
## 22 impossible 2 3
## 23 impossible 2 4
The table above has three columns: feature_name, test_fold, and target_img. Each row represents a particular set of test data and features that we can compute in parallel on the cluster. Each row can be used as the arguments of the function below, which computes the test error:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
def compute_one(feature_name, test_fold, target_img):
    # columns to use as features (easy or impossible).
    is_feature = np.array([feature_name in col for col in full_df.columns])
    is_all_test = full_df.fold == test_fold
    is_dict = {"all_train":False,"all_test":True}
    df_dict = {
        set_name:full_df.loc[is_all_test==is_value]
        for set_name, is_value in is_dict.items()}
    is_same_dict = {
        set_name:set_df.image == target_img
        for set_name, set_df in df_dict.items()}
    df_dict["test"] = df_dict["all_test"].loc[is_same_dict["all_test"], :]
    # train on all images, only the same image, or only other images.
    is_same_values_dict = {
        "all":[True,False],
        "same":[True],
        "other":[False],
    }
    train_name_list = []
    for data_name, is_same_values in is_same_values_dict.items():
        train_name_list.append(data_name)
        train_is_same = is_same_dict["all_train"].isin(is_same_values)
        df_dict[data_name] = df_dict["all_train"].loc[train_is_same,:]
    same_nrow, same_ncol = df_dict["same"].shape
    # down-sample all/other to the same number of rows as same.
    for data_name in "all", "other":
        small_name = data_name+"_small"
        train_name_list.append(small_name)
        train_df = df_dict[data_name]
        all_indices = np.arange(len(train_df))
        small_indices = np.random.permutation(all_indices)[:same_nrow]
        df_dict[small_name] = train_df.iloc[small_indices,:]
    xy_dict = {
        data_name:(df.loc[:,is_feature], df.label)
        for data_name, df in df_dict.items()}
    test_X, test_y = xy_dict["test"]
    error_df_list = []
    for train_name in train_name_list:
        train_X, train_y = xy_dict[train_name]
        param_dicts = [{'n_neighbors':[x]} for x in range(1, 21)]
        learner_dict = {
            "linear_model":LogisticRegressionCV(),
            "nearest_neighbors":GridSearchCV(
                KNeighborsClassifier(), param_dicts)
        }
        # featureless baseline: always predict the most frequent label.
        featureless_pred = train_y.value_counts().index[0]
        pred_dict = {
            "featureless":np.repeat(featureless_pred, len(test_y)),
        }
        for algorithm, learner in learner_dict.items():
            learner.fit(train_X, train_y)
            pred_dict[algorithm] = learner.predict(test_X)
        for algorithm, pred_vec in pred_dict.items():
            error_vec = pred_vec != test_y
            out_df=pd.DataFrame({
                "feature_name":[feature_name],
                "test_fold":[test_fold],
                "target_img":[target_img],
                "data_for_img":[len_series[target_img]],
                "test_nrow":[len(test_y)],
                "train_nrow":[len(train_y)],
                "train_name":[train_name],
                "algorithm":[algorithm],
                "error_percent":[error_vec.mean()*100],
            })
            error_df_list.append(out_df)
    return pd.concat(error_df_list)
one_param_num=0
one_param_series = params_df.iloc[one_param_num,:]
one_param_dict = dict(one_param_series)
compute_one(**one_param_dict)
## feature_name test_fold ... algorithm error_percent
## 0 easy 0 ... featureless 58.208955
## 0 easy 0 ... linear_model 29.104478
## 0 easy 0 ... nearest_neighbors 29.104478
## 0 easy 0 ... featureless 58.208955
## 0 easy 0 ... linear_model 33.582090
## 0 easy 0 ... nearest_neighbors 44.776119
## 0 easy 0 ... featureless 58.208955
## 0 easy 0 ... linear_model 32.835821
## 0 easy 0 ... nearest_neighbors 30.597015
## 0 easy 0 ... featureless 58.208955
## 0 easy 0 ... linear_model 26.865672
## 0 easy 0 ... nearest_neighbors 29.104478
## 0 easy 0 ... featureless 41.791045
## 0 easy 0 ... linear_model 32.089552
## 0 easy 0 ... nearest_neighbors 30.597015
##
## [15 rows x 9 columns]
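As a side note, before involving the cluster at all, the same grid of computations could be run locally in parallel. Below is a minimal sketch using concurrent.futures; it is my addition, not part of the original workflow, and assumes a POSIX system where worker processes inherit full_df, params_df, and compute_one from the parent process,
from concurrent.futures import ProcessPoolExecutor

def compute_row(row_num):
    # Hypothetical helper: run one row of params_df, like one cluster task.
    return compute_one(**dict(params_df.iloc[row_num,:]))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        local_results = list(pool.map(compute_row, range(len(params_df))))
    local_error_df = pd.concat(local_results)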
To run this experiment on the Monsoon cluster, as previously described, we need to create three Python scripts:
- params.py creates the parameter combination CSV files (params_df above).
- run_one.py computes the result for one parameter combination (the compute_one function above).
- analyze.py combines the results from all parameter combinations into a single result for analysis/plotting.
Params script
The params.py script contents are below. It creates params.csv and full.csv in both the original/current directory, and in the scratch sub-directory for this job.
from datetime import datetime
import pandas as pd
import numpy as np
import os
import shutil
import math
N = 1000
full_df = pd.DataFrame({
    "label":np.tile(["burned","no burn"],int(N/2)),
    "image":np.concatenate([
        np.repeat(1, 0.4*N), np.repeat(2, 0.4*N),
        np.repeat(3, 0.1*N), np.repeat(4, 0.1*N)
    ])
})
uniq_images = full_df.image.unique()
n_images = len(uniq_images)
full_df["signal"] = np.where(
    full_df.label=="no burn", 0, 1
)*full_df.image
np.random.seed(1)
for n_noise in range(n_images-1):
    full_df["feature_easy_%s"%n_noise] = np.random.randn(N)
full_df["feature_easy_%s"%n_images] = np.random.randn(N)+full_df.signal
for img in uniq_images:
    noise = np.random.randn(N)
    full_df["feature_impossible_%s"%img] = np.where(
        full_df.image == img, full_df.signal+noise, noise)
n_folds=3
def get_fold_ids(some_array):
    n_out = len(some_array)
    arange_vec = np.arange(n_folds)
    tile_vec = np.tile(arange_vec, math.ceil(n_out/n_folds))[:n_out]
    return pd.DataFrame({
        "fold":np.random.permutation(tile_vec),
    })
new_fold_df = full_df.groupby("image").apply(get_fold_ids)
full_df["fold"] = new_fold_df.reset_index().fold
extract_df = full_df.columns.str.extract("feature_(?P<type>[^_]+)")
feature_name_series = extract_df.value_counts().reset_index().type
params_dict = {
    "feature_name":feature_name_series,
    "test_fold":range(n_folds),
    "target_img":full_df.image.unique(),
}
params_df = pd.MultiIndex.from_product(
    params_dict.values(),
    names=params_dict.keys()
).to_frame().reset_index(drop=True)
n_tasks, ncol = params_df.shape
job_name = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
job_name = "gabi_demo"  # fixed name overrides the timestamp above.
job_dir = "/scratch/th798/"+job_name
results_dir = os.path.join(job_dir, "results")
os.system("mkdir -p "+results_dir)
params_csv = os.path.join(job_dir, "params.csv")
params_df.to_csv(params_csv,index=False)
full_csv = os.path.join(job_dir, "full.csv")
full_df.to_csv(full_csv,index=False)
# SLURM job array script: one task per row of params_df.
run_one_contents = f"""#!/bin/bash
#SBATCH --array=0-{n_tasks-1}
#SBATCH --time=24:00:00
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=1
#SBATCH --output={job_dir}/slurm-%A_%a.out
#SBATCH --error={job_dir}/slurm-%A_%a.out
#SBATCH --job-name={job_name}
cd {job_dir}
python run_one.py $SLURM_ARRAY_TASK_ID
"""
run_one_sh = os.path.join(job_dir, "run_one.sh")
with open(run_one_sh, "w") as run_one_f:
    run_one_f.write(run_one_contents)
run_one_py = os.path.join(job_dir, "run_one.py")
run_orig_py = "run_one.py"
shutil.copyfile(run_orig_py, run_one_py)
orig_dir = os.path.dirname(run_orig_py)
orig_results = os.path.join(orig_dir, "results")
os.system("mkdir -p "+orig_results)
orig_csv = os.path.join(orig_dir, "params.csv")
params_df.to_csv(orig_csv,index=False)
orig_full_csv = os.path.join(orig_dir, "full.csv")
full_df.to_csv(orig_full_csv,index=False)
msg=f"""created params CSV files and job scripts, test with
python {run_orig_py} 0
SLURM_ARRAY_TASK_ID=0 bash {run_one_sh}"""
print(msg)
Run one script
The run_one.py script below computes the test error for one parameter combination, and saves the result in a CSV file under the results sub-directory.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import sys
full_df = pd.read_csv("full.csv")
len_series = full_df.groupby("image").apply(len)
params_df = pd.read_csv("params.csv")
# take the task ID from the command line, defaulting to 0.
if len(sys.argv)==2:
    prog_name, task_str = sys.argv
    one_param_num = int(task_str)
else:
    print("len(sys.argv)=%d so trying first param"%len(sys.argv))
    one_param_num = 0
def compute_one(feature_name, test_fold, target_img):
    is_feature = np.array([feature_name in col for col in full_df.columns])
    is_all_test = full_df.fold == test_fold
    is_dict = {"all_train":False,"all_test":True}
    df_dict = {
        set_name:full_df.loc[is_all_test==is_value]
        for set_name, is_value in is_dict.items()}
    is_same_dict = {
        set_name:set_df.image == target_img
        for set_name, set_df in df_dict.items()}
    df_dict["test"] = df_dict["all_test"].loc[is_same_dict["all_test"], :]
    is_same_values_dict = {
        "all":[True,False],
        "same":[True],
        "other":[False],
    }
    train_name_list = []
    for data_name, is_same_values in is_same_values_dict.items():
        train_name_list.append(data_name)
        train_is_same = is_same_dict["all_train"].isin(is_same_values)
        df_dict[data_name] = df_dict["all_train"].loc[train_is_same,:]
    same_nrow, same_ncol = df_dict["same"].shape
    for data_name in "all", "other":
        small_name = data_name+"_small"
        train_name_list.append(small_name)
        train_df = df_dict[data_name]
        all_indices = np.arange(len(train_df))
        small_indices = np.random.permutation(all_indices)[:same_nrow]
        df_dict[small_name] = train_df.iloc[small_indices,:]
    xy_dict = {
        data_name:(df.loc[:,is_feature], df.label)
        for data_name, df in df_dict.items()}
    test_X, test_y = xy_dict["test"]
    error_df_list = []
    for train_name in train_name_list:
        train_X, train_y = xy_dict[train_name]
        param_dicts = [{'n_neighbors':[x]} for x in range(1, 21)]
        learner_dict = {
            "linear_model":LogisticRegressionCV(),
            "nearest_neighbors":GridSearchCV(
                KNeighborsClassifier(), param_dicts)
        }
        featureless_pred = train_y.value_counts().index[0]
        pred_dict = {
            "featureless":np.repeat(featureless_pred, len(test_y)),
        }
        for algorithm, learner in learner_dict.items():
            learner.fit(train_X, train_y)
            pred_dict[algorithm] = learner.predict(test_X)
        for algorithm, pred_vec in pred_dict.items():
            error_vec = pred_vec != test_y
            out_df=pd.DataFrame({
                "feature_name":[feature_name],
                "test_fold":[test_fold],
                "target_img":[target_img],
                "data_for_img":[len_series[target_img]],
                "test_nrow":[len(test_y)],
                "train_nrow":[len(train_y)],
                "train_name":[train_name],
                "algorithm":[algorithm],
                "error_percent":[error_vec.mean()*100],
            })
            error_df_list.append(out_df)
    return pd.concat(error_df_list)
one_param_series = params_df.iloc[one_param_num,:]
one_param_dict = dict(one_param_series)
error_df = compute_one(**one_param_dict)
out_file = f"results/{one_param_num}.csv"
error_df.to_csv(out_file,index=False)
Running the scripts on the cluster
I put these scripts on the cluster and first ran the params script on a compute node,
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python params.py
created params CSV files and job scripts, test with
python run_one.py 0
SLURM_ARRAY_TASK_ID=0 bash /scratch/th798/gabi_demo/run_one.sh
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ ls
full.csv params.csv params.py params.py~ results run_one.py run_one.py~
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ head *.csv
==> full.csv <==
label,image,signal,feature_easy_0,feature_easy_1,feature_easy_2,feature_easy_4,feature_impossible_1,feature_impossible_2,feature_impossible_3,feature_impossible_4,fold
burned,1,1,1.6243453636632417,-0.15323616176709168,0.4895166181417871,0.9228930431325937,0.8596290051283277,-0.9247552836219404,-2.8510695426019397,0.9200785993027285,0
no burn,1,0,-0.6117564136500754,-2.432508512647113,0.2387958575229506,0.2078252673129942,0.1416416728308121,1.1288898954126778,0.8231234224707914,0.018099210462107714,2
burned,1,1,-0.5281717522634557,0.5079843366153792,-0.4481118060935155,1.9861959282732953,1.311968612695531,-1.1287912685853583,-0.27474362331135227,1.4758691196562372,0
no burn,1,0,-1.0729686221561705,-0.3240323290200066,-0.6107950028484128,1.432756426622143,0.7690851809940938,-0.7247376220830969,-0.6881014060987432,0.3731985591697631,1
burned,1,1,0.8654076293246785,-1.5110766079102027,-2.0299450691852883,1.528258499515287,1.5842857679634836,0.6235712087829979,1.120080836099801,1.2951831782590502,0
no burn,1,0,-2.3015386968802827,-0.8714220655949576,0.607946586089135,-0.3677319748239843,1.7885926519290443,1.4379664640750502,0.2520928412415682,0.3277513475414965,0
burned,1,1,1.74481176421648,-0.8648299414655872,-0.35410888355969,1.691720859369497,0.975368956749922,0.2741418428159658,0.7824721743051218,-0.08139687871858095,2
no burn,1,0,-0.7612069008951028,0.6087490823417022,0.15258149012149536,-0.798346748264865,1.4902897985704333,-0.5044690146598694,-0.6030809780235539,0.1343882401809338,2
burned,1,1,0.31903909605709857,0.5616380965234054,0.5012748490408392,1.2138177272479647,0.6789228521496135,1.1066709960295764,1.248752273820923,1.7385846511025012,2
==> params.csv <==
feature_name,test_fold,target_img
easy,0,1
easy,0,2
easy,0,3
easy,0,4
easy,1,1
easy,1,2
easy,1,3
easy,1,4
easy,2,1
The output above indicates that the params script successfully created the full.csv and params.csv files. Now we test that the computation works for one parameter combination,
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python run_one.py 0
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python run_one.py 5
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ head results/*.csv
==> results/0.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
easy,0,1,400,134,664,all,featureless,58.2089552238806
easy,0,1,400,134,664,all,linear_model,29.1044776119403
easy,0,1,400,134,664,all,nearest_neighbors,29.1044776119403
easy,0,1,400,134,266,same,featureless,58.2089552238806
easy,0,1,400,134,266,same,linear_model,33.582089552238806
easy,0,1,400,134,266,same,nearest_neighbors,44.776119402985074
easy,0,1,400,134,398,other,featureless,58.2089552238806
easy,0,1,400,134,398,other,linear_model,32.83582089552239
easy,0,1,400,134,398,other,nearest_neighbors,30.597014925373134
==> results/5.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
easy,1,2,400,133,668,all,featureless,55.639097744360896
easy,1,2,400,133,668,all,linear_model,9.022556390977442
easy,1,2,400,133,668,all,nearest_neighbors,17.293233082706767
easy,1,2,400,133,267,same,featureless,55.639097744360896
easy,1,2,400,133,267,same,linear_model,8.270676691729323
easy,1,2,400,133,267,same,nearest_neighbors,19.548872180451127
easy,1,2,400,133,401,other,featureless,55.639097744360896
easy,1,2,400,133,401,other,linear_model,10.526315789473683
easy,1,2,400,133,401,other,nearest_neighbors,16.541353383458645
The output above indicates that the computation worked for parameter combinations 0 and 5. Now we try a similar computation under scratch,
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ SLURM_ARRAY_TASK_ID=1 bash /scratch/th798/gabi_demo/run_one.sh
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ SLURM_ARRAY_TASK_ID=20 bash /scratch/th798/gabi_demo/run_one.sh
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ head /scratch/th798/gabi_demo/results/*.csv
==> /scratch/th798/gabi_demo/results/1.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
easy,0,2,400,134,664,all,featureless,52.98507462686567
easy,0,2,400,134,664,all,linear_model,17.91044776119403
easy,0,2,400,134,664,all,nearest_neighbors,19.402985074626866
easy,0,2,400,134,266,same,featureless,52.98507462686567
easy,0,2,400,134,266,same,linear_model,18.65671641791045
easy,0,2,400,134,266,same,nearest_neighbors,18.65671641791045
easy,0,2,400,134,398,other,featureless,52.98507462686567
easy,0,2,400,134,398,other,linear_model,18.65671641791045
easy,0,2,400,134,398,other,nearest_neighbors,23.134328358208954
==> /scratch/th798/gabi_demo/results/20.csv <==
feature_name,test_fold,target_img,data_for_img,test_nrow,train_nrow,train_name,algorithm,error_percent
impossible,2,1,400,133,668,all,featureless,43.609022556390975
impossible,2,1,400,133,668,all,linear_model,42.857142857142854
impossible,2,1,400,133,668,all,nearest_neighbors,48.1203007518797
impossible,2,1,400,133,267,same,featureless,56.390977443609025
impossible,2,1,400,133,267,same,linear_model,39.849624060150376
impossible,2,1,400,133,267,same,nearest_neighbors,42.10526315789473
impossible,2,1,400,133,401,other,featureless,43.609022556390975
impossible,2,1,400,133,401,other,linear_model,51.127819548872175
impossible,2,1,400,133,401,other,nearest_neighbors,51.8796992481203
The output above indicates the result was computed successfully. Next we send the job to the compute cluster,
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ sbatch /scratch/th798/gabi_demo/run_one.sh
Submitted batch job 55610497
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ squeue -j55610497
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
55610497_[0-23] core gabi_dem th798 PD 0:00 1 (Priority)
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ squeue -j55610497
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
55610497_23 core gabi_dem th798 R 0:01 1 cn95
55610497_22 core gabi_dem th798 R 0:01 1 cn100
55610497_21 core gabi_dem th798 R 0:01 1 cn101
55610497_20 core gabi_dem th798 R 0:01 1 cn101
55610497_19 core gabi_dem th798 R 0:01 1 cn101
55610497_18 core gabi_dem th798 R 0:01 1 cn102
55610497_17 core gabi_dem th798 R 0:01 1 cn102
55610497_16 core gabi_dem th798 R 0:01 1 cn102
55610497_15 core gabi_dem th798 R 0:01 1 cn103
55610497_14 core gabi_dem th798 R 0:01 1 cn103
55610497_13 core gabi_dem th798 R 0:01 1 cn103
55610497_12 core gabi_dem th798 R 0:01 1 cn104
55610497_11 core gabi_dem th798 R 0:01 1 cn104
55610497_10 core gabi_dem th798 R 0:01 1 cn104
55610497_9 core gabi_dem th798 R 0:01 1 cn105
55610497_8 core gabi_dem th798 R 0:01 1 cn105
55610497_7 core gabi_dem th798 R 0:01 1 cn105
55610497_6 core gabi_dem th798 R 0:01 1 cn6
55610497_5 core gabi_dem th798 R 0:01 1 cn6
55610497_4 core gabi_dem th798 R 0:01 1 cn6
55610497_3 core gabi_dem th798 R 0:01 1 cn8
55610497_2 core gabi_dem th798 R 0:01 1 cn20
55610497_1 core gabi_dem th798 R 0:01 1 cn55
55610497_0 core gabi_dem th798 R 0:01 1 cn55
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ squeue -j55610497
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ wc -l /scratch/th798/gabi_demo/results/*.csv
16 /scratch/th798/gabi_demo/results/0.csv
16 /scratch/th798/gabi_demo/results/10.csv
16 /scratch/th798/gabi_demo/results/11.csv
16 /scratch/th798/gabi_demo/results/12.csv
16 /scratch/th798/gabi_demo/results/13.csv
16 /scratch/th798/gabi_demo/results/14.csv
16 /scratch/th798/gabi_demo/results/15.csv
16 /scratch/th798/gabi_demo/results/16.csv
16 /scratch/th798/gabi_demo/results/17.csv
16 /scratch/th798/gabi_demo/results/18.csv
16 /scratch/th798/gabi_demo/results/19.csv
16 /scratch/th798/gabi_demo/results/1.csv
16 /scratch/th798/gabi_demo/results/20.csv
16 /scratch/th798/gabi_demo/results/21.csv
16 /scratch/th798/gabi_demo/results/22.csv
16 /scratch/th798/gabi_demo/results/23.csv
16 /scratch/th798/gabi_demo/results/2.csv
16 /scratch/th798/gabi_demo/results/3.csv
16 /scratch/th798/gabi_demo/results/4.csv
16 /scratch/th798/gabi_demo/results/5.csv
16 /scratch/th798/gabi_demo/results/6.csv
16 /scratch/th798/gabi_demo/results/7.csv
16 /scratch/th798/gabi_demo/results/8.csv
16 /scratch/th798/gabi_demo/results/9.csv
384 total
The output above indicates that the jobs ran and finished successfully: there is one result file per parameter combination, each with 16 lines (a header plus 15 rows of error values).
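We can also verify programmatically that no array task failed silently. A minimal sketch of such a check (my addition, not one of the three scripts),
import os
import pandas as pd
# One results/{task}.csv is expected per row of params.csv.
job_dir = "/scratch/th798/gabi_demo"
params_df = pd.read_csv(os.path.join(job_dir, "params.csv"))
missing = [
    task for task in range(len(params_df))
    if not os.path.exists(os.path.join(job_dir, "results", f"{task}.csv"))
]
print("missing tasks:", missing)  # expect an empty list if all tasks succeeded
We then copy the results to the project directory,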
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ rsync -rvz /scratch/th798/gabi_demo/ ./
sending incremental file list
run_one.py
run_one.sh
slurm-55610497_0.out
slurm-55610497_1.out
slurm-55610497_10.out
slurm-55610497_11.out
slurm-55610497_12.out
slurm-55610497_13.out
slurm-55610497_14.out
slurm-55610497_15.out
slurm-55610497_16.out
slurm-55610497_17.out
slurm-55610497_18.out
slurm-55610497_19.out
slurm-55610497_2.out
slurm-55610497_20.out
slurm-55610497_21.out
slurm-55610497_22.out
slurm-55610497_23.out
slurm-55610497_3.out
slurm-55610497_4.out
slurm-55610497_5.out
slurm-55610497_6.out
slurm-55610497_7.out
slurm-55610497_8.out
slurm-55610497_9.out
results/0.csv
results/1.csv
results/10.csv
results/11.csv
results/12.csv
results/13.csv
results/14.csv
results/15.csv
results/16.csv
results/17.csv
results/18.csv
results/19.csv
results/2.csv
results/20.csv
results/21.csv
results/22.csv
results/23.csv
results/3.csv
results/4.csv
results/5.csv
results/6.csv
results/7.csv
results/8.csv
results/9.csv
sent 13,167 bytes received 975 bytes 9,428.00 bytes/sec
total size is 202,392 speedup is 14.31
We combine the results with the analyze.py script below,
import pandas as pd
import numpy as np
from glob import glob
# combine the per-task result files into one DataFrame.
out_df_list = []
for out_csv in glob("results/*.csv"):
    out_df_list.append(pd.read_csv(out_csv))
error_df = pd.concat(out_df_list)
error_df.to_csv("results.csv")
print(error_df)
import plotnine as p9
train_name_order=["same", "all", "all_small", "other", "other_small"]
def gg_error(feature_name):
    feature_df = error_df.query("feature_name == '%s'"%feature_name).copy()
    # grey rectangles show the range of featureless baseline error rates.
    featureless = feature_df.query("algorithm=='featureless'")
    grouped = featureless.groupby(
        ['target_img', 'data_for_img']
    )["error_percent"]
    min_max_df = grouped.apply(lambda df: pd.DataFrame({
        "xmin":[df.min()],
        "xmax":[df.max()],
        "ymin":[-np.inf],
        "ymax":[np.inf],
    })).reset_index()
    feature_df["Algorithm"] = "\n"+feature_df.algorithm
    feature_df["Train data"] = pd.Categorical(
        feature_df.train_name, train_name_order)
    gg = p9.ggplot()+\
        p9.theme_bw()+\
        p9.theme(panel_spacing=0.1)+\
        p9.theme(figure_size=(10,5))+\
        p9.geom_rect(
            p9.aes(
                xmin="xmin",
                xmax="xmax",
                ymin="ymin",
                ymax="ymax"
            ),
            fill="grey",
            data=min_max_df)+\
        p9.geom_point(
            p9.aes(
                x="error_percent",
                y="Train data"
            ),
            data=feature_df)+\
        p9.scale_x_continuous(limits=[0,62])+\
        p9.facet_grid("Algorithm ~ target_img + data_for_img", labeller="label_both")+\
        p9.ggtitle("Features: "+feature_name)
    gg.save(feature_name+".png")
gg_error("easy")
gg_error("impossible")
Running that script yields
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ python analyze.py
feature_name test_fold ... algorithm error_percent
0 impossible 0 ... featureless 58.208955
1 impossible 0 ... linear_model 38.059701
2 impossible 0 ... nearest_neighbors 44.776119
3 impossible 0 ... featureless 58.208955
4 impossible 0 ... linear_model 34.328358
.. ... ... ... ... ...
10 impossible 2 ... linear_model 36.363636
11 impossible 2 ... nearest_neighbors 36.363636
12 impossible 2 ... featureless 42.424242
13 impossible 2 ... linear_model 45.454545
14 impossible 2 ... nearest_neighbors 45.454545
[360 rows x 9 columns]
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:721: PlotnineWarning: Saving 10 x 5 in image.
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:722: PlotnineWarning: Filename: easy.png
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:721: PlotnineWarning: Saving 10 x 5 in image.
/home/th798/.conda/envs/emacs1/lib/python3.7/site-packages/plotnine/ggplot.py:722: PlotnineWarning: Filename: impossible.png
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ ls *.png
easy.png impossible.png
(emacs1) th798@cn29:~/genomic-ml/gabi_demo$ cd ..
(emacs1) th798@cn29:~/genomic-ml$ publish_data gabi_demo/
gabi_demo/ has been published
Any published files are now accessible at https://rcdata.nau.edu/th798
Actually, the files are published at https://rcdata.nau.edu/genomic-ml/gabi_demo/ including:
- results.csv, the combined results CSV file from all parallel jobs.
- easy.png, a plot of error rates when learning with the easy features.
- impossible.png, a plot of error rates when learning with the impossible features.
- other files, including the slurm out/log files, the individual result CSV files, and the input Python scripts and CSV data files.
In conclusion, we have shown how to use the cluster to run a machine learning cross-validation experiment in parallel. In this example the data set was relatively small, and there were only 24 parameter combinations to run in parallel on the cluster nodes, so the time savings are modest. For larger machine learning experiments, involving more data sets and algorithms and therefore hundreds or thousands of parameter combinations, this parallel computing setup can save a lot of time.
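For a sense of scale, the number of array tasks is just the product of the grid dimensions, so the grid grows quickly. A minimal sketch with hypothetical numbers (my addition),
import pandas as pd
# Hypothetical larger experiment: 10 data sets, 10 folds, 5 feature sets.
big_grid = pd.MultiIndex.from_product(
    [range(10), range(10), range(5)],
    names=["data_set","test_fold","feature_name"]
).to_frame().reset_index(drop=True)
print(len(big_grid))  # 500 parameter combinations, i.e. 500 cluster tasks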