OpenMP
Today my colleague Mike Gowanlock gave me a brief tutorial on how to use OpenMP to run C++ code in parallel on multiple CPUs.
The main idea is to insert #pragma omp parallel before a block of code that should run on every CPU, and then #pragma omp for before a for loop whose iterations have no dependencies between them, so the iterations can be divided among the CPUs. You need to #include <omp.h>, and I needed the following in my src/Makevars file on Windows, using g++ to compile the C++ code in an R package:
PKG_CPPFLAGS=-fopenmp
PKG_LIBS=-lgomp
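To make the pattern concrete, here is a minimal standalone sketch (just an illustration, not from the package; the array and its size are arbitrary) which fills a vector in parallel:

#include <omp.h>
#include <cstdio>
#include <vector>

int main(){
  const int N = 1000000; // arbitrary size, just for illustration
  std::vector<double> x(N);
#pragma omp parallel
  { // this block is executed by every CPU/thread
#pragma omp for
    for(int i=0; i<N; i++){ // iterations are divided among the threads
      x[i] = (double)i * (double)i;
    }
  }
  std::printf("x[10]=%.0f, computed with up to %d threads\n",
              x[10], omp_get_max_threads());
  return 0;
}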
For a fuller example, here is a version of the nearest neighbors algorithm that I coded for multi-class classification:
int Predict1toMaxNeighborsMatrixMultiClass
(double *train_inputs_ptr, //ntrain x ncol
 int *train_label_ptr,     //ntrain
 int n_train, int ncol, int max_neighbors, int n_test,
 int n_labels,
 double *test_inputs_ptr,  //ncol x ntest
 int *test_predictions_ptr //max_neighbors x ntest
 ){
  Eigen::Map< Eigen::MatrixXd > test_inputs_mat(test_inputs_ptr, ncol, n_test);
  Eigen::Map< Eigen::MatrixXi > test_predictions_mat(test_predictions_ptr, max_neighbors, n_test);
#pragma omp parallel
  {
    // per-thread buffers: distances to all training points, sorted indices,
    // and label counts, allocated once per thread rather than once per test point
    Eigen::VectorXd distance_vec(n_train);
    Eigen::VectorXi sorted_index_vec(n_train), label_count_vec(n_labels);
#pragma omp for
    for(int test_i=0; test_i<n_test; test_i++){
      // each thread handles its own subset of the test points
      Predict1toMaxNeighborsMultiClass(
        train_inputs_ptr, train_label_ptr,
        n_train, ncol, max_neighbors, n_labels,
        distance_vec.data(),
        sorted_index_vec.data(),
        label_count_vec.data(),
        test_inputs_mat.col(test_i).data(),
        test_predictions_mat.col(test_i).data());
    }
  }
  return 0;
}
Note the #pragma omp parallel block, which allocates O(n_train) memory per CPU for storing the distances, indices, and label counts for each test data point. The for loop is executed in parallel: each CPU processes a separate set of test data points via the Predict1toMaxNeighborsMultiClass sub-routine.
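For readers who want the general idea, here is a rough sketch of the kind of logic inside such a sub-routine (a simplified sketch, not the actual code, and it assumes labels are coded as integers in 0..n_labels-1):

#include <Eigen/Dense>
#include <algorithm>
#include <numeric>

// Sketch: predict the most frequent label among the k nearest training
// points, for every k from 1 to max_neighbors.
void Predict1toMaxNeighborsMultiClass
(double *train_inputs_ptr, int *train_label_ptr,
 int n_train, int ncol, int max_neighbors, int n_labels,
 double *distance_ptr,     // n_train buffer
 int *sorted_index_ptr,    // n_train buffer
 int *label_count_ptr,     // n_labels buffer
 double *test_input_ptr,   // one test point, ncol
 int *test_prediction_ptr){// max_neighbors outputs
  Eigen::Map<const Eigen::MatrixXd> train_inputs_mat(train_inputs_ptr, n_train, ncol);
  Eigen::Map<const Eigen::VectorXd> test_input_vec(test_input_ptr, ncol);
  // squared Euclidean distance from the test point to every training point
  for(int train_i=0; train_i<n_train; train_i++){
    distance_ptr[train_i] =
      (train_inputs_mat.row(train_i).transpose() - test_input_vec).squaredNorm();
  }
  // sort training indices by increasing distance
  std::iota(sorted_index_ptr, sorted_index_ptr + n_train, 0);
  std::sort(sorted_index_ptr, sorted_index_ptr + n_train,
            [distance_ptr](int a, int b){return distance_ptr[a] < distance_ptr[b];});
  // running label counts give the prediction for each number of neighbors;
  // ties keep the previously most frequent label
  std::fill(label_count_ptr, label_count_ptr + n_labels, 0);
  int best_label = 0;
  for(int k=0; k<max_neighbors; k++){
    int label = train_label_ptr[sorted_index_ptr[k]];
    label_count_ptr[label]++;
    if(label_count_ptr[label] > label_count_ptr[best_label]){
      best_label = label;
    }
    test_prediction_ptr[k] = best_label;
  }
}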
The nice thing about this kind of parallel coding is that it is very easy to write, and extremely memory efficient: the extra memory is linear in the number of CPUs, not in the size of the data. In contrast, parallelism in R/Python typically gives each worker process its own copy of the data, so memory usage grows linearly with the data size, which can be problematic for large data. So C++ OpenMP code is a great option for efficient parallel processing of large data.
I used the code to compute K=1 to 50 nearest neighbors predictions using 5832 training data and 7291 test data in a 256-dimensional feature space (the zip.train data from the ElemStatLearn R package). On my machine (12 CPUs), OpenMP reduced the computation time from about 20 seconds to about 2 seconds:
user system elapsed
19.75 0.00 1.84
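The number of threads is controlled by the usual OpenMP mechanisms, for example the OMP_NUM_THREADS environment variable or omp_set_num_threads; a small standalone illustration (not part of the package code):

#include <omp.h>
#include <cstdio>

int main(){
  std::printf("default max threads: %d\n", omp_get_max_threads());
  omp_set_num_threads(4); // use 4 threads for subsequent parallel regions
#pragma omp parallel
  {
#pragma omp for
    for(int i=0; i<8; i++){
      // output order is nondeterministic, since threads run concurrently
      std::printf("iteration %d on thread %d\n", i, omp_get_thread_num());
    }
  }
  return 0;
}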