Today my colleague Mike Gowanlock gave me a brief tutorial on how to use OpenMP to run C++ code in parallel on multiple CPUs.

The main idea is to insert #pragma omp parallel before a block of code that will be run on each thread, and then #pragma omp for before a for loop whose iterations have no dependencies on each other. You need #include <omp.h>, and I needed the following in my src/Makevars file on Windows, using g++ to compile the C++ code in an R package:

PKG_CPPFLAGS=-fopenmp
PKG_LIBS=-lgomp

Here is an example. I coded a version of the nearest neighbors algorithm for multi-class classification.

#include <omp.h>
#include <Eigen/Dense> // Eigen::Map, MatrixXd, MatrixXi, VectorXd, VectorXi

int Predict1toMaxNeighborsMatrixMultiClass
(double *train_inputs_ptr, //ntrain x ncol
 int *train_label_ptr,  //ntrain
 int n_train, int ncol, int max_neighbors, int n_test,
 int n_labels,
 double *test_inputs_ptr,     //ncol x ntest
 int *test_predictions_ptr //max_neighbors x ntest
 ){
  Eigen::Map< Eigen::MatrixXd > test_inputs_mat(test_inputs_ptr, ncol, n_test);
  Eigen::Map< Eigen::MatrixXi > test_predictions_mat(test_predictions_ptr, max_neighbors, n_test);
#pragma omp parallel
  {
    Eigen::VectorXd distance_vec(n_train);
    Eigen::VectorXi sorted_index_vec(n_train), label_count_vec(n_labels);
#pragma omp for
    for(int test_i=0; test_i<n_test; test_i++){
      Predict1toMaxNeighborsMultiClass(
        train_inputs_ptr, train_label_ptr,
        n_train, ncol, max_neighbors, n_labels,
        distance_vec.data(),
        sorted_index_vec.data(),
        label_count_vec.data(),
        test_inputs_mat.col(test_i).data(),
        test_predictions_mat.col(test_i).data());
    }
  }
  return 0;
}

Note the #pragma omp parallel block, which allocates O(n_train) memory per thread for storing the distances, sorted indices, and label counts for each test data point. The for loop is executed in parallel: each thread processes a separate subset of the test data points via the Predict1toMaxNeighborsMultiClass sub-routine.

The nice thing about this kind of parallel coding is that it is easy to write and extremely memory efficient: the scratch memory overhead is linear in the number of threads, since each thread only needs its own O(n_train) buffers. In contrast, multi-process parallelism in R/Python typically copies the data to each worker, so memory usage grows linearly with the data size times the number of workers, which can be problematic for large data. So C++ OpenMP code is a great option for efficient parallel processing of large data.

I used the code to compute K=1 to 50 nearest neighbors predictions using 5832 training data and 7291 test data in a 256-dimensional feature space (zip.train data from the ElemStatLearn R package). On my machine (12 CPUs), the parallel version reduced the computation time from about 20 seconds to about 2 seconds.

   user  system elapsed 
  19.75    0.00    1.84