Rcpp version of tabulate is slower; where is this from, how to understand

336 views Asked by At

In the process of creating some sampling functions for already aggregated data I found that table was rather slow on the size data I am working with. I tried two improvements, first an Rcpp function as follows

// [[Rcpp::export]]
IntegerVector getcts(NumericVector x, int m) {
  IntegerVector cts(m);
  int t;
  for (int i = 0; i < x.length(); i++) {
    t = x[i] - 1;
    if (0 <= t && t < m)
      cts[t]++;
  }
  return cts;
}

And then while trying to understand why table was rather slow I found it being based on tabulate. Tabulate works well for me, and is faster than the Rcpp version. The code for tabulate is at:

https://github.com/wch/r-source/blob/545d365bd0485e5f0913a7d609c2c21d1f43145a/src/main/util.c#L2204

With the key line being:

for(R_xlen_t i = 0 ; i < n ; i++)
  if (x[i] != NA_INTEGER && x[i] > 0 && x[i] <= nb) y[x[i] - 1]++;

Now the key parts of tabulate and my Rcpp version seem pretty close (I have not bothered dealing with NA).

Q1: why is my Rcpp version 3 times slower?

Q2: how can I find out where this time goes?

I would very much appreciate knowing where the time went, but even better would be a good way to profile the code. My C++ skills are only so so, but this seems simple enough that I should (cross my fingers) have been able to avoid any silly stuff that would triple my time.

My timing code:

max_x <- 100
xs <- sample(seq(max_x), size = 50000000, replace = TRUE)
system.time(getcts(xs, max_x))
system.time(tabulate(xs))

This gives 0.318 for getcts and 0.126 for tabulate.

1

There are 1 answers

0
Artem Klevtsov On BEST ANSWER

Your function calls a length method in each loop iteration. Seems compiler don't cache it. To fix this store size of the vector in a separate variable or use range based loop. Also note that we don't really need explicit missing values check because in C++ all comparisons involving a NaN always return false.

Let's compare performance:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector tabulate1(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    for (std::size_t i = 0; i < x.size(); i++) {
        if (x[i] > 0 && x[i] <= max)
            counts[x[i] - 1]++;
    }
    return counts;
}

// [[Rcpp::export]]
IntegerVector tabulate2(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    std::size_t n = x.size();
    for (std::size_t i = 0; i < n; i++) {
        if (x[i] > 0 && x[i] <= max)
            counts[x[i] - 1]++;
    }
    return counts;
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate3(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    for (auto& now : x) {
        if (now > 0 && now <= max)
            counts[now - 1]++;
    }
    return counts;
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    for (auto it = x.begin(); it != x.end(); it++) {
        if (*it > 0 && *it <= max)
            counts[*it - 1]++;
    }
    return counts;
}

/***R
library(microbenchmark)
x <- sample(10, 1e5, rep = TRUE)
microbenchmark(
    tabulate(x, 10), tabulate1(x, 10),
    tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
x[sample(10e5, 10e3)] <- NA
microbenchmark(
    tabulate(x, 10), tabulate1(x, 10),
    tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
*/

tabulate1 is the original version.

Benchmark results:

Without NA:

Unit: microseconds
            expr     min       lq     mean   median      uq     max neval
 tabulate(x, 10) 143.557 146.8355 169.2820 156.1970 177.327 286.370   100
tabulate1(x, 10) 390.706 392.6045 437.7357 416.5655 443.065 748.767   100
tabulate2(x, 10) 108.149 111.4345 139.7579 118.2735 153.118 337.647   100
tabulate3(x, 10) 107.879 111.7305 138.2711 118.8650 139.598 300.023   100
tabulate4(x, 10) 391.003 393.4530 436.3063 420.1915 444.048 777.862   100

With NA:

Unit: microseconds
            expr      min        lq     mean   median       uq      max neval
 tabulate(x, 10)  943.555 1089.5200 1614.804 1333.806 2042.320 3986.836   100
tabulate1(x, 10) 4523.076 4787.3745 5258.490 4929.586 5624.098 7233.029   100
tabulate2(x, 10)  765.102  931.9935 1361.747 1113.550 1679.024 3436.356   100
tabulate3(x, 10)  773.358  914.4980 1350.164 1140.018 1642.354 3633.429   100
tabulate4(x, 10) 4241.025 4466.8735 4933.672 4717.016 5148.842 8603.838   100

The tabulate4 function which uses an iterator also slower than tabulate. We can improve it:

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    auto start = x.begin();
    auto end = x.end();
    for (auto it = start; it != end; it++) {
        if (*(it) > 0 && *(it) <= max)
            counts[*(it) - 1]++;
    }
    return counts;
}