Is it possible to do Rcpp::wrap without copying for a very large std::vector?


I have an Rcpp function that reads a large BAM file (1-20 GB, using htslib) and creates several very long std::vectors (up to 80M elements). The number of elements is not known before reading, so I cannot allocate Rcpp::IntegerVector and Rcpp::CharacterVector up front. As far as I understand, Rcpp::wrap makes a copy when I return them for further use. Is there a way to speed up the transfer of data from C++ to R in this situation? Is there a data structure that can be created within an Rcpp function, is as quick to push_back elements to as std::vector, and can be passed by reference to R?

Just in case, here's how I create them currently:

std::vector<std::string> seq, xm;
std::vector<int> rname, strand, start;

And here's how I wrap and return them:

Rcpp::IntegerVector w_rname = Rcpp::wrap(rname);
w_rname.attr("class") = "factor";
w_rname.attr("levels") = chromosomes;  // chromosomes contain names of the reference sequences from BAM

Rcpp::IntegerVector w_strand = Rcpp::wrap(strand);
w_strand.attr("class") = "factor";
w_strand.attr("levels") = strands;  // std::vector<std::string> strands = {"+", "-"};
      
Rcpp::DataFrame res = Rcpp::DataFrame::create(
  Rcpp::Named("rname") = w_rname,
  Rcpp::Named("strand") = w_strand,
  Rcpp::Named("start") = start,
  Rcpp::Named("seq") = seq,
  Rcpp::Named("XM") = xm
);
      
return(res);
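
One detail worth checking in the wrapping code above: an R factor stores 1-based integer codes that index into its levels attribute, whereas htslib reports the reference id (tid) as a 0-based index, with -1 for reads that have no reference. The question does not show how rname is filled, so the following is only a sketch under the assumption that raw tid values are collected. The helper names and the NA_CODE constant are hypothetical; NA_CODE stands in for R's NA_INTEGER (which is INT_MIN) so that the sketch compiles without R headers.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Stand-in for R's NA_INTEGER (INT_MIN); an assumption so that the
// sketch does not need R or Rcpp headers.
constexpr int NA_CODE = INT_MIN;

// Hypothetical helper: convert a 0-based reference id into a 1-based
// factor code. Ids outside [0, n_levels), such as -1, become NA.
inline int tid_to_factor_code(int tid, int n_levels) {
  return (tid >= 0 && tid < n_levels) ? tid + 1 : NA_CODE;
}

// Convert a whole column in place, so no second vector is allocated.
inline void tids_to_factor_codes(std::vector<int>& tids, int n_levels) {
  for (int& t : tids) {
    t = tid_to_factor_code(t, n_levels);
  }
}
```

If the codes are already shifted to 1-based values while reading, no such step is needed.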

Edit 1 (2021.10.19):

Thanks to everyone for the comments. I need more time to check whether stringfish can be used, but I ran a slightly modified test from the cpp11 package vignettes to compare cpp11 with std::vector. Here are the code and the results (showing that std::vector<int> is still faster, even though it must be Rcpp::wrapped upon return):

Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;

//[[Rcpp::export]]
std::vector<int> stdint_grow_(SEXP n_sxp) {
  R_xlen_t n = REAL(n_sxp)[0];
  std::vector<int> x;
  R_xlen_t i = 0;
  while (i < n) {
    x.push_back(i++);
  }

  return x;
}')

library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:7), pkg = c("cpp11", "stdint"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
                       {
                         fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
                         bench::mark(
                           fun(len)
                         )
                       }
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]

print(b_grow, n=Inf)

# A tibble: 12 × 6
        len pkg         min mem_alloc n_itr  n_gc
      <dbl> <chr>  <bch:tm> <bch:byt> <int> <dbl>
 1      100 cpp11     1.9µs    1.89KB  9999     1
 2     1000 cpp11     6.1µs   16.03KB  9999     1
 3    10000 cpp11   58.11µs  256.22KB  7267    12
 4   100000 cpp11  488.15µs       2MB   815    11
 5  1000000 cpp11    4.34ms      16MB    88    14
 6 10000000 cpp11   97.39ms     256MB     4     5
 7      100 stdint    1.6µs    2.93KB 10000     0
 8     1000 stdint   3.36µs    6.45KB  9998     2
 9    10000 stdint  19.87µs    41.6KB  9998     2
10   100000 stdint 181.88µs  393.16KB  2571     4
11  1000000 stdint   1.91ms    3.82MB   213     3
12 10000000 stdint  36.09ms   38.15MB     9     1

Edit 2:

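std::vector<std::string> is marginally slower than cpp11::writable::strings in these test conditions, but more memory-efficient as far as mem_alloc shows (bench only tracks allocations made through R, so memory that std::vector takes from the C++ heap is not counted there):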

Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;

//[[Rcpp::export]]
std::vector<std::string> stdstr_grow_(SEXP n_sxp) {
  R_xlen_t n = REAL(n_sxp)[0];
  std::vector<std::string> x;
  R_xlen_t i = 0;
  while (i++ < n) {
    std::string s (i, 33);
    x.push_back(s);
  }

  return x;
}')

cpp11::cpp_source(code='
#include "cpp11/strings.hpp"

[[cpp11::register]] cpp11::writable::strings cpp11str_grow_(R_xlen_t n) {
  cpp11::writable::strings x;
  R_xlen_t i = 0;
  while (i++ < n) {
    std::string s (i, 33);
    x.push_back(s);
  }

  return x;
}                
')

library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:5), pkg = c("cpp11str", "stdstr"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
                       {
                         fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
                         bench::mark(
                           fun(len)
                         )
                       }
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]

print(b_grow, n=Inf)

# A tibble: 12 × 6
      len pkg           min mem_alloc n_itr  n_gc
    <dbl> <chr>    <bch:tm> <bch:byt> <int> <dbl>
 1      1 cpp11str   1.22µs        0B 10000     0
 2     10 cpp11str   3.02µs        0B  9999     1
 3    100 cpp11str     22µs    1.89KB  9997     3
 4   1000 cpp11str 765.28µs  541.62KB   602     2
 5  10000 cpp11str  66.69ms   47.91MB     8     0
 6 100000 cpp11str    6.83s    4.62GB     1     0
 7      1 stdstr     1.38µs    2.49KB 10000     0
 8     10 stdstr     1.86µs    2.49KB 10000     0
 9    100 stdstr    16.44µs    3.32KB 10000     0
10   1000 stdstr   898.23µs   10.35KB   511     0
11  10000 stdstr    73.55ms   80.66KB     7     0
12 100000 stdstr      7.54s  783.79KB     1     0

Solution (2022.01.12):

... for those who have a similar question. In this particular case I didn't need to use the std::vector data within R, so XPtr easily solved my problem, cutting BAM loading time nearly in half. The vectors are allocated on the heap:

std::vector<std::string>* seq = new std::vector<std::string>;
std::vector<std::string>* xm = new std::vector<std::string>;

and then stored as a data.frame attribute:

Rcpp::DataFrame res = Rcpp::DataFrame::create(                                
    Rcpp::Named("rname") = w_rname,
    Rcpp::Named("strand") = w_strand,
    Rcpp::Named("start") = start
);

Rcpp::XPtr<std::vector<std::string>> seq_xptr(seq, true);
res.attr("seq_xptr") = seq_xptr;

Rcpp::XPtr<std::vector<std::string>> xm_xptr(xm, true);
res.attr("xm_xptr") = xm_xptr;

and reused elsewhere as follows:

Rcpp::XPtr<std::vector<std::string>> seq((SEXP)df.attr("seq_xptr"));
Rcpp::XPtr<std::vector<std::string>> xm((SEXP)df.attr("xm_xptr"));

There is 1 answer

Dirk is no longer here

We use std::vector<> because of its robust implementation coupled with great performance (it is generally hard to see std::vector<> beaten in any comparison). But it uses its own allocator, so its memory is held outside of R.

Rcpp returns objects to R that are indistinguishable from objects created by R, because they use R's own data structures. That requires one final copy into memory that is allocated, owned and used by R. There is simply no way around it with the current interface if you want to return all elements to R.

R now has ALTREP, which allows for alternative / external representations, so you could cook up something different. It is in fact somewhat difficult, though, as the ALTREP API is still both somewhat incomplete and changing. Some packages have been built using ALTREP, but none comes to mind right now for your particular use case.

Edit: For your string vectors you could (and should) try the stringfish package by Travers. It uses ALTREP for strings, which may be your bigger performance hurdle. For the int vector I do not have an alternative but maybe the final memcpy there is also not that painful (as opposed to strings which are handled differently internally making them more expensive).
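
To make the difference concrete, here is a minimal sketch in plain C++ of the two kinds of final copy. It is not Rcpp's implementation: a second std::vector stands in for the R-owned result, and the function names are made up for illustration. For integers the whole payload is one contiguous block, so a single memcpy suffices. For strings every element owns its own buffer and has to be duplicated one at a time (in R, each element additionally becomes a CHARSXP that is looked up in the global string cache).

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Integers: one bulk copy of size() * sizeof(int) bytes.
inline std::vector<int> copy_ints(const std::vector<int>& src) {
  std::vector<int> dst(src.size());
  if (!src.empty()) {
    std::memcpy(dst.data(), src.data(), src.size() * sizeof(int));
  }
  return dst;
}

// Strings: one separate copy (and often one allocation) per element.
inline std::vector<std::string> copy_strings(
    const std::vector<std::string>& src) {
  std::vector<std::string> dst;
  dst.reserve(src.size());
  for (const std::string& s : src) {
    dst.emplace_back(s.data(), s.size());
  }
  return dst;
}
```

With 80M elements the integer copy moves roughly 320 MB in one pass, while the string copy performs 80M individual duplications, which is why the string columns are the more likely bottleneck.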