I have an Rcpp function that reads a large BAM file (1-20 GB, using htslib) and creates several very long std::vectors (up to 80M elements). The number of elements is not known before reading, so I cannot use Rcpp::IntegerVector and Rcpp::CharacterVector. As far as I understand, when I Rcpp::wrap them for further use, a copy is created. Is there a way to speed up the transfer of data from C++ to R in this situation? Is there a data structure that can be created within an Rcpp function, is as quick to push_back elements to as std::vector is, and can be passed by reference to R?
Just in case, here's how I create them currently:
std::vector<std::string> seq, xm;
std::vector<int> rname, strand, start;
And here's how I wrap and return them:
Rcpp::IntegerVector w_rname = Rcpp::wrap(rname);
w_rname.attr("class") = "factor";
w_rname.attr("levels") = chromosomes; // chromosomes contain names of the reference sequences from BAM
Rcpp::IntegerVector w_strand = Rcpp::wrap(strand);
w_strand.attr("class") = "factor";
w_strand.attr("levels") = strands; // std::vector<std::string> strands = {"+", "-"};
Rcpp::DataFrame res = Rcpp::DataFrame::create(
Rcpp::Named("rname") = w_rname,
Rcpp::Named("strand") = w_strand,
Rcpp::Named("start") = start,
Rcpp::Named("seq") = seq,
Rcpp::Named("XM") = xm
);
return(res);
Edit 1 (2021.10.19):
Thanks to everyone for the comments. I need more time to check whether stringfish can be used, but I ran a slightly modified test from the cpp11 package vignettes to compare it with std::vector. Here are the code and results (showing that std::vector<int> is still faster even though it must be Rcpp::wrapped upon return):
Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
std::vector<int> stdint_grow_(SEXP n_sxp) {
R_xlen_t n = REAL(n_sxp)[0];
std::vector<int> x;
R_xlen_t i = 0;
while (i < n) {
x.push_back(i++);
}
return x;
}')
library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:7), pkg = c("cpp11", "stdint"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
{
fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
bench::mark(
fun(len)
)
}
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]
print(b_grow, n=Inf)
# A tibble: 12 × 6
        len pkg         min mem_alloc n_itr  n_gc
      <dbl> <chr>  <bch:tm> <bch:byt> <int> <dbl>
 1      100 cpp11     1.9µs    1.89KB  9999     1
 2     1000 cpp11     6.1µs   16.03KB  9999     1
 3    10000 cpp11   58.11µs  256.22KB  7267    12
 4   100000 cpp11  488.15µs       2MB   815    11
 5  1000000 cpp11    4.34ms      16MB    88    14
 6 10000000 cpp11   97.39ms     256MB     4     5
 7      100 stdint    1.6µs    2.93KB 10000     0
 8     1000 stdint   3.36µs    6.45KB  9998     2
 9    10000 stdint  19.87µs    41.6KB  9998     2
10   100000 stdint 181.88µs  393.16KB  2571     4
11  1000000 stdint   1.91ms    3.82MB   213     3
12 10000000 stdint  36.09ms   38.15MB     9     1
Edit 2:
std::vector<std::string> is marginally slower than cpp11::writable::strings in these test conditions, but more memory-efficient:
Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
std::vector<std::string> stdstr_grow_(SEXP n_sxp) {
R_xlen_t n = REAL(n_sxp)[0];
std::vector<std::string> x;
R_xlen_t i = 0;
while (i++ < n) {
std::string s (i, 33);
x.push_back(s);
}
return x;
}')
cpp11::cpp_source(code='
#include "cpp11/strings.hpp"
[[cpp11::register]] cpp11::writable::strings cpp11str_grow_(R_xlen_t n) {
cpp11::writable::strings x;
R_xlen_t i = 0;
while (i++ < n) {
std::string s (i, 33);
x.push_back(s);
}
return x;
}
')
library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:5), pkg = c("cpp11str", "stdstr"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
{
fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
bench::mark(
fun(len)
)
}
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]
print(b_grow, n=Inf)
# A tibble: 12 × 6
      len pkg           min mem_alloc n_itr  n_gc
    <dbl> <chr>    <bch:tm> <bch:byt> <int> <dbl>
 1      1 cpp11str   1.22µs        0B 10000     0
 2     10 cpp11str   3.02µs        0B  9999     1
 3    100 cpp11str     22µs    1.89KB  9997     3
 4   1000 cpp11str 765.28µs  541.62KB   602     2
 5  10000 cpp11str  66.69ms   47.91MB     8     0
 6 100000 cpp11str    6.83s    4.62GB     1     0
 7      1 stdstr     1.38µs    2.49KB 10000     0
 8     10 stdstr     1.86µs    2.49KB 10000     0
 9    100 stdstr    16.44µs    3.32KB 10000     0
10   1000 stdstr   898.23µs   10.35KB   511     0
11  10000 stdstr    73.55ms   80.66KB     7     0
12 100000 stdstr      7.54s  783.79KB     1     0
Solution (2022.01.12):
... for those who have a similar question. In this particular case I didn't need to use the std::vector data within R, so XPtr easily solved my problem, nearly halving the BAM loading time. The pointers are created:
std::vector<std::string>* seq = new std::vector<std::string>;
std::vector<std::string>* xm = new std::vector<std::string>;
and then stored as data.frame attributes:
Rcpp::DataFrame res = Rcpp::DataFrame::create(
Rcpp::Named("rname") = w_rname,
Rcpp::Named("strand") = w_strand,
Rcpp::Named("start") = start
);
Rcpp::XPtr<std::vector<std::string>> seq_xptr(seq, true);
res.attr("seq_xptr") = seq_xptr;
Rcpp::XPtr<std::vector<std::string>> xm_xptr(xm, true);
res.attr("xm_xptr") = xm_xptr;
and reused elsewhere as follows:
Rcpp::XPtr<std::vector<std::string>> seq((SEXP)df.attr("seq_xptr"));
Rcpp::XPtr<std::vector<std::string>> xm((SEXP)df.attr("xm_xptr"));
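For readers who want to see the ownership idea in isolation, here is a minimal sketch in plain C++ that compiles without R. It is an analogy, not Rcpp code: std::shared_ptr stands in for Rcpp::XPtr, and the Frame struct, load() and total_bases() are hypothetical names made up for illustration. Under those assumptions it shows the string vector being filled once on the heap and later read through a handle, with no element-wise copy in between.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

using StringVec = std::vector<std::string>;

// Hypothetical stand-in for a data.frame that carries an extra attribute.
struct Frame {
    std::vector<int> start;
    // Plays the role of attr("seq_xptr"): an owning handle, not a copy.
    std::shared_ptr<StringVec> seq_handle;
};

// Producer: grow the string vector on the heap and attach only the handle.
Frame load(std::size_t n) {
    auto seq = std::make_shared<StringVec>();
    Frame res;
    for (std::size_t i = 0; i < n; ++i) {
        res.start.push_back(static_cast<int>(i));
        seq->push_back(std::string(i + 1, 'A'));  // dummy "read" of length i + 1
    }
    res.seq_handle = seq;  // copies the handle; the strings stay where they are
    return res;
}

// Consumer: dereference the handle and read the strings in place.
std::size_t total_bases(const Frame& df) {
    const StringVec& seq = *df.seq_handle;
    std::size_t total = 0;
    for (const std::string& s : seq) {
        total += s.size();
    }
    return total;
}
```

In the real code the handle is an R external pointer, and the `true` argument to XPtr asks Rcpp to register a finalizer that deletes the vector once the pointer is garbage-collected.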
We use std::vector<> because of its robust implementation coupled with great performance (it is generally hard to see std::vector<> beaten in any comparison). But it uses its own allocator for memory held outside of R.
Rcpp returns objects to R that are indistinguishable from objects created by R, as they use R's own data structures, and that requires one final copy into memory used, owned and allocated by R. There is simply no way around it with the current interface if you want to return all elements to R.
R now has ALTREP, allowing for alternative / external representations, so you could cook up something different, but it is in fact somewhat difficult as the API for ALTREP is still both somewhat incomplete and changing. Some packages have been built using ALTREP, but none comes to mind right now for your particular use case.
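The final copy described above can be pictured with a small sketch in plain C++. It deliberately avoids Rcpp so that it compiles on its own: RLikeIntVector is a hypothetical stand-in for an R-owned integer vector, and read_records() and wrap_like() are illustrative names, not part of any library. Under those assumptions it shows the two phases: growing a std::vector of unknown final size, then one bulk copy into memory owned by a different allocator.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-in for an R-owned integer vector. In real Rcpp code this
// would be an Rcpp::IntegerVector whose memory is allocated by R.
struct RLikeIntVector {
    std::unique_ptr<int[]> data;
    std::size_t length;
};

// Phase 1: grow a std::vector without knowing the final size in advance.
std::vector<int> read_records(std::size_t n_records) {
    std::vector<int> out;
    for (std::size_t i = 0; i < n_records; ++i) {
        out.push_back(static_cast<int>(i));
    }
    return out;
}

// Phase 2: allocate the destination at its now-known size and copy once.
RLikeIntVector wrap_like(const std::vector<int>& src) {
    RLikeIntVector dst{std::make_unique<int[]>(src.size()), src.size()};
    if (!src.empty()) {
        // One contiguous block copy; the source vector is left untouched.
        std::memcpy(dst.data.get(), src.data(), src.size() * sizeof(int));
    }
    return dst;
}
```

For int data the second phase is a single block copy over contiguous memory, which is why it is likely to be cheap relative to reading the file itself.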
Edit: For your string vectors you could (and should) try the stringfish package by Travers. It uses ALTREP for strings, which may be your bigger performance hurdle. For the int vector I do not have an alternative, but maybe the final memcpy there is also not that painful (as opposed to strings, which are handled differently internally, making them more expensive).