In short: How can I call, from within Rccp C++ code, the agrep C internal function that gets called when users use the regular agrep function from base R?
In long: I have found multiple questions here about how to invoke, from within Rcpp, a C or C++ function created for another package (e.g. using C function from other package in Rcpp and Rcpp: Call C function from a package within Rcpp).
The thing that I am trying to achieve, however, is at the same time simpler but also way less documented: it is to directly call, from within Rcpp, a .Internal C function that comes with base R rather than another package, without interfacing with R (that is, without doing what is said in Call R functions in Rcpp). How could I do that for the .Internal C function that lays underneath base R's agrep wrapper?
The specific function I am trying to call here is the agrep internal C function. And for context, what I am ultimately trying to achieve is to speed-up a call to agrep for when millions of patterns must be each checked against each of millions of x targets.
Great question. The long and short of it is "You cant" (in many cases) unless the function is visible in one of the header files in "src/include/". At least not that easily.
Not long ago I had a similar fun challenge, where I tried to get access to the
do_docallfunction (called bydo.call), and it is not a simple task. First of all, it is not directly possible to just#include <agrep.c>(or something similar). That file simply isn't available for inclusion, as it is not a part of the "src/include". It is compiled and the uncompiled file is removed (not to mention that one should never "include" a .c file).If one is willing to go the mile, then the next step one could look at is "copying" and "altering" the source code. Basically find the function in "src/main/agrep.c", copy it into your package and then fix any errors you find.
Problems with this approach:
R-extsthe internal structures ofsexprec_infois not made public (this is the base structure for all objects in R). Many internal function use the fields within this structure, so one has to "copy" the structure into your source code, to make it public to your code specifically.#include <Rcpp.h>prior to this file, you will need to go through each and every call to internal functions and likely add eitherR_orRf_.CDR,CARand similar does. The internal functions have a documented structure, where the first argument contains the full call passed to the function, and function like those 2 are used to access parts of the call. I did myself a solid and rewrotedo_docallchanging the input format, to avoid having to consider this. But this takes time. The alternative is to create apairlistaccording to the documentation, set its type as a call-sexp (the exact name is lost to me at the moment) and pass the appropriate arguments forop,argsandenv.sexprec_info(as described later), then you will need to be very careful about when you includeRinternalsandRcpp, as any one of these causes your code to crash and burn in the most beautiful and silent way if you include your header and these in the wrong order! Note that this even goes for[[Rcpp::export]], which may indeed turn out to include them in the wrong arbitrary order!If you are willing to go this far down the drainage, I would suggest carefully reading adv-R "R's C interface" and Chapter 2, 5 and 6 of R-ext and maybe even the R internal manual, and finally once that is done take a look at
do_docallfromsrc/main/coerce.cand compare it to the implementation in my repository cmdline.arguments/src/utils/{cmd_coerce.h, cmd_coerce.c}. In this version I haveSEXP's, that was used as a lookup. This caused a problem as I can't access the modified version, so my code is slightly altered with the old code blocked by the macro#if --- defined(CMDLINE_ARGUMENTS_MAYBE_IN_THE_FUTURE). Luckily the code causing a problem had a static answer, so I could work around this (but this might not always be the case).Rf_s as their macro version is not available (since I#include <Rcpp.h>at some point)This implementation will be frozen "for all time to come" as I've moved on to another branch (and this one is frozen for my own future benefit, if I ever want to walk down this path again).
I spent a few days scouring the internet for information on this and found 2 different posts, talking about how this could be achieved, and my approach basically copies this. Whether this is actually allowed in a cran package, is an whole other question (and not one that I will be testing out).
This approach goes again if you want to use not-public code from other packages. While often here it is as simple as "copy-paste" their files into your repository.
As a final side note, you mention the intend is to "speed up" your code for when you have to perform millions upon millions of calls to
agrep. It seems that this is a time where one should consider performing the task in parallel. Even after going through the steps outlined above, creating N parallel sessions to take care of K evaluations each (say 100.000), would be the first step to reduce computing time. Of course each session should be given a batch and not a single call toagrep.