I am trying to convert an annotated NLP model of size 1.2GB to a data frame. I am using the udpipe package for natural language processing in R with the following code:
# Additional Topic Models
# annotate and tokenize corpus
model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(model$file_model)
s <- udpipe_annotate(udmodel_english, cleaned_text_NLP)
options(java.parameters = "-Xmx32720m")
memory.limit(32*1024*1024*1024)
x <- data.frame(s)
Note that I have 32GB RAM and allocated all available memory to R to run the code. I also tried deleting large objects stored in the R environment space that are not relevant for running the above code. R cannot seem to allocate enough memory for the task and the following error message was the result:
Error in strsplit(x$conllu, "\n") :
could not allocate memory (4095 Mb) in C function 'R_AllocStringBuffer'
My question is twofold:
- What does the above error message mean?
- What workarounds are available to fix this issue?
You probably have quite a lot of documents to annotate. It's better to annotate in chunks, as shown at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-parallel.html
The following code annotates in chunks of 50 documents in parallel across 2 cores and basically does the same thing as your data.frame command. You will no longer have the issue, because the function runs strsplit on each chunk of 50 documents instead of on your full dataset, where the annotated text was apparently too large to fit within the limits of a string buffer in R.
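A minimal sketch of such a chunked, parallel call. It assumes that the `udpipe()` wrapper function accepts `parallel.cores` and `parallel.chunksize` as described in the package documentation, and that `cleaned_text_NLP` is a character vector with one document per element. The three sample strings are only stand-ins so the sketch runs on its own.

```r
library(udpipe)

# Model setup, same as in the question
model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(model$file_model)

# Assumption: one document per element of a character vector.
# Stand-in values only; replace with your own corpus.
cleaned_text_NLP <- c("The cat sat on the mat.",
                      "Dogs bark loudly at night.",
                      "R is a language for statistical computing.")

# Annotate in chunks of 50 documents across 2 cores.
# The result is already a data.frame with one row per token, so no separate
# data.frame() call on one huge CoNLL-U string is needed.
x <- udpipe(cleaned_text_NLP,
            object = udmodel_english,
            parallel.cores = 2L,
            parallel.chunksize = 50L)

head(x[, c("doc_id", "token", "lemma", "upos")])
```

If memory is still tight, a smaller `parallel.chunksize` should reduce the size of the text that is split at any one time, at the cost of more chunks to combine.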