I have a long character vector (e.g. "Hello World", etc.), 1.7M elements, and I need to substitute words in it using a map between two vectors, and save the result in the same vector. Here's a simple example:
library(qdap)
line = c("one", "two one", "four phones")
e = c("one", "two")
r = c("ONE", "TWO")
line = mgsub(e,r,line)
Result:
[1] "ONE" "TWO ONE" "four phONEs"
As you can see, each instance of e[j] in line gets substituted with r[j] and only r[j].
It works fine on a relatively small "line" and e->r vocabulary length, but when I run on length(line) = 1700000 and length(e) = 750, I reach the total allocated memory:
Reached total allocation of 7851Mb: see help(memory.size)
Any ideas how to avoid it?
I believe you can use fixed = TRUE. It sounds like you are concerned with spaces, so just add spaces to the ends of all 3 vectors you're working with. To run this whole sequence from ## Start to ## Finish (roughly the size of your data) takes Time difference of 2.906395 secs on 1.7 million strings. The majority of the time is spent at the end, stripping off the extra spaces.
Here qdap's mgsub is not useful. The package was designed for much smaller data. Additionally, fixed = TRUE is a sensible default because it is so much faster. The point of an add-on package is to improve upon workflow (sometimes field/task specific) through a reconfiguration of available tools. The mgsub function also has some error handling and other niceties that are useful in the analysis of transcripts, and these make the function hog memory. There's often a trade-off between safety + syntactic sugar vs. speed.
Note that two functions being named in similar ways should not imply anything, particularly if they are found in add-on packages. Even functions within base R have differently named and differently behaving defaults (look at the apply family of functions; this problem is less than ideal but is part of the historical evolution of R). It is incumbent upon you as a user to read the documentation rather than make assumptions.
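For illustration, here is a minimal sketch of the padding idea described above, using only base R and the toy data from the question. It is not the timed code referred to earlier; the names padded, e_pad and r_pad are made up for this example, and using substring() to strip the padding is just one possible choice. One caveat of this kind of padding: when the same word appears twice in a row (e.g. "one one"), the two matches would share a single space, so only the first occurrence gets replaced in a given pass.

```r
# Toy data from the question; the real vector would have ~1.7M elements.
line <- c("one", "two one", "four phones")
e <- c("one", "two")
r <- c("ONE", "TWO")

# Pad all three vectors with a space on each side so that only
# whole, space-delimited words can match.
padded <- paste0(" ", line, " ")
e_pad  <- paste0(" ", e, " ")
r_pad  <- paste0(" ", r, " ")

# One vectorised gsub() pass per pattern; fixed = TRUE treats the
# pattern as a literal string instead of a regular expression.
for (j in seq_along(e_pad)) {
  padded <- gsub(e_pad[j], r_pad[j], padded, fixed = TRUE)
}

# Strip the single leading and trailing space added above.
line <- substring(padded, 2, nchar(padded) - 1)
print(line)
```

With the padding in place, "phones" is left alone because " one " does not occur inside it, which is different from the substring behaviour shown in the question.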