R: How to prevent memory overflow when using mgsub in vector mode?


I have a long character vector (e.g. "Hello World", etc.) with 1.7M elements, and I need to substitute words in it using a mapping between two vectors, saving the result in the same vector. Here's a simple example:

library(qdap)
line = c("one", "two one", "four phones")
e = c("one", "two")
r = c("ONE", "TWO")
line = mgsub(e,r,line)

Result:

[1] "ONE"  "TWO ONE" "four phONEs"

As you can see, each instance of e[j] in line is replaced with r[j], and only r[j]. This works fine when line and the e -> r vocabulary are relatively small, but when I run it with length(line) = 1700000 and length(e) = 750, I hit the memory limit:

Reached total allocation of 7851Mb: see help(memory.size)

Any ideas how to avoid it?


There are 3 answers

6
Tyler Rinker (best answer)

I believe you can use fixed = TRUE.

It sounds like your concern is matching whole words delimited by spaces, so just pad all three vectors with a space on each end. Running the whole sequence from ## Start to ## Finish on 1.7 million strings (roughly the size of your data) takes about 2.9 seconds (Time difference of 2.906395 secs). Most of that time is spent at the end, stripping off the extra spaces.

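## Recreate data
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

## Repeat the sample up to roughly 1.7 million strings
line <- rep(line, 1700000/length(line))

## Start
## Pad with spaces so that only space-delimited whole words match
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")

## Apply each literal pattern in turn; fixed = TRUE skips the regex engine
for (i in seq_along(e2)) {
    line2 <- gsub(e2[i], r2[i], line2, fixed = TRUE)
}

## Strip the padding and keep the result
line2 <- gsub("^\\s|\\s$", "", line2, perl = TRUE)
## Finish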
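
As a small check of what the padding approach does and does not match, here is a minimal sketch on the five sample strings. The helper name replace_padded is made up for this illustration. It also exposes one limitation: gsub matches do not overlap, so two identical words in a row share the single space between them and only the first one is replaced. Words followed by punctuation are not matched either, since the padded pattern requires a space on both sides.

```r
# Hypothetical helper (not part of qdap or base R): wraps the padding trick
replace_padded <- function(x, from, to) {
    x <- paste0(" ", x, " ")
    for (i in seq_along(from)) {
        x <- gsub(paste0(" ", from[i], " "), paste0(" ", to[i], " "), x, fixed = TRUE)
    }
    # Remove the one leading and one trailing space added above
    gsub("^\\s|\\s$", "", x, perl = TRUE)
}

small <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
out <- replace_padded(small, c("one", "two", "caps"), c("ONE", "TWO", "CAPS"))
# "phones" and "capsule" are left alone: only space-delimited words match

# Limitation: the second "one" has lost its leading space to the first match
dup <- replace_padded("one one", "one", "ONE")
```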

qdap's mgsub is not the right tool here; the package was designed for much smaller data. That said, fixed = TRUE is a sensible default for it because it is so much faster. The point of add-on packages is to improve workflow (sometimes for a specific field or task) by reconfiguring available tools. mgsub also has some error handling and other niceties that are useful when analyzing transcripts, and these make the function hog memory. There is often a trade-off between safety plus syntactic sugar on one side and speed on the other.

Note that two functions having similar names does not imply anything about how they behave, particularly if they are found in add-on packages. Even functions within base R have inconsistently named and differently behaving defaults (look at the apply family of functions; this is less than ideal but is part of the historical evolution of R). It is incumbent upon you as a user to read the documentation rather than make assumptions.

1
Alexey Ferapontov

Update to the problem (to admins: if this doesn't deserve a separate answer, please merge it with the original one). The reason mgsub ran so fast compared to a simple for loop was that in mgsub the parameter fixed is TRUE by default, while in gsub it is FALSE by default! I just discovered this. I'd like to clarify again that fixed = TRUE is not appropriate for me, as I do not want to replace caps in capsule, only the whole word caps. That is, I am forced to paste \\b on both sides of each pattern. Here are three snippets from my code (I tested fixed = TRUE in gsub just to see the time difference; I am not going to use it).

#This is with mgsub. Now with fixed = FALSE!!
i = mgsub(paste("\\b",orig,"\\b",sep=""),change,i,fixed=FALSE)

#This is with a for loop. fixed=TRUE in one of lines is for test purposes only. Do not use
for(k in seq_along(orig)) {
  i = gsub(paste("\\b",orig[k],"\\b",sep=""),change[k],i)
  #i = gsub(orig[k],change[k],i,fixed=TRUE)
}
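
To illustrate why fixed = TRUE does not work for whole-word replacement, here is a small sketch using only base gsub. The sample strings are made up for the illustration. With fixed = TRUE the pattern is treated as a literal string, so caps is replaced inside capsule, and \\b would be searched for as literal characters instead of acting as a word boundary.

```r
x <- c("and a capsule", "But here's a caps key")

# Literal match: also hits the "caps" inside "capsule"
literal <- gsub("caps", "CAPS", x, fixed = TRUE)

# Regex with word boundaries: only the standalone word is replaced
bounded <- gsub("\\bcaps\\b", "CAPS", x)

# With fixed = TRUE, "\\bcaps\\b" is the literal text \bcaps\b, which never occurs
no_match <- gsub("\\bcaps\\b", "CAPS", x, fixed = TRUE)
```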

Here are the times and memory usage for all three cases at different input sizes N:

N     | mgsub, fixed=F                    | gsub, fixed=F     | gsub, fixed=T
-----------------------------------------------------------------------------------
100k  | 41sec, M > 2.3GB                  | 37sec, M > 0.9GB  | 9sec, M > 0.8GB
200k  | 99sec, M > 4GB                    | 74sec, M > 1.1GB  | 18sec, M > 1.3GB
300k  | 132sec, M > 5.6GB + disk involved | 112sec, M > 2.6GB | 28sec, M > 1.6GB

Thus, I conclude that for my application, where fixed must be FALSE, there's no advantage to using mgsub. In fact, the for loop is faster and does not cause memory overflow!
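
For anyone wanting to gather comparable numbers, here is a rough sketch of how such measurements could be taken in base R. The function name time_replace and the small test data are assumptions made for illustration; this is not the code that produced the table. system.time gives elapsed seconds, and the "max used" figure from gc() after a gc(reset = TRUE) gives an approximate peak for the R heap, which will not exactly match what the operating system reports.

```r
# Hypothetical helper: times a word-boundary gsub loop and reports peak R memory
time_replace <- function(x, from, to) {
    invisible(gc(reset = TRUE))              # reset the "max used" statistics
    elapsed <- system.time({
        for (k in seq_along(from)) {
            x <- gsub(paste0("\\b", from[k], "\\b"), to[k], x)
        }
    })[["elapsed"]]
    g <- gc()
    peak_mb <- sum(g[, ncol(g)])             # last column is "max used" in Mb
    list(result = x, elapsed = elapsed, peak_mb = peak_mb)
}

sample_lines <- rep(c("one", "two one", "four phones"), 1000)
res <- time_replace(sample_lines, c("one", "two"), c("ONE", "TWO"))
res$elapsed   # seconds
res$peak_mb   # approximate peak, in Mb
```

Running it at increasing input sizes would show how time and memory scale, which is what matters when deciding whether the full 1.7M-element vector will fit.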

Thanks to all involved. I wish I could give commenters credits, but I don't know how to do it in "Comments"

3
Tyler Rinker

The stringi package provides fast, consistent tools for a lot of string manipulation:

library(stringi)
stri_replace_all_regex(line, paste0("\\b", e, "\\b"), r, vectorize_all = FALSE)

Darn near as fast (fractions of a second different) as the other method, and more straightforward.
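
A small worked example of this call on the sample data, assuming the stringi package is installed (the sketch skips the call otherwise). With vectorize_all = FALSE, each pattern and replacement pair is applied in turn to every string, which is the behavior wanted here; the default vectorize_all = TRUE would instead pair strings, patterns, and replacements element by element.

```r
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

out <- NULL
if (requireNamespace("stringi", quietly = TRUE)) {
    # \\b keeps "phones" and "capsule" from being touched
    out <- stringi::stri_replace_all_regex(
        line, paste0("\\b", e, "\\b"), r, vectorize_all = FALSE
    )
}
```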