Replace apply function with lapply


I am building a dataset that holds aggregate counts for different combinations of words, matched using regex. Each row has its own regex pattern, and I want to count how many rows of a second dataset match that pattern.

The first dataset (df1) looks like this :

   word1    word2               pattern
   air      10     (^|\\s)air(\\s.*)?\\s10($|\\s)
 airport    20   (^|\\s)airport(\\s.*)?\\s20($|\\s)
   car      30     (^|\\s)car(\\s.*)?\\s30($|\\s)

The other dataset (df2) from which I want to match this looks like

   sl_no    query
   1      air 10     
   2    airport 20   
   3    airport 20
   3    airport 20
   3      car 30

The final output I want should look like this:

   word1    word2   total_occ
   air      10      1
   airport  20      3
   car      30      1

I am able to do this using apply in R:

# Count how many queries in df2 match the pattern of one row of df1
process <- function(x) {
  length(grep(x[["pattern"]], df2$query))
}

df1$total_occ <- apply(df1, 1, process)

but this is slow because my dataset is fairly large.

I found that the "mclapply" function from the "parallel" package can run this kind of work on multiple cores, so I am first trying to get it working with lapply. It gives me the following error:

lapply(df,process)

Error in x[, "pattern"] : incorrect number of dimensions

Please let me know what changes I should make to run lapply correctly.

There is 1 answer

Gavin Simpson (best answer)

Why not just lapply() over the pattern?
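For context on the error in the question: a data frame is a list of columns, so lapply() over a data frame hands the function one whole column at a time rather than one row, and a plain vector cannot be indexed with [, "pattern"]. The following is a minimal sketch of that behaviour, assuming df1 is rebuilt by hand from the table in the question (the names col_classes and msg are only illustrative).

```r
# Rebuild df1 from the question (assumed layout, strings kept as character)
df1 <- data.frame(
  word1   = c("air", "airport", "car"),
  word2   = c(10, 20, 30),
  pattern = c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
              "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
              "(^|\\s)car(\\s.*)?\\s30($|\\s)"),
  stringsAsFactors = FALSE
)

# lapply() visits each column in turn, so the result has one entry per column
col_classes <- lapply(df1, class)

# Two-dimensional indexing on a single column (a plain vector) fails;
# capture the error message instead of stopping
msg <- tryCatch(df1$word1[, "pattern"],
                error = function(e) conditionMessage(e))
print(msg)
```

Iterating over df1$pattern, or over seq_len(nrow(df1)), avoids the problem because each element is then a single pattern or row index.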

Here I've just pulled out your patterns, but this could just as easily be df1$pattern:

pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
             "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
             "(^|\\s)car(\\s.*)?\\s30($|\\s)")

Using your data for df2

txt <- "sl_no    query
   1      'air 10'     
   2    'airport 20'   
   3    'airport 20'
   3    'airport 20'
   3      'car 30'"
df2 <- read.table(text = txt, header = TRUE)

Just iterate on pattern directly

> lapply(pattern, grep, x = df2$query)
[[1]]
[1] 1

[[2]]
[1] 2 3 4

[[3]]
[1] 5

If you want the more compact output suggested in your question, run lengths() over the returned list. (Thanks to @Frank for pointing out the new function lengths().) For example:

lengths(lapply(pattern, grep, x = df2$query))

which gives

> lengths(lapply(pattern, grep, x = df2$query))
[1] 1 3 1

You can add this to the original data via

dfnew <- cbind(df1[, 1:2],
               Count = lengths(lapply(pattern, grep, x = df2$query)))
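Since the question was aiming at mclapply(), the same idea should carry over more or less directly. What follows is only a sketch under stated assumptions: df1 and df2 are rebuilt by hand from the question, n_cores is an illustrative name, and the core count of 2 is an arbitrary choice (forking is not available on Windows, so the sketch falls back to a single core there).

```r
library(parallel)

# Rebuild the question's data (assumed layout)
df1 <- data.frame(
  word1   = c("air", "airport", "car"),
  word2   = c(10, 20, 30),
  pattern = c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
              "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
              "(^|\\s)car(\\s.*)?\\s30($|\\s)"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  sl_no = c(1, 2, 3, 3, 3),
  query = c("air 10", "airport 20", "airport 20", "airport 20", "car 30"),
  stringsAsFactors = FALSE
)

# mclapply() relies on forking, which Windows lacks; use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per pattern: count the queries that the pattern matches
counts <- mclapply(df1$pattern,
                   function(p) sum(grepl(p, df2$query)),
                   mc.cores = n_cores)

# Flatten the list of counts into a column of df1
df1$total_occ <- unlist(counts)
print(df1[, c("word1", "word2", "total_occ")])
```

Whether this is actually faster depends on the size of the data: with only a few patterns, the overhead of forking can outweigh the gain, so it is worth timing both versions on the real dataset.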