Using grepl in R

In which cases would these two implementations give different results?

data(mtcars)
firstWay <- mtcars[grepl('6',mtcars$cyl),]
SecondWay <- mtcars[mtcars$cyl=='6',]

If they always give the same results, which one is recommended, and why? Thanks.

There are 3 answers

Answer by C_Z_ (best answer)

Using the microbenchmark package, we can see which is faster:

library(microbenchmark)
m <- microbenchmark(mtcars[grepl('6',mtcars$cyl),], mtcars[mtcars$cyl=='6',], times=10000)
print(m)

    Unit: microseconds
                         expr     min      lq     mean  median      uq      max neval
 mtcars[grepl("6", mtcars$cyl), ] 229.080 234.738 247.5324 236.693 239.417 6713.914 10000
      mtcars[mtcars$cyl == "6", ] 214.902 220.210 231.0240 221.956 224.471 7759.507 10000

In this benchmark == is slightly faster (median of about 222 versus 237 microseconds), so use it when an exact match is what you need.

However, the two do not do exactly the same thing. grepl checks whether the pattern occurs anywhere in each string, whereas == checks whether each element is equal to the given value:

grepl("6", mtcars$disp)

 [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

mtcars$disp == "6"

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
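
The difference is easier to see on a tiny vector. The following is only a minimal sketch: the vector x is made up for illustration and is not part of the original answer. In both calls the numbers are converted to character strings first; grepl then looks for a 6 anywhere in each string, while == needs the whole string to be "6".

```r
# Hypothetical values, chosen only to illustrate the difference
x <- c(6, 16, 60, 4, 8)

contains_six <- grepl("6", x)   # TRUE wherever a "6" occurs somewhere in the value
equals_six   <- x == "6"        # TRUE only where the value is exactly 6

print(data.frame(x, contains_six, equals_six))

# mtcars$cyl only takes the values 4, 6 and 8, so on that column
# the two approaches from the question select the same rows
data(mtcars)
same_on_cyl <- identical(grepl("6", mtcars$cyl), mtcars$cyl == "6")
print(same_on_cyl)
```

So the two ways in the question agree on cyl only because no cylinder count other than 6 contains the digit 6. On a column such as disp they diverge.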

Answer by SabDeM

I think the first difference is that with grepl you can subset even when you do not know the exact value in advance: instead of matching exactly 6, you can search for rows whose value starts or ends with 6.

If you try this with ordinary == subsetting you get an empty result, because ^6, for example, is not treated as a regular expression but as a literal string made of the characters ^ and 6.
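
As a rough illustration of the point about anchors, here is a small sketch. The vector x is hypothetical; the patterns "^6" and "6$" are standard regular-expression anchors for "starts with" and "ends with".

```r
# Hypothetical character vector for illustration
x <- c("6", "60", "16", "4")

starts_with_six <- grepl("^6", x)   # regex anchor: string begins with 6
ends_with_six   <- grepl("6$", x)   # regex anchor: string ends with 6
literal_compare <- x == "^6"        # compares against the two-character string "^6"

print(starts_with_six)
print(ends_with_six)
print(literal_compare)

# Base R also has startsWith()/endsWith() for fixed (non-regex) prefixes and suffixes
print(startsWith(x, "6"))
```

Since none of the elements is literally "^6", the == comparison is FALSE everywhere and subsetting with it returns no rows.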

There are surely other differences, and more experienced users will probably give more detailed answers.

As for which one to prefer, efficiency may be a consideration:

system.time(mtcars[grepl('^6',mtcars$cyl),])
   user  system elapsed 
  0.029   0.002   0.035 
system.time(mtcars[mtcars$cyl=='6',])
   user  system elapsed 
  0.031   0.002   0.046 

This small example is only a rough guide; as @Nick K suggested first, a more precise investigation should use microbenchmark. With a big dataset, I doubt that a professional user (or anyone who needs speed) would choose either of them. They would more likely rely on data.table or on tools like dplyr, which are written in a lower-level language and are therefore faster.
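
A single system.time call on a 32-row data frame mostly measures noise, so timings like the ones above can easily flip between runs. The sketch below uses base R only and repeats each comparison on a larger vector. The vector cyl, its size and the repetition count are all assumptions made for illustration, and the actual timings will depend on the machine.

```r
# Made-up data: 100,000 values drawn from the cylinder counts 4, 6 and 8
set.seed(1)
cyl <- sample(c(4, 6, 8), 1e5, replace = TRUE)

# Repeat each comparison 20 times so the elapsed time is less dominated by noise
t_grepl <- system.time(for (i in 1:20) r_grepl <- grepl("^6$", cyl))
t_equal <- system.time(for (i in 1:20) r_equal <- cyl == 6)

print(t_grepl)
print(t_equal)

# Whatever the timings are, both approaches should select the same elements here
print(identical(r_grepl, r_equal))
```

For careful measurements, microbenchmark remains the better tool; this only shows how to get beyond a single timing.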

Answer by Nick Kennedy

mtcars$cyl is a numeric column, so you would be better off comparing it to a number using mtcars[mtcars$cyl == 6, ].
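
A short sketch of why comparing to a number is the safer habit. On cyl both forms select the same rows, but comparing a number to a string converts the number to text first, which can surprise you. The variable big is a hypothetical value chosen to show this, and the behaviour noted in the comments assumes R's default scipen setting.

```r
data(mtcars)
print(class(mtcars$cyl))            # storage type of the column

by_number <- mtcars[mtcars$cyl == 6, ]
by_string <- mtcars[mtcars$cyl == "6", ]
print(identical(by_number, by_string))   # same rows for this column

# Coercion gotcha: the number is converted to text before the comparison
big <- 100000
print(as.character(big))   # the text form uses scientific notation
print(big == "100000")     # so the string comparison does not match
print(big == 100000)       # the numeric comparison does
```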

But the difference between the equality operator == and grepl is that == will only be TRUE for members of the vector which are equal to "6", while grepl will match any member of the vector which has a 6 anywhere within it.

So, for example:

String                                                   ==     grepl
6                                                        TRUE   TRUE
123456                                                   FALSE  TRUE
6ABC                                                     FALSE  TRUE
This is a long sentence which happens to have a 6 in it  FALSE  TRUE
Whereas this long sentence does not                      FALSE  FALSE

The equivalent grepl pattern would be "^6$". There's a tutorial (one of many) on regex at http://www.regular-expressions.info/tutorial.html.
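
The table above can be reproduced with a few lines of R. This is a minimal sketch using the same five strings; the column names of the result are my own choice. The last two lines add a case that is not in the table, missing values, where the two approaches also differ.

```r
x <- c("6", "123456", "6ABC",
       "This is a long sentence which happens to have a 6 in it",
       "Whereas this long sentence does not")

result <- data.frame(
  equals   = x == "6",           # exact equality
  grepl    = grepl("6", x),      # a 6 anywhere in the string
  anchored = grepl("^6$", x)     # the whole string is exactly 6
)
print(result)

# Missing values: == propagates NA, grepl reports FALSE
print(NA == "6")
print(grepl("6", NA))
```

The anchored column matches the equals column for these strings. The NA case matters when subsetting: an NA in a logical index produces a row of NAs, whereas the FALSE from grepl simply drops the row.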