R AND Operator in Regex

I am trying to write an expression that takes a few long paragraphs and finds the lines that contain two specific words, so I am looking for an AND operator. Is there a way to do this?

For example:

c <- ("She sold seashells by the seashore, and she had a great time while doing so.")

I want an expression that finds a line with both "sold" and "great" in the line.

I've tried something like:

grep("sold", "great", c, value = TRUE) 

Any ideas?

Thanks so much!

There are 3 answers

0
ira On

While in most cases I would go with the stringr package, as already suggested in CPak's answer, there is also a grep solution to this:

# create the sample string
c <- ("She sold seashells by the seashore, and she had a great time while doing so.")

# match any sold and great string within the text
# ignore case so that Sold and Great are also matched
grep("(sold.*great|great.*sold)", c, value = TRUE, ignore.case = TRUE)

Hmm, not bad, right? But what if there was a word merely containing the phrase sold or great?

# set up alternative string
d <- ("She saw soldier eating seashells by the seashore, and she had a great time while doing so.")
# even soldier is matched here:
grep("(sold.*great|great.*sold)", d, value = TRUE, ignore.case = TRUE)

So you might want to use word boundaries, i.e. match the entire word:

# \\b is a metacharacter that matches at word boundaries (the start or end of a word)
grep("(\\bsold\\b.*\\bgreat\\b|\\bgreat\\b.*\\bsold\\b)", d, value = TRUE, ignore.case = TRUE)

\\b matches at a word boundary: before the first character of the string or after the last one (when that character is a word character), or between two characters where one is a word character and the other is not.

More on the \b metacharacter here: http://www.regular-expressions.info/wordboundaries.html
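
Listing every ordering in an alternation becomes unwieldy once more than two words are required. A minimal sketch of an alternative, assuming Perl-compatible regular expressions via perl = TRUE: each lookahead (?=...) checks for one word anywhere in the line without consuming characters, so the conditions are effectively ANDed. The vector lines_vec and its contents are made up for illustration.

```r
# hypothetical sample lines, one element per line
lines_vec <- c(
  "She sold seashells by the seashore, and she had a great time while doing so.",
  "She saw a soldier eating seashells, and she had a great time.",
  "It was a great day, and everything was sold."
)

# each (?=...) lookahead is tested from the start of the line,
# so both whole words must be present, in either order
both <- grep("^(?=.*\\bsold\\b)(?=.*\\bgreat\\b)", lines_vec,
             value = TRUE, ignore.case = TRUE, perl = TRUE)
print(both)
```

With this sample input the first and third lines are returned, while the "soldier" line is rejected by the word boundaries. Adding a third required word only means appending another lookahead.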

3
Kevin Arseneau On

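You can create two capture groups, assuming the order of the words is unimportant. Be aware that this pattern also matches a line in which the same word occurs twice (for example "sold ... sold"), so it is not a strict AND: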

grep("(sold|great)(?:.+)(sold|great)", c, value = TRUE)
0
CPak On

The duplicate post might get you started, but I don't think it addresses your question directly.

You could combine stringr::str_detect with all:

pos <- ("She sold seashells by the seashore, and she had a great time while doing so.") # contains sold and great
neg <- ("She bought seashells by the seashore, and she had a great time while doing so.") # contains great

pattern <- c("sold", "great")

library(stringr)
all(str_detect(pos,pattern))
# [1] TRUE

all(str_detect(neg,pattern))
# [1] FALSE

stringr::str_detect has the advantage (over grepl) of being vectorised over the pattern argument, so it can check a character vector of patterns in one call.
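
Note that all() collapses the result to a single TRUE or FALSE, which suits one string. For a character vector with one line per element, a possible sketch in base R (no extra packages; the names txt, words, hits and has_all are illustrative, not from the question) is to run grepl once per word and AND the resulting logical vectors:

```r
# hypothetical input: one line per element
txt <- c(
  "She sold seashells by the seashore, and she had a great time while doing so.",
  "She bought seashells by the seashore, and she had a great time while doing so.",
  "Everything was sold."
)
words <- c("sold", "great")

# one logical vector per word (literal substring match), then AND them element-wise
hits <- lapply(words, function(w) grepl(w, txt, fixed = TRUE))
has_all <- Reduce(`&`, hits)

# keep only the lines that contain every word
print(txt[has_all])
```

Because fixed = TRUE matches plain substrings, "soldier" would count as containing "sold"; if whole words are needed, drop fixed = TRUE and wrap each word in \\b instead.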