I am suffering from a regex problem in R here. I have three sentences:
s1 <- "today john jack and joe go to the beach"
s2 <- "today joe and john go to the beach"
s3 <- "today jack and joe go to the beach"
I want to know, for each sentence, whether john is going to the beach today, regardless of the other two guys. So the outcome for the three sentences should be (in order):
TRUE
TRUE
FALSE
I am trying to do this with grepl in R. The following regex gives TRUE for all three sentences:
print(grepl("today (john|jack|joe|and| )+go to the beach", s1))
print(grepl("today (john|jack|joe|and| )+go to the beach", s2))
print(grepl("today (john|jack|joe|and| )+go to the beach", s3))
It helps when I sandwich "john", the compulsory word, between two identical quantifiers for the other, optional words:
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s1))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s2))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s3))
However, this is obviously bad coding (the optional group is repeated). Does anyone have a more elegant solution?
You may use `.*` in places where you do not know what may appear, that is, on both sides of the compulsory word.
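The answer's original code did not survive here, so the following is only a sketch of what the `.*` approach might look like. The exact pattern and the names `sentences`, `pattern_any` and `result_any` are assumptions, not the answerer's code.

```r
# The three sentences from the question
sentences <- c(
  "today john jack and joe go to the beach",
  "today joe and john go to the beach",
  "today jack and joe go to the beach"
)

# .* on either side absorbs whatever else appears between "today" and "go";
# \\b on both sides of john makes sure it is matched as a whole word.
pattern_any <- "today .*\\bjohn\\b.* go to the beach"

# grepl is vectorised, so one call covers all sentences
result_any <- grepl(pattern_any, sentences)
print(result_any)  # TRUE TRUE FALSE
```

Note that this version accepts any text around john, not only the other two names.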
The `\b` word boundaries are used to match `john` as a whole word.

EDIT: If you have a pre-defined whitelist of words that may appear between `today` and `go`, you cannot just match anything. You need an alternation group listing all of those alternatives and, if you really want to shorten the pattern, a subroutine call within a PCRE regex. There, the alternatives are wrapped in a quantified non-capturing group, and that group is wrapped in a "technical" capturing group whose pattern can be re-run with the `(?1)` subroutine call (`1` means capturing group #1).
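A possible reconstruction of the whitelist version, assuming the whitelist is exactly the words from the question's own pattern (jack, joe, and, space). The pattern, the variable names and the fourth test sentence (with the made-up name "mary") are illustrative assumptions. `perl = TRUE` is required because subroutine calls are a PCRE feature.

```r
sentences <- c(
  "today john jack and joe go to the beach",
  "today joe and john go to the beach",
  "today jack and joe go to the beach",
  "today john and mary go to the beach"  # hypothetical: mary is not whitelisted
)

# Group 1 wraps the quantified non-capturing alternation;
# (?1) re-runs the pattern of group 1 after john instead of repeating it.
pattern_sub <- "today ((?:jack|joe|and| )*)john(?1)go to the beach"

# Subroutine calls need the PCRE engine, hence perl = TRUE
result_sub <- grepl(pattern_sub, sentences, perl = TRUE)
print(result_sub)  # TRUE TRUE FALSE FALSE
```

Unlike a backreference such as `\1`, which would require the same text to appear again, `(?1)` reuses the group's pattern, so the words before and after john may differ.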