How to make a list with options, of which one is compulsory in regex, R?


I am stuck on a regex problem in R. I have three sentences:

s1 <- "today john jack and joe go to the beach"
s2 <- "today joe and john go to the beach"
s3 <- "today jack and joe go to the beach"

For each sentence, I want to know whether john is going to the beach today, regardless of the other two guys. So the expected results for the three sentences are (in order):

TRUE
TRUE
FALSE 

I am trying to do this with grepl in R. The following regex returns TRUE for all three sentences:

print(grepl("today (john|jack|joe|and| )+go to the beach", s1))
print(grepl("today (john|jack|joe|and| )+go to the beach", s2))
print(grepl("today (john|jack|joe|and| )+go to the beach", s3))

It helps when I sandwich "john", the compulsory word, between two identical quantifiers for the other, optional words:

print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s1))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s2))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s3))

However, this is clearly poor style, because the group of optional words is repeated. Does anyone have a more elegant solution?


There are 2 answers

5
Wiktor Stribiżew

You may use .* wherever you do not know what may appear:

s <- c("today john jack and joe go to the beach", "today joe and john go to the beach", "today jack and joe go to the beach")
grepl("today .*\\bjohn\\b.* go to the beach", s)
## => [1]  TRUE  TRUE FALSE


The \b word boundaries are used to match john as a whole word.
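
To illustrate what the word boundaries buy you, here is a minimal sketch. The two sentences in extra are hypothetical and not part of the question; they only differ in whether "john" appears as a whole word or as part of "johnny".

```r
# Hypothetical sentences, made up for this illustration
extra <- c(
  "today johnny and joe go to the beach",  # "john" only inside "johnny"
  "today john and joe go to the beach"     # "john" as a whole word
)

# Without word boundaries, "johnny" also counts as a hit
loose <- grepl("today .*john.* go to the beach", extra)
print(loose)
## => [1] TRUE TRUE

# With \\b on both sides, only the whole word "john" matches
strict <- grepl("today .*\\bjohn\\b.* go to the beach", extra)
print(strict)
## => [1] FALSE  TRUE
```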

EDIT: If you have a pre-defined whitelist of words that may appear between today and go, you cannot just match anything; you need an alternation group that lists all the alternatives. If you really want to shorten the pattern, use a subroutine call within a PCRE regex:

> grepl("today ((?:jack|joe|and| )*)john(?1)\\bgo to the beach", s, perl=TRUE)
[1]  TRUE  TRUE FALSE


Here, the alternatives are wrapped within a non-capturing group that is quantified, and the whole group is wrapped with a "technical" capturing group that can be recursed with the (?1) subroutine call (1 means capturing group #1).
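
As a rough check that the whitelist is really enforced, the sketch below adds a fourth, hypothetical sentence containing "mary", which is not among the allowed words. It also shows one possible alternative if you would rather not depend on perl=TRUE: assembling the repeated part with paste0(). The names others, pat_sub and pat_built are only illustrative, and the assembled pattern mirrors the "sandwich" pattern from the question rather than the PCRE one.

```r
s <- c(
  "today john jack and joe go to the beach",
  "today joe and john go to the beach",
  "today jack and joe go to the beach",
  "today mary and john go to the beach"  # hypothetical: "mary" is not whitelisted
)

# (?1) re-runs the *pattern* of group 1 (not the text it matched),
# so the words before and after "john" are free to differ
pat_sub <- "today ((?:jack|joe|and| )*)john(?1)\\bgo to the beach"
res_sub <- grepl(pat_sub, s, perl = TRUE)
print(res_sub)
## => [1]  TRUE  TRUE FALSE FALSE

# Alternative sketch: write the optional group once and paste it in twice.
# This reproduces the question's pattern and works with the default engine.
others <- "(jack|joe|and| )*"
pat_built <- paste0("today ", others, "john", others, "go to the beach")
res_built <- grepl(pat_built, s)
print(res_built)
## => [1]  TRUE  TRUE FALSE FALSE
```

Either way the list of optional words is written only once in the source, which is what the question was after.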

2
Konrad Rudolph

Do you need to validate the rest of the sentence? Because otherwise I’d go for the simple version:

sentences = c(s1, s2, s3)
grepl('\\bjohn\\b', sentences)
# [1]  TRUE  TRUE FALSE

This performs less validation, but it expresses the intent of the statement much more obviously: “does John appear in the sentence?”
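
If you do need the validation as well, one possible approach (a sketch, not the only way) is to keep the two concerns separate: run the question's original whitelist pattern to validate the sentence, run the simple john check, and combine the results with &. The fourth sentence is hypothetical and only there to show a case where john appears but the sentence does not validate.

```r
sentences <- c(
  "today john jack and joe go to the beach",
  "today joe and john go to the beach",
  "today jack and joe go to the beach",
  "john walks to the park"  # hypothetical: mentions john, wrong sentence shape
)

# Step 1: does the sentence have the expected shape? (pattern from the question)
valid <- grepl("today (john|jack|joe|and| )+go to the beach", sentences)

# Step 2: does john appear as a whole word?
has_john <- grepl("\\bjohn\\b", sentences)

print(valid & has_john)
# [1]  TRUE  TRUE FALSE FALSE
```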