Karl Broman's post (https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/) got me playing with regex and n-grams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this, but I am interested in the regex logic (i.e., it was a self-challenge that I failed to meet).
Below I give a minimal example and the desired output. The problem in my attempt is twofold:
1. The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (E.g., I want "like" to be available for "like toast" after it's already been consumed in "I like".)
2. I couldn't make the space between words non-captured (notice the trailing white space in my output even though I used (?:\\s*)). How can I not capture trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams.
PS: I know about \\w, but I don't consider underscores and numbers as word parts, and I do consider ' as a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
x,
pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
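Here's one way using base R regex. This can be easily extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern)). Because a look-ahead matches without consuming any characters, each word remains available as the start of the next match.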
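A minimal sketch of the look-ahead idea in base R, under a few assumptions: the helper name extract_ngrams is hypothetical, it takes a single string, and it uses a negative look-behind instead of \\b to mark word starts, so that an apostrophe inside a word (as in "don't") does not begin a new match. The word class [A-Za-z'] is the one from the question.

```r
# Hypothetical helper (not from the original post): overlapping n-grams
# from a single string, using a capture group inside a look-ahead.
extract_ngrams <- function(x, n = 2) {
  word <- "[A-Za-z']+"
  # (n - 1) words, each followed by whitespace, then one final word,
  # so the captured text never ends in a space
  gram <- paste0("(?:", word, "\\s+){", n - 1, "}", word)
  # zero-width match at every word start; the n-gram itself is group 1
  pattern <- paste0("(?<![A-Za-z'])(?=(", gram, "))")
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))  # no n-gram found
  # the overall matches have length 0, so read the capture group instead
  starts <- as.vector(attr(m, "capture.start")[, 1])
  lens <- as.vector(attr(m, "capture.length")[, 1])
  substring(x, starts, starts + lens - 1)
}

x <- "I like toast and jam."
extract_ngrams(x, n = 2)
## [1] "I like"     "like toast" "toast and"  "and jam"
```

Since the overall match is empty, the regex engine only advances one character per attempt, which is what lets "like" start a new 2-gram after it has already appeared at the end of "I like". Changing n (for example, extract_ngrams(x, n = 3)) should give the corresponding overlapping n-grams.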