I would be most grateful for advice. I would like to split my strings at each comma, but text inside parentheses that contains a comma must be kept intact (i.e. not split). There are 4 possibilities in my data regarding whitespace and commas:
1. no space after the comma within the parentheses: (c,d)
2. a space after the comma within the parentheses: (x, y)
3. a space after the comma outside the parentheses: url.d, url.e
4. no space after the comma outside the parentheses: url.d,url.e
In my example below, url.b (c,d) needs to stay together, as does url.h (x, y). In the output below, rows 8 and 9 need to appear together as a single row, and row 11 needs to be split.
my_df = data.frame(id=1:4, urls=c("url.a, url.b (c,d), url.c",
"url.d, url.e, url.f",
"url.g, url.h (x, y), url.i",
"url.d,url.e, url.f"))
tidytext::unnest_tokens(my_df, out, urls, token = 'regex', pattern=",\\s+")
id out
1 1 url.a
2 1 url.b (c,d)
3 1 url.c
4 2 url.d
5 2 url.e
6 2 url.f
7 3 url.g
8 3 url.h (x
9 3 y)
10 3 url.i
11 4 url.d,url.e
12 4 url.f
Thank you!
(2nd attempt after test data update)
Here's one strategy to try out: temporarily replace the commas inside parentheses with a placeholder, `|`, and then use `",\\s*"` for splitting; it will match all commas with optional trailing whitespace.
A closer look at `str_replace_all(..., \(x) do_something(x))`: the pattern `"\\(([^)]*)\\)"` is used to find substrings that are enclosed in parentheses. But instead of a replacement string we'll use a replacement function that modifies our match and replaces `,` with a placeholder `|` (assuming `|` is not used anywhere in the `urls` column).
Adding those 2 pieces together eliminates all commas between `()` before the split, so only the commas outside parentheses act as separators.
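The steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not necessarily the code that was originally posted: `protect_commas()` is a hypothetical helper name, the last step that turns the placeholder back into a comma is an addition, and the sketch assumes R >= 4.1 (for the `\(m)` lambda syntax) plus a stringr version that accepts a function as the replacement in `str_replace_all()`.

```r
library(stringr)
library(tidytext)

my_df <- data.frame(
  id = 1:4,
  urls = c("url.a, url.b (c,d), url.c",
           "url.d, url.e, url.f",
           "url.g, url.h (x, y), url.i",
           "url.d,url.e, url.f")
)

# Hypothetical helper: inside every "(...)" substring, swap "," for the
# placeholder "|". Assumes "|" never occurs in the data and that
# parentheses are not nested.
protect_commas <- function(x) {
  str_replace_all(x, "\\(([^)]*)\\)", \(m) str_replace_all(m, ",", "|"))
}

# Step 1: hide the commas that must not trigger a split
my_df$urls <- protect_commas(my_df$urls)

# Step 2: split on a comma followed by optional whitespace.
# Note: unnest_tokens() lowercases tokens by default (to_lower = TRUE);
# the example data is already lowercase.
res <- unnest_tokens(my_df, out, urls, token = "regex", pattern = ",\\s*")

# Step 3 (an assumption, not stated in the answer): restore the commas.
# fixed() is needed because "|" means alternation in a regex.
res$out <- str_replace_all(res$out, fixed("|"), ",")

res
```

With the example data, `url.b (c,d)` and `url.h (x, y)` should each stay in a single row, while `url.d,url.e` should end up as two rows. If `|` can occur in the real data, any other character that never appears there would work as the placeholder.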
Created on 2023-11-22 with reprex v2.0.2