Using unnest_tokens() to split a column selectively (no split if comma inside a bracket)

I would be most grateful for advice. I would like to split my strings at each comma, but a comma inside brackets must not trigger a split (the bracketed text has to stay intact). There are 4 possible combinations of whitespace and commas in my data:

1. no space after the comma within the parentheses: (c,d)
2. a space after the comma within the parentheses: (x, y)
3. a space after the comma outside the parentheses: url.d, url.e
4. no space after the comma outside the parentheses: url.d,url.e

In my example below, url.b (c,d) needs to stay together, as does url.h (x, y). In the output below, rows 8 and 9 need to appear together as a single row, and row 11 needs to be split.

my_df = data.frame(id=1:4, urls=c("url.a, url.b (c,d), url.c",
                                  "url.d, url.e, url.f",
                                  "url.g, url.h (x, y), url.i",
                                  "url.d,url.e, url.f"))


tidytext::unnest_tokens(my_df, out, urls, token = 'regex', pattern=",\\s+")

   id         out
1   1       url.a
2   1 url.b (c,d)
3   1       url.c
4   2       url.d
5   2       url.e
6   2       url.f
7   3       url.g
8   3    url.h (x
9   3          y)
10  3       url.i
11  4 url.d,url.e
12  4       url.f

Thank you!

Answer by margusl (best answer)

(2nd attempt after test data update)

Here's one strategy to try out:

  • replace commas inside parentheses with a placeholder character (let's pick |)
  • split on ",\\s*", which matches every remaining comma together with any trailing whitespace
  • restore the commas by replacing the placeholder

library(dplyr)
library(stringr)
library(tidytext)

my_df = data.frame(id=1:4, urls=c("url.a, url.b (c,d), url.c",
                                  "url.d, url.e, url.f",
                                  "url.g, url.h (x, y), url.i",
                                  "url.d,url.e, url.f"))

# before applying unnest_tokens(), replace commas inside parentheses
# with a placeholder, `|`
my_df %>% 
  mutate(urls = str_replace_all(urls, 
                                "\\(([^)]*)\\)", 
                                \(match) str_replace_all(match, fixed(","), "|"))) %>% 
  unnest_tokens(out, urls, token = 'regex', pattern=",\\s*") %>% 
  # restore commas
  mutate(out = str_replace_all(out, fixed("|"), ","))
#>    id          out
#> 1   1        url.a
#> 2   1  url.b (c,d)
#> 3   1        url.c
#> 4   2        url.d
#> 5   2        url.e
#> 6   2        url.f
#> 7   3        url.g
#> 8   3 url.h (x, y)
#> 9   3        url.i
#> 10  4        url.d
#> 11  4        url.e
#> 12  4        url.f


A closer look at str_replace_all(..., \(x) do_something(x)): the pattern "\\(([^)]*)\\)" finds substrings that are enclosed in parentheses:

str_view("url.a, url.b (c,d, foo, bar), url.c", "\\(([^)]*)\\)")
#> [1] │ url.a, url.b <(c,d, foo, bar)>, url.c
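
To make the pattern a little more concrete, here is a minimal sketch that extracts the matches instead of viewing them. The `urls` vector is just a shortened copy of the question's data, and str_extract_all() is used here only for illustration; it is not part of the pipeline above.

```r
library(stringr)

urls <- c("url.a, url.b (c,d), url.c",
          "url.d, url.e, url.f",
          "url.g, url.h (x, y), url.i")

# Pattern pieces:
#   \\(      a literal opening parenthesis
#   [^)]*    any run of characters that are not ")"
#   \\)      a literal closing parenthesis
# (the inner capture group of the original pattern is not needed for extraction)
matches <- str_extract_all(urls, "\\([^)]*\\)")

matches[[1]]  # "(c,d)"
matches[[2]]  # character(0): no parentheses in this string
matches[[3]]  # "(x, y)"
```

Because `[^)]*` stops at the first closing parenthesis, this sketch assumes the parentheses are not nested.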

But instead of a replacement string, we'll use a replacement function that modifies each match by replacing , with the placeholder | (assuming | is not used anywhere in the urls column):

# \(match) ... is shorthand for an anonymous (lambda) function, available since R 4.1
anon_function <- \(match) str_replace_all(match, fixed(","), "|")
anon_function("c,d, foo")
#> [1] "c|d| foo"
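
The placeholder approach relies on | not already occurring in the data. A quick sanity check along these lines could be run first; the `urls` and `placeholder` names are illustrative, and the check itself is an addition rather than part of the original answer.

```r
library(stringr)

urls <- c("url.a, url.b (c,d), url.c",
          "url.d, url.e, url.f",
          "url.g, url.h (x, y), url.i",
          "url.d,url.e, url.f")

placeholder <- "|"

# TRUE would mean the placeholder already appears somewhere in the data,
# so restoring commas afterwards would also rewrite those original characters
placeholder_in_use <- any(str_detect(urls, fixed(placeholder)))

# stop early if a different placeholder is needed
stopifnot(!placeholder_in_use)
```

If the check fails, any other character that is absent from the column can serve as the placeholder.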

Putting those 2 pieces together to eliminate all commas between ( and ):

str_replace_all(my_df$urls, "\\(([^)]*)\\)", \(match) str_replace_all(match, fixed(","), "|"))
#> [1] "url.a, url.b (c|d), url.c"  "url.d, url.e, url.f"       
#> [3] "url.g, url.h (x| y), url.i" "url.d,url.e, url.f"

Created on 2023-11-22 with reprex v2.0.2
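
For comparison, here is a sketch of a different technique that skips the placeholder step: splitting directly on commas that are not inside parentheses, using a negative lookahead. This is an alternative to the approach above, not a restatement of it; it uses base R's strsplit() with perl = TRUE rather than unnest_tokens(), and it assumes the parentheses are balanced and not nested.

```r
urls <- c("url.a, url.b (c,d), url.c",
          "url.d, url.e, url.f",
          "url.g, url.h (x, y), url.i",
          "url.d,url.e, url.f")

# ,\\s*            a comma plus optional trailing whitespace
# (?![^()]*\\))    ...not followed by a ")" that arrives before any "(" or ")",
#                  i.e. the comma is not inside an open pair of parentheses
split_pattern <- ",\\s*(?![^()]*\\))"

parts <- strsplit(urls, split_pattern, perl = TRUE)

parts[[1]]
#> [1] "url.a"       "url.b (c,d)" "url.c"

# Long format, one row per piece, similar in shape to the unnest_tokens() result
out_df <- data.frame(id  = rep(seq_along(parts), lengths(parts)),
                     out = unlist(parts))
```

The placeholder approach is easier to read and debug step by step; the lookahead version is more compact but harder to adapt if the bracket rules change.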