Using gsub when the string has multiple backslashes and/or special characters

I have a string from which I want to extract the location, including the city; in this example, that would be 'Elland Rd' and 'Leeds'.

mystring = "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""
city = gsub(".* club_info=\"(.*),(.+)\.*", "\\2", mystring) # can't get this part to work

My idea for getting the city is to match everything after the comma up to the backslash, but I can't seem to get the regex to recognize the backslash.

Answer by r2evans (best answer)

I prefer strcapture for extracting multiple patterns over repeated gsub calls. How about this?

strcapture('.*club_info="([^"]+),([^"]+)".(.*)', mystring, list(x1="", x2="", x3=""))
#          x1     x2             x3
# 1 Elland Rd  Leeds Pitch="100x50"

(Including the Pitch= part was not required, but I thought you might use it, since it appears you're doing reductive gsub-ing.)
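A side note on the original attempt: the `\"` sequences in the R source are escapes for a literal double quote, so the stored string contains no backslashes to match. The following is a minimal sketch, assuming only the city is needed; the `sub()` pattern is a variant written for this one example string, not a general solution.

```r
mystring <- "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""

# writeLines() prints the string as stored: double quotes, no backslashes
writeLines(mystring)

# a fixed (non-regex) search for a single literal backslash finds nothing
has_backslash <- grepl("\\", mystring, fixed = TRUE)
has_backslash
# [1] FALSE

# capture everything between the comma and the closing double quote;
# the " *" after the comma consumes the optional leading space
city <- sub('.*club_info="[^"]*, *([^"]+)".*', "\\1", mystring)
city
# [1] "Leeds"
```

Anchoring on the closing double quote, rather than on a backslash, is what lets the match stop at the end of the city name.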

FYI, x2 here has a leading space. It could be handled in the regex, but if you are not 100% sure the space is present in all cases, it might be simpler to add trimws(.), as in

strcapture('.*club_info="([^"]+),([^"]+)".(.*)', mystring, list(x1="", x2="", x3="")) |>
  lapply(trimws)
# $x1
# [1] "Elland Rd"
# $x2
# [1] "Leeds"
# $x3
# [1] "Pitch=\"100x50\""

In this case the result drops from a data.frame to a list, but I'm not certain you need a frame; a named list should suffice. If you really want it as a frame (and many of my use cases do prefer that), just add |> as.data.frame() to the pipe.
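For completeness, here is a sketch of the full pipe with the conversion back to a frame appended. It assumes R 4.1 or later for the native pipe; `res` is just an illustrative name.

```r
mystring <- "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""

res <- strcapture('.*club_info="([^"]+),([^"]+)".(.*)', mystring,
                  list(x1 = "", x2 = "", x3 = "")) |>
  lapply(trimws) |>     # strips the leading space from x2, returns a list
  as.data.frame()       # turns the named list back into a data.frame

class(res)
# [1] "data.frame"
res$x2
# [1] "Leeds"
```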

Regex walk-through.

.*club_info="([^"]+),([^"]+)".(.*)
^^                                  leading text, discarded
  ^^^^^^^^^^^                       literal text
              [^"]+   [^"]+         one or more "any character except dquote"
             (     ),(     )        first two capture-groups (x1, x2)
                            ^^      closing dquote, then any one character
                              ^^^^  third capture-group (x3), the trailing text
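To see what each capture group picks up on its own, one option is base R's regexec() with regmatches(). This is only a sketch for inspecting the pattern; `pattern` and `m` are illustrative names.

```r
mystring <- "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""
pattern  <- '.*club_info="([^"]+),([^"]+)".(.*)'

# regexec() locates the full match plus each capture group;
# regmatches() extracts the corresponding substrings
m <- regmatches(mystring, regexec(pattern, mystring))[[1]]

m[1]    # the full match (here, the whole input string)
m[-1]   # the three capture groups, in order; note the leading space in the second
```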

Also, since we know the pattern contains double quotes and no single quotes, I chose single quotes as the outer string delimiters. If we have both, or if you want to avoid double backslashes and the like, we can use R's "raw strings" (R 4.0.0 and later) instead,

r"{.*club_info="([^"]+),([^"]+)".(.*)}"

where the r"{ and }" are the open/close delimiters; I chose braces here since parens are visually confusing with the regex-parens, though brackets r"[/]" and parens r"(/)" also work.
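As a quick check, the three ways of quoting the pattern should produce the same string. This sketch assumes R 4.0.0 or later for the raw-string form; the variable names are illustrative.

```r
# the same pattern written three ways
p_single <- '.*club_info="([^"]+),([^"]+)".(.*)'        # single-quoted
p_double <- ".*club_info=\"([^\"]+),([^\"]+)\".(.*)"    # double-quoted, escaped
p_raw    <- r"{.*club_info="([^"]+),([^"]+)".(.*)}"     # raw string

identical(p_single, p_double)
# [1] TRUE
identical(p_single, p_raw)
# [1] TRUE
```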