Find pattern in URL with stringr and regex


I have a data frame df with some URLs. Each URL contains subcategories between the slashes, and I want to extract them with stringr's str_extract.

My data looks like

Text         URL
Hello        www.facebook.com/group1/bla/exy/1234
Test         www.facebook.com/group2/fssas/eda/1234
Text         www.facebook.com/group-sdja/sdsds/adeds/23234
Texter       www.facebook.com/blablabla/sdksds/sdsad

I now want to extract everything between .com/ and the next /

I tried suburlpattern <- "^.com//{1,20}//$" and df$categories <- str_extract(df$URL, suburlpattern)

But I only end up with NA in df$categories

Any idea what I am doing wrong here? Is it my regex code?

Any help is highly appreciated! Many thanks beforehand.


There are 3 answers

1
manotheshark On BEST ANSWER

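This returns everything between the first pair of forward slashes. str_match gives back the full match followed by the capture group, so [2] picks out the captured text: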

library(stringr)
str_match("www.facebook.com/blablabla/sdksds/sdsad", "^[^/]+/(.+?)/")[2]

[1] "blablabla"
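Applied to a whole column rather than a single string, str_match returns a character matrix with one row per input, so indexing with [2] would no longer pick the capture group. A minimal sketch of the vectorized form, assuming the sample URLs from the question (the names urls, m and categories are only illustrative):

```r
library(stringr)

# Same sample URLs as in the question
urls <- c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

# str_match() returns a character matrix: column 1 holds the full match,
# column 2 holds the first capture group
m <- str_match(urls, "^[^/]+/(.+?)/")

# Take the whole second column instead of the single element [2]
categories <- m[, 2]
categories
```

With the question's data frame, the same call on df$URL could be assigned to df$categories.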
0
Wiktor Stribiżew On

If you want to use str_extract, you need a regex that will get the value you need into the whole match, and you will need a (?<=[.]com/) lookbehind:

(?<=[.]com/)[^/]+

See the regex demo.

Details:

  • (?<=[.]com/) - the current location must be preceded by the .com/ substring
  • [^/]+ - matches 1 or more characters other than /.

R demo:

> URL = c("www.facebook.com/group1/bla/exy/1234", "www.facebook.com/group2/fssas/eda/1234","www.facebook.com/group-sdja/sdsds/adeds/23234", "www.facebook.com/blablabla/sdksds/sdsad")
> df <- data.frame(URL)
> library(stringr)
> res <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
> res
[1] "group1"     "group2"     "group-sdja" "blablabla"
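For comparison, here is a small sketch of why the pattern from the question yields NA, run side by side with the lookbehind pattern. The variable names are only illustrative; the comments describe how the regex engine reads each piece of the original pattern.

```r
library(stringr)

url <- "www.facebook.com/group1/bla/exy/1234"

# Pattern from the question:
#   "^"       anchors the match at the very start of the string,
#   "."       is unescaped, so it matches any single character,
#   "/{1,20}" repeats the slash itself, not the characters between slashes,
#   "$"       requires the string to end right after the slashes.
# No URL in the sample data has that shape, so nothing matches.
question_result <- str_extract(url, "^.com//{1,20}//$")
question_result

# Lookbehind pattern: match the non-slash characters right after ".com/"
answer_result <- str_extract(url, "(?<=[.]com/)[^/]+")
answer_result
```

The first call returns NA and the second returns "group1".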
0
Matt S On

This works

library(stringr)
data <- c("www.facebook.com/group1/bla/exy/1234", 
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

suburlpattern <- "/(.*?)/" 
categories <- str_extract(data, suburlpattern)
str_sub(categories, start = 2, end = -2)

Results:

[1] "group1" "group2" "group-sdja" "blablabla"

This only gets you what's between the first and second slashes, but that seems to be what you want.
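One caveat worth checking before relying on the first pair of slashes: if a URL carries a protocol prefix, the first two slashes are the ones in "//". A short sketch, assuming a hypothetical URL with an https:// prefix that is not part of the question's data:

```r
library(stringr)

# Hypothetical variant of the sample data with a protocol prefix
url_with_scheme <- "https://www.facebook.com/group1/bla/exy/1234"

# The first pair of slashes is now the "//" after "https:",
# so the lazy group between them is empty
first_pair <- str_extract(url_with_scheme, "/(.*?)/")
str_sub(first_pair, start = 2, end = -2)

# Anchoring on ".com/" with a lookbehind is unaffected by the prefix
category <- str_extract(url_with_scheme, "(?<=[.]com/)[^/]+")
category
```

Here the slash-based pattern yields an empty string, while the .com/ lookbehind still returns "group1". For the URLs shown in the question, which have no protocol prefix, both approaches give the same result.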