Find pattern in URL with stringr and regex

Question

984 views Asked by rkuebler At 20 December 2016 at 22:58

I have a data frame df with some URLs. Each URL contains subcategories between the slashes, which I want to extract with stringr's str_extract.

My data looks like

Text         URL
Hello        www.facebook.com/group1/bla/exy/1234
Test         www.facebook.com/group2/fssas/eda/1234
Text         www.facebook.com/group-sdja/sdsds/adeds/23234
Texter       www.facebook.com/blablabla/sdksds/sdsad

I now want to extract everything between .com/ and the next /

I tried suburlpattern <- "^.com//{1,20}//$" and df$categories <- str_extract(df$URL, suburlpattern)

But I only end up with NA in df$categories

Any idea what I am doing wrong here? Is it my regex code?

Any help is highly appreciated! Many thanks beforehand.

There are 3 answers

Wiktor Stribiżew On 20 December 2016 at 23:43

str_extract returns the whole match, so if you want to use it, the regex must match only the value you need. A (?<=[.]com/) lookbehind does that:

(?<=[.]com/)[^/]+

Details:

(?<=[.]com/) - a positive lookbehind: the current location must be preceded by the .com/ substring (which is not included in the match)
[^/]+ - matches 1 or more characters other than /

R demo:

> URL = c("www.facebook.com/group1/bla/exy/1234", "www.facebook.com/group2/fssas/eda/1234","www.facebook.com/group-sdja/sdsds/adeds/23234", "www.facebook.com/blablabla/sdksds/sdsad")
> df <- data.frame(URL)
> library(stringr)
> res <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
> res
[1] "group1"     "group2"     "group-sdja" "blablabla"
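
To see why the pattern from the question returns NA, here is a minimal sketch comparing it with the lookbehind on a single URL. The variable names url, original and fixed are only for illustration and are not from the question.

```r
library(stringr)

url <- "www.facebook.com/group1/bla/exy/1234"

# Pattern from the question. "^" anchors the match at the start of the string,
# the unescaped "." matches any single character, "/{1,20}" means 1 to 20
# literal slashes (not "up to 20 characters"), and "$" anchors at the end.
# A URL starting with "www" can never satisfy that, so the result is NA.
original <- str_extract(url, "^.com//{1,20}//$")

# Lookbehind pattern: start right after ".com/" and take everything up to,
# but not including, the next slash.
fixed <- str_extract(url, "(?<=[.]com/)[^/]+")

print(original)  # NA
print(fixed)     # "group1"
```

So the NA is not a stringr problem: the regex is anchored to the whole string and never describes the characters between the slashes.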

Matt S On 20 December 2016 at 23:45

This works:

library(stringr)
data <- c("www.facebook.com/group1/bla/exy/1234", 
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

suburlpattern <- "/(.*?)/" 
categories <- str_extract(data, suburlpattern)
str_sub(categories, start = 2, end = -2)

Results:

[1] "group1" "group2" "group-sdja" "blablabla"

This only gets you what's between the first and second slashes, but that seems to be what you want.

manotheshark On 20 December 2016 at 23:32 BEST ANSWER

This returns everything between the first pair of forward slashes:

library(stringr)
str_match("www.facebook.com/blablabla/sdksds/sdsad", "^[^/]+/(.+?)/")[2]

[1] "blablabla"
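
The call above handles a single string. For the whole column, str_match returns a matrix with one row per URL (full match in column 1, first capture group in column 2), so indexing with [2] would pick the wrong cell. A minimal sketch, assuming the data frame from the question is rebuilt as below, takes the second column instead:

```r
library(stringr)

# Sample data from the question, rebuilt here so the example is self-contained
df <- data.frame(
  URL = c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad"),
  stringsAsFactors = FALSE
)

# Column 2 of the str_match result holds the first capture group for each row
df$categories <- str_match(df$URL, "^[^/]+/(.+?)/")[, 2]

print(df$categories)
# [1] "group1"     "group2"     "group-sdja" "blablabla"
```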
