Regex skips the delimiter if there is a / before it


I'm using:

\b(?<=,)http.*?(?=,)\b

The issue comes up when I try to use it on:

Something,https://stackoverflow.com/questions/ask/,Anotherthing

While it works on:

Something,https://stackoverflow.com/questions/ask,Anotherthing

So if a / is placed before the delimiter, the expression seems to skip it. Please help me; I've been stuck on this for an entire day.

I tried changing it to: \b(?<=,)http.*[\/],(?=,)\b

but that doesn't work, and neither does \b(?<=,)http.*\/,(?=,)\b

I want it to select any link up to the , delimiter, even if the comma is preceded by a /. Thanks for any help. I'm a noob, sorry if it's easy.


There are 2 answers

0
det.antoine On

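Assuming your URLs do not contain commas (they are technically allowed in URLs, but rare in practice), you can restrict the characters of the URL to anything but a comma with [^,]:

\b(?<=,)http[^,]*(?=,)

Note that the trailing \b has to go. \b only matches between a word character and a non-word character, and both / and , are non-word characters, so there is no word boundary between them. That is why your pattern fails when the URL ends with a /.

You also no longer need the ? after the *. In .*? it makes the quantifier lazy so that it stops at the first comma; [^,]* cannot run past a comma anyway.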
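The behaviour can be checked quickly. The following is a minimal sketch using Python's re module (an assumption, since the question does not say which engine is in use; the variable and helper names are made up for illustration). It runs the pattern from the question against both sample lines, then a variant that uses [^,]* and has no trailing \b.

```python
import re

with_slash = "Something,https://stackoverflow.com/questions/ask/,Anotherthing"
without_slash = "Something,https://stackoverflow.com/questions/ask,Anotherthing"

# Pattern from the question. The trailing \b requires a word character on
# exactly one side, but "/" and "," are both non-word characters.
original = re.compile(r"\b(?<=,)http.*?(?=,)\b")

# Variant with a negated character class and no trailing \b.
no_trailing_boundary = re.compile(r"\b(?<=,)http[^,]*(?=,)")


def first_match(pattern, text):
    """Return the first match as a string, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_match(original, without_slash))            # matches the URL
print(first_match(original, with_slash))               # None
print(first_match(no_trailing_boundary, with_slash))   # URL including the final /
```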

1
Patrick Janser On

I would personally add :// after http, plus an optional s for HTTPS URLs. This way, you only match URLs and not something like "httpd is Apache daemon".

Instead of using .*?, you can match any character that isn't a comma. Note, however, that commas are allowed in URLs at specific places (typically not in the domain, but in the path; see section 3.3 of RFC 2396), so \bhttps?://[^,]+ is not strictly correct, but it will work in most cases.

A) No comma in the URL

The regular expression would become: (?<=,)\bhttps?:\/\/[^,]+(?=,)

See it in action here: https://regex101.com/r/yKHDiC/1
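As a rough illustration of option A, here is a sketch in Python's re module (the engine is an assumption, and the sample line and the example.com URL are invented for the demo). It shows that the :// requirement keeps "httpd is Apache daemon" out of the results.

```python
import re

# Option A: the URL must sit between two commas and contain no comma itself.
URL_BETWEEN_COMMAS = re.compile(r"(?<=,)\bhttps?://[^,]+(?=,)")

# Hypothetical input: two URLs and one field that merely starts with "http".
line = (
    "Something,https://stackoverflow.com/questions/ask/,"
    "httpd is Apache daemon,http://example.com,Anotherthing"
)

urls = URL_BETWEEN_COMMAS.findall(line)
print(urls)
```

Because the commas are only checked by lookarounds and never consumed, two URLs that share a comma can both be found.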

B) No comma in the URL, but spaces before allowed

But it's a bit of a shame that we cannot use (?<=,\s*) as a positive lookbehind, which would also let us match URLs when some spaces follow the comma. Adding \s* makes the lookbehind variable-length, which most regex engines do not allow.

But depending on your use case, we could replace the lookarounds with capturing groups, which are more flexible because they can have a variable length:

(,\s*)\b(https?:\/\/[^,]+)(\s*,)

Version 2 with groups: https://regex101.com/r/yKHDiC/2

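Your URL would be in group number 2. Since [^,]+ is greedy, any spaces before the closing comma also end up in group 2, so trim the result if needed.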
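A small sketch of the group-based version, again assuming Python's re module (the spaced-out sample line is invented for illustration):

```python
import re

# Option B: capturing groups instead of lookarounds, so \s* is allowed.
pattern = re.compile(r"(,\s*)\b(https?://[^,]+)(\s*,)")

# Hypothetical input with spaces on both sides of the URL.
line = "Something,  https://stackoverflow.com/questions/ask/ ,Anotherthing"

m = pattern.search(line)
raw_url = m.group(2)       # [^,]+ is greedy, so this keeps the trailing space
url = raw_url.rstrip()     # trim it to get the bare URL

print(repr(raw_url))
print(repr(url))
```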

C) Comma allowed in the URL, spaces before and after

Using groups, like before, we can also accept commas in the URL. The idea is to match the whole line: some text without commas, followed by a comma, then the URL, and finally a comma followed by characters that are not commas, up to the end of the line.

This would become: ^([^,]+,\s*)\b(https?:\/\/.*?)(\s*,[^,]+)$

Version 3: https://regex101.com/r/yKHDiC/3

But this can only work if you always have 3 items per line.
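To make that limitation concrete, here is a sketch assuming Python's re module (both sample lines and the example.com URLs are invented). The first line has exactly three items and a URL containing commas; the second has four items, and the extra field gets absorbed into the URL group.

```python
import re

# Option C: anchor on the first and last comma-free fields of the line.
pattern = re.compile(r"^([^,]+,\s*)\b(https?://.*?)(\s*,[^,]+)$")

# Hypothetical line with three items; the URL itself contains commas.
three_items = "Something, https://example.com/path?ids=1,2,3 ,Anotherthing"
# Hypothetical line with four items.
four_items = "a,https://example.com,b,c"

url_three = pattern.match(three_items).group(2)
url_four = pattern.match(four_items).group(2)

print(url_three)  # the full URL, commas included
print(url_four)   # the URL plus the unrelated third field
```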

So you might have to adapt and choose the best regex depending on your real use case.