Regex to match URL / URI except when contained in an img tag


With credit to dfowler's excellent Jabbr project, I am borrowing code to embed linked content from user posts. The code is from here and uses a regex to extract URLs for additional processing and embedding.

In my case, I run the user posts through a markdown processor first, before attempting this embed. If the user formats the markdown correctly, the markdown processor (MarkdownDeep) transforms any image markdown into a valid HTML img tag. That works great. However, the embedded content providers then make the image appear twice: it shows up once from the markdown transform, then gets embedded again afterwards.

So, I believe the solution to my problem lies in changing the regex to not match when the found URL is already contained within a valid img tag.

For ease of answering, the regex so far is:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))

I think I want to use negative look-ahead like in this answer to exclude the img, but I'm too poor at regex syntax to implement it myself.

NOTE: I want it to still match images if they just appear in the text. So the plain URL http://www.example.com/sites/default/files/DellComputer.jpg would match, and the same URL in a hyperlink, <a href='http://www.example.com/sites/default/files/DellComputer.jpg'>, would match, but <img src='http://www.example.com/sites/default/files/DellComputer.jpg'> would not.

Thanks for the help. I know some of you have savant-level regex talents; I just never could get the hang of them.


There is 1 answer

Answer by femtoRgon

For the simple approach, just prepend

(?<!img.*)

to the beginning of your regex. It will match as it already does, but will reject a URL if img appears anywhere before it on the same line. This relies on variable-length lookbehind, which the .NET regex engine supports. So, the entire regex:

(?<!img.*)(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))

Again, nothing is changed except for a few characters added at the beginning.
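
To make the effect of that prefix concrete, here is a minimal Python sketch. Python's built-in re module does not accept a variable-length lookbehind such as (?<!img.*), so the sketch does not run that pattern. Instead it emulates the same check by looking for img between the start of the line and the start of each match. SIMPLE_URL is a deliberately simplified stand-in for the full pattern above, and urls_not_after_img is a hypothetical helper name; both are assumptions made for illustration.

```python
import re

# Simplified stand-in for the full URL pattern (assumption: http/https only).
SIMPLE_URL = re.compile(r"\bhttps?://[^\s()<>'\"]+", re.IGNORECASE)


def urls_not_after_img(text):
    """Return URLs that have no 'img' earlier on the same line.

    Emulates the effect of prefixing the pattern with (?<!img.*):
    '.' does not cross newlines, so only the current line is checked.
    """
    results = []
    for match in SIMPLE_URL.finditer(text):
        # Start of the line that contains this match.
        line_start = text.rfind("\n", 0, match.start()) + 1
        if "img" not in text[line_start:match.start()]:
            results.append(match.group(0))
    return results


sample = (
    "See http://www.example.com/a.jpg\n"
    "<a href='http://www.example.com/b.jpg'>link</a>\n"
    "<img src='http://www.example.com/c.jpg'> then http://www.example.com/d.jpg\n"
)
print(urls_not_after_img(sample))
# ['http://www.example.com/a.jpg', 'http://www.example.com/b.jpg']
```

Note the last line of the sample: d.jpg is a plain-text URL, but it is rejected too, because img appears earlier on the same line.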

If you need it to be smarter about where the img is located before the URL on the line, I would probably recommend using a tool other than regex.
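
As one possible reading of "a tool other than regex", the following sketch uses the HTML parser from Python's standard library to decide where a URL sits, and keeps a regex only for spotting URLs inside text and href values. It is an illustration under stated assumptions, not the Jabbr code: SIMPLE_URL is a simplified stand-in for the full pattern, and UrlCollector and urls_outside_img are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Simplified stand-in for the full URL pattern (assumption: http/https only).
SIMPLE_URL = re.compile(r"\bhttps?://[^\s()<>'\"]+", re.IGNORECASE)


class UrlCollector(HTMLParser):
    """Collect URLs from text and href attributes, skipping <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes of <img> are never inspected.
        if tag == "img":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.urls.extend(SIMPLE_URL.findall(value))

    def handle_data(self, data):
        # Plain text between tags, where bare URLs live.
        self.urls.extend(SIMPLE_URL.findall(data))


def urls_outside_img(html):
    parser = UrlCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.urls


IMAGE = "http://www.example.com/sites/default/files/DellComputer.jpg"
print(urls_outside_img(IMAGE))                                      # plain text: matched
print(urls_outside_img("<a href='" + IMAGE + "'>link</a>"))         # hyperlink: matched
print(urls_outside_img("<img src='" + IMAGE + "'> and " + IMAGE))   # only the bare URL
```

Unlike the lookbehind prefix, this keeps a plain-text URL that follows an img tag on the same line, since only the attributes of the img tag itself are skipped.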