Using regex to parse hashtags from a sentence

I want to extract hashtags from a sentence. For example, if the sentence is

#test1.#test2 #test3 www.google.com/#test4 www.google.com/hello#test5

the hashtags would be

#test1
#test2 
#test3 

but not #test4 or #test5, as they are part of URLs.

I have been trying to write a regex for this. So far I have

/(^|\s)#(\w+)\b/g

https://regex101.com/r/WPeSdE/1

This handles #test1 and #test3 but fails to match #test2.
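
The behaviour described above can be reproduced with a minimal sketch. It assumes a runtime that supports String.prototype.matchAll; the variable names are only illustrative. The pattern requires the start of the string or a whitespace character immediately before the #, and #test2 is preceded by a period, so it is skipped.

```javascript
// The pattern from the question: '#' must follow the start of the string or whitespace.
const original = /(^|\s)#(\w+)\b/g;
const sentence = "#test1.#test2 #test3 www.google.com/#test4 www.google.com/hello#test5";

// Collect group 2 (the tag name without '#') from every match.
const found = [...sentence.matchAll(original)].map((m) => m[2]);

// '#test2' follows '.', which is neither the string start nor whitespace, so it is missed.
console.log(found); // logs test1 and test3 only
```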

Please help.

There are 2 answers

ghostCoder On BEST ANSWER

I needed a very complex regex to support everything I wanted. In the end, for now, I ended up using the hashtag function of the twitter-text library. It handles all the cases I was stuck on.

Wiktor Stribiżew On

Match URLs in one alternative, and match and capture hashtags in the other. Because the URL alternative consumes any # inside a URL, you only need to keep the Group 1 contents:

/\b(?:(?:https?|ftps?):\/\/|www\.)\S+|#(\w+)\b/gi

See the regex demo.

Details:

  • \b(?:(?:https?|ftps?):\/\/|www\.)\S+ - a URL like pattern:
    • \b - word boundary
    • (?:(?:https?|ftps?):\/\/|www\.) - either of:
      • (?:https?|ftps?):\/\/ - http://, or https:// (or same with ftp/ftps)
      • www\. - or www.
    • \S+ - 1 or more chars other than whitespace
  • | - or
  • #(\w+)\b - a hash symbol, then Group 1 capturing one or more word chars (the hashtag name, without the #), followed by a word boundary.

See the JS demo below:

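// First alternative matches URL-like tokens; second captures the hashtag name in Group 1
var rx = /\b(?:(?:https?|ftps?):\/\/|www\.)\S+|#(\w+)\b/gi;
var str = `#test1.#test2 #test3 www.google.com/#test4 www.google.com/hello#test5`;
var m, res = [];
while ((m = rx.exec(str)) !== null) {
   // m[1] is only set when the hashtag alternative matched, not the URL one
   if (m[1]) res.push(m[1]);
}
console.log(res); // test1, test2, test3 (without the leading #)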
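
The demo above collects the tag names without the leading #, while the question lists them with it. As a rough sketch, the same regex can be wrapped in a small helper that puts the # back. The name extractHashtags is hypothetical and not part of any library, and the code assumes a runtime with String.prototype.matchAll.

```javascript
// Hypothetical helper; the name is illustrative and not from any library.
function extractHashtags(text) {
  // Alternative 1 consumes URL-like tokens so any '#' inside them is skipped;
  // alternative 2 captures the word characters after a standalone '#'.
  const rx = /\b(?:(?:https?|ftps?):\/\/|www\.)\S+|#(\w+)\b/gi;
  const tags = [];
  for (const m of text.matchAll(rx)) {
    // m[1] is undefined when the URL alternative matched.
    if (m[1] !== undefined) tags.push("#" + m[1]);
  }
  return tags;
}

const tags = extractHashtags(
  "#test1.#test2 #test3 www.google.com/#test4 www.google.com/hello#test5"
);
console.log(tags); // #test1, #test2, #test3
```

Note that this only recognises URLs that start with a scheme or with www., so a bare address such as google.com/#x would still yield #x as a hashtag.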