detect emoticon in a sentence using regex python

3.3k views Asked by At

Here is the list of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons I want to form a regex which checks if any of these emoticons exist in the sentence. For example, "hey there I am good :)" or "I am angry and sad :(" but there are a lot of emoticons in the list on wikipedia so wondering how I can achieve this task. I am new to regex. & python.

>>> s = "hey there I am good :)"
>>> import re
>>> q = re.findall(":",s)
>>> q
[':']
1

There are 1 answers

0
tobias_k On BEST ANSWER

I see two approaches to your problem:

  1. Either, you can create a regular expression for a "generic smiley" and try to match as many as possible without making it overly complicated and insane. For example, you could say that each smiley has some sort of eyes, a nose (optional), and a mouth.
  2. Or, if you want to match each and every smiley from that list (and none else) you can just take those smileys, escape any regular-expression specific special characters, and build a huge disjunction from those.

Here is some code that should get you started for both approaches:

# approach 1: pattern for "generic smiley"
eyes, noses, mouths = r":;8BX=", r"-~'^", r")(/\|DP"
pattern1 = "[%s][%s]?[%s]" % tuple(map(re.escape, [eyes, noses, mouths]))

# approach 2: disjunction of a list of smileys
smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^) 
             :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D""".split()
pattern2 = "|".join(map(re.escape, smileys))

text = "bla bla bla :-/ more text 8^P and another smiley =-D even more text"
print re.findall(pattern1, text)

Both approaches have pros, cons, and some general limitations. You will always have false positives, like in a mathematical term like 18^P. It might help to put spaces around the expression, but then you can't match smileys followed by punctuation. The first approach is more powerful and catches smileys the second approach won't match, but only as long as they follow a certain schema. You could use the same approach for "eastern" smileys, but it won't work for strictly symmetric ones, like =^_^=, as this is not a regular language. The second approach, on the other hand, is easier to extend with new smileys, as you just have to add them to the list.