Without getting into a Bayesian-level content classification project, I'm trying to make a very simple profanity filter for Twitter accounts.
In essence, I just join all of a user's tweets into one large text blob and run the content against my filter, which works like this:
badwords = ['bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc']
s = 'Get free xxx etc'
score = 0
for b in badwords:
    if b in s:
        score = score + 1
I have a list of 3k bad words (what a perverted world we live in!) and ideally I'd like to create a score based not only on word occurrence, but on how many times each word occurs; so if a word occurs twice, the score would increment twice.
The score generator above is extremely simple but re-evaluates the string thousands of times, plus it does not increment the way I'd like.
How can this be adjusted for performance and accuracy?
So len(badwords) == 3000; therefore, with tweet_words = s.split(), it will practically always be that len(tweet_words) < len(badwords), hence looping over badwords for every check is really inefficient.
First thing to do: make badwords a frozenset. That way, it's much faster to look for an occurrence of something in it.

Then, search for words in badwords, not the other way around; then, be a bit more functional! Both steps are sketched below.
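A minimal sketch of both steps, reusing the badwords list and the tweet blob s from the question:

badwords = frozenset(badwords)   # frozenset gives O(1) average-case membership tests

score = 0
for word in s.split():           # iterate over the tweet's words (typically dozens)
    if word in badwords:         # instead of looping over the 3000 bad words
        score += 1               # increments once per occurrence, as wanted

The same count, written more functionally as a generator expression (each True sums as 1):

score = sum(word in badwords for word in s.split())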
which will be faster than the full loops, because Python needs to construct and destruct fewer temporary contexts (that's technically a bit misleading, but you save a lot of CPython PyObject creations).
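For example, on a hypothetical input adapted from the question's sample string, the per-occurrence counting behaves as asked:

badwords = frozenset(['bad', 'worse', 'momwouldbeangry', 'xxx', 'etc'])
s = 'Get free xxx etc etc'
print(sum(word in badwords for word in s.split()))   # prints 3: 'xxx' once, 'etc' twice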