A fast and efficient, not-so complex word content filter

Without getting into a Bayesian-level content classification project, I'm trying to make a very simple profanity filter for Twitter accounts.

In essence, I join all of a user's tweets into one large text blob and run the content against my filter, which works like this:

badwords = ['bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc']

s = 'Get free xxx etc'

score = 0

for b in badwords:
    if b in s:
        score = score+1

I have a list of 3k bad words (what a perverted world we live in!), and ideally I'd like to create a score based not only on whether a word occurs, but also on how many times each word occurs. So if a word occurs twice, the score would increment twice.

The score generator above is extremely simple, but it re-evaluates the string thousands of times, and it does not increment the way I'd like.

How can this be adjusted for performance and accuracy?

There are 3 answers

Marcus Müller On BEST ANSWER

So len(badwords) == 3000. With tweet_words = s.split(), we have len(tweet_words) < len(badwords); hence

for b in badwords:
    if b in s:
        score = score+1

is really inefficient.

First thing to do: make badwords a frozenset. That way, it's much faster to look for an occurrence of something in it.

Then, search for words in badwords, not the other way around:

for t_word in tweet_words:
    if t_word in badwords:
        score = score+1

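Then, be a bit more functional!

# Score a single word: 1 if it is a bad word of at least 3 characters, else 0.
score_function = lambda word: 0 if len(word) < 3 or (word not in badwords) else 1
# Score a whole tweet by summing the scores of its lowercased words.
score = lambda tweet: sum(score_function(word.lower()) for word in tweet.split())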

This will be faster than the explicit loops, because Python needs to construct and destroy fewer temporary contexts (that's technically a bit misleading, but you save a lot of CPython PyObject creations).
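To make the difference between the two approaches concrete, here is a minimal sketch that runs both on the same input. The sample tweet and the shortened word list are made up for illustration; only the two scoring strategies come from the discussion above.

```python
# Hypothetical sample data, chosen only to illustrate the difference.
badwords = frozenset(['bad', 'worse', 'xxx'])
tweet = 'Get free xxx xxx xxx at the badminton club'

# Approach from the question: one substring test per bad word.
substring_score = 0
for b in badwords:
    if b in tweet:
        substring_score += 1

# Approach from this answer: one set lookup per word in the tweet.
word_score = sum(1 for word in tweet.lower().split() if word in badwords)

print(substring_score)  # 2: 'xxx' counted once, 'bad' found inside 'badminton'
print(word_score)       # 3: each 'xxx' counted, 'badminton' ignored
```

The set-based version counts repeated words and avoids false hits inside longer words, but it only matches exact tokens, so punctuation attached to a word would still need to be stripped first.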

Padraic Cunningham On

If bad words only need to match whole words (not substrings) and you want a count for each word, you could use a dict. You would also need to lowercase the words in your user's tweets and strip any punctuation from them:

from string import punctuation
badwords = dict.fromkeys(('bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc'),0)

s = 'Get free xxx! etc!!'

for word in s.split():
    word = word.lower().strip(punctuation)
    if word in badwords:
        badwords[word] += 1


print(badwords)
print(sum(badwords.values()))

Output:

{'momwouldbeangry': 0, 'xxx': 1, 'etc': 1, 'bad': 0, 'thousandsofperversesayings': 0, 'worse': 0}
2

If you don't care which words appear, just the total count:

from string import punctuation
badwords = {'bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc'}

s = 'Get free xxx! etc!!'

print(sum(word.lower().strip(punctuation) in badwords for word in s.split()))

Output:

2
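For comparison, whole-word matching can also be sketched with a single compiled regular expression that scans the text once. This is a different technique from the dict and set lookups above, not a replacement for them. The function name regex_score and the shortened word list are hypothetical, and the sketch assumes every bad word starts and ends with a letter or digit so that \b word boundaries behave as expected.

```python
import re

# Hypothetical sample list; a real list would be loaded from elsewhere.
badwords = ['bad', 'worse', 'xxx', 'etc']

# One alternation of all bad words, escaped and anchored at word boundaries.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in badwords) + r')\b',
    re.IGNORECASE,
)

def regex_score(text):
    """Count every whole-word occurrence of any bad word in text."""
    return len(pattern.findall(text))

print(regex_score('Get free XXX! etc!! xxx'))  # 3
print(regex_score('badminton'))                # 0
```

Whether this beats the set lookup with a 3k-word list depends on the data and the regex engine, so it is worth timing both before choosing one.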
Roland Smith On

Try using collections.Counter:

In [1]: text = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum"""

In [2]: badwords = ['in', 'ex']

In [3]: from collections import Counter

In [9]: words = text.lower().split()

In [10]: c = Counter(words)

In [11]: c
Out[11]: Counter({'ut': 3, 'in': 3, 'dolore': 2, 'dolor': 2, 'adipiscing': 1, 'est': 1, 'exercitation': 1, 'aute': 1, 'proident,': 1, 'elit,': 1, 'irure': 1, 'consequat.': 1, 'minim': 1, 'pariatur.': 1, 'nostrud': 1, 'laboris': 1, 'occaecat': 1, 'lorem': 1, 'esse': 1, 'quis': 1, 'anim': 1, 'amet,': 1, 'ipsum': 1, 'laborum': 1, 'sunt': 1, 'qui': 1, 'incididunt': 1, 'culpa': 1, 'consectetur': 1, 'aliquip': 1, 'duis': 1, 'cillum': 1, 'excepteur': 1, 'cupidatat': 1, 'labore': 1, 'magna': 1, 'do': 1, 'fugiat': 1, 'reprehenderit': 1, 'ullamco': 1, 'ad': 1, 'commodo': 1, 'tempor': 1, 'non': 1, 'et': 1, 'ex': 1, 'deserunt': 1, 'sit': 1, 'eu': 1, 'voluptate': 1, 'mollit': 1, 'eiusmod': 1, 'aliqua.': 1, 'nulla': 1, 'sed': 1, 'sint': 1, 'nisi': 1, 'enim': 1, 'veniam,': 1, 'velit': 1, 'id': 1, 'officia': 1, 'ea': 1})

In [12]: scores = [v for k, v in c.items() if k in badwords]

In [13]: scores
Out[13]: [1, 3]

In [14]: sum(scores)
Out[14]: 4
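Note that the session above leaves punctuation attached to some words ('proident,', 'aliqua.'), so a bad word followed by a comma would be missed. A minimal sketch that combines Counter with punctuation stripping could look like this; the sample text is shortened and made up for illustration.

```python
from collections import Counter
from string import punctuation

# Hypothetical sample data for illustration.
badwords = {'in', 'ex'}
text = "Duis aute irure dolor in reprehenderit in voluptate velit, ex ea commodo. In culpa!"

# Lowercase, split on whitespace, and strip leading/trailing punctuation.
words = (w.strip(punctuation) for w in text.lower().split())
counts = Counter(words)

# A Counter returns 0 for missing keys, so every bad word can be looked up directly.
per_word = {w: counts[w] for w in badwords}
total = sum(per_word.values())

print(per_word)
print(total)  # 4
```

Looking up each bad word in the Counter keeps the per-word counts available, which helps if some words should later weigh more than others.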