Finding frequency of words after stemming in Python


I am doing a data-cleaning task on a text file full of sentences. After stemming these sentences I would like to get the frequency of the words in my stemmed list. However, I am running into a problem: when printing the stemmed list, stem_list, I get a separate list for every sentence, like so:

[u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend', u'friend', u'paymast', u'us', u'longer', u'memori']

[u'valu', u'friend', u'bought', u'properti', u'actual', u'relev', u'repres', u'actual', u'valu', u'properti']

[u'monster', u'wreck', u'reef', u'cargo', u'vessel', u'week', u'passeng', u'ship', u'least', u'24', u'hour', u'upload', u'com']

I would like to obtain the frequency of all of the words across the whole file, but with the following code I only get the frequency per sentence:

    fdist = nltk.FreqDist(stem_list)
    for word, frequency in fdist.most_common(50):
        print(u'{};{}'.format(word, frequency))

This is producing the following output: friend;2 paymast;1 longer;1 memori;1 parti;1 us;1 day;1 anyon;1 forgotten;1 think;1 actual;2 properti;2 valu;2 friend;1 repres;1 relev;1 bought;1 week;1 cargo;1 monster;1 hour;1 wreck;1 upload;1 passeng;1 least;1 reef;1 24;1 vessel;1 ship;1 com;1 within;1 area;1 territori;1 custom;1 water;1 3;1

The word 'friend' is being counted twice since it is in two different sentences. How would I be able to make it count friend once and display friend;3 in this case?


There are 3 answers

iFlo On

You could just concatenate everything into one list:

stem_list = [inner for outer in stem_list for inner in outer]

and process it the same way you do now.
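
Put together with the code from the question, that could look something like the sketch below (assuming stem_list starts out as the nested list of stemmed sentences shown in the question):

import nltk

# Flatten the per-sentence lists into one list of words, then count as before.
stem_list = [inner for outer in stem_list for inner in outer]

fdist = nltk.FreqDist(stem_list)
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))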

Otherwise, you could keep the same code, but instead of printing, create a dict and populate it with the values you get. Each time you meet a new word you create the key; otherwise you add the frequency to the existing value.

all_words_count = dict()
for word, frequency in fdist.most_common(50):
    if word in all_words_count:  # word already seen in an earlier sentence
        all_words_count[word] += frequency
    else:  # word not seen yet
        all_words_count[word] = frequency

for word in all_words_count:
    print(u'{};{}'.format(word, all_words_count[word]))
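
Note that this only gives totals across all sentences if the accumulation runs inside the loop that builds one fdist per sentence. A sketch of the whole thing, assuming stem_list is the nested list of stemmed sentences and sentence_stems is just a loop-variable name chosen here:

import nltk

all_words_count = dict()

for sentence_stems in stem_list:  # one list of stemmed words per sentence
    fdist = nltk.FreqDist(sentence_stems)
    for word, frequency in fdist.most_common(50):  # note: caps at 50 distinct words per sentence
        if word in all_words_count:  # seen in an earlier sentence
            all_words_count[word] += frequency
        else:  # first occurrence overall
            all_words_count[word] = frequency

for word in all_words_count:
    print(u'{};{}'.format(word, all_words_count[word]))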
Michael Weber On

I think the easiest way is to combine the lists before passing them to the function.

allwords = [inner for outer in stem_list for inner in outer]

fdist = nltk.FreqDist(allwords)
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))

or shorter:

fdist = nltk.FreqDist([inner for outer in stem_list for inner in outer])
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))

I think your input looks like:

stem_list = [[u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend', u'friend', u'paymast', u'us', u'longer', u'memori'],
             [u'valu', u'friend', u'bought', u'properti', u'actual', u'relev', u'repres', u'actual', u'valu', u'properti'],
             [u'monster', u'wreck', u'reef', u'cargo', u'vessel', u'week', u'passeng', u'ship', u'least', u'24', u'hour', u'upload', u'com'],
             [.....], etc for the other sentences ]

so you have a nested list: the outer list holds the sentences and each inner list holds the words of one sentence. With allwords = [inner for outer in stem_list for inner in outer] you run through the sentences and combine them into one flat list of words.
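
As a small made-up illustration of what the comprehension does:

>>> stem_list = [[u'friend', u'day', u'friend'], [u'valu', u'friend']]
>>> [inner for outer in stem_list for inner in outer]
[u'friend', u'day', u'friend', u'valu', u'friend']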

trincot On

You could flatten your 2D list first with itertools.chain.from_iterable:

from itertools import chain

fdist = nltk.FreqDist(chain.from_iterable(stem_list))
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
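
As a quick check against the first two sample sentences from the question (FreqDist behaves like a collections.Counter, so it can be indexed by word; the expected count for friend here is 3):

from itertools import chain
import nltk

stem_list = [
    [u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend',
     u'friend', u'paymast', u'us', u'longer', u'memori'],
    [u'valu', u'friend', u'bought', u'properti', u'actual', u'relev',
     u'repres', u'actual', u'valu', u'properti'],
]

fdist = nltk.FreqDist(chain.from_iterable(stem_list))
print(fdist[u'friend'])  # 3 -- 'friend' counted across both sentences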