I am doing a data cleaning task on a text file full of sentences. After stemming the sentences I would like to get the frequency of the words across the whole file. However, I am running into a problem: when printing the stemmed list, stem_list, I am getting a separate list for every sentence, like so:
[u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend', u'friend', u'paymast', u'us', u'longer', u'memori']
[u'valu', u'friend', u'bought', u'properti', u'actual', u'relev', u'repres', u'actual', u'valu', u'properti']
[u'monster', u'wreck', u'reef', u'cargo', u'vessel', u'week', u'passeng', u'ship', u'least', u'24', u'hour', u'upload', u'com']
I would like to obtain the frequency of all of the words combined, but with the following code I am only obtaining the frequency per sentence:
import nltk

fdist = nltk.FreqDist(stem_list)  # stem_list here holds one sentence's stems
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
This produces the following output (each word;frequency pair prints on its own line, shown here run together): friend;2 paymast;1 longer;1 memori;1 parti;1 us;1 day;1 anyon;1 forgotten;1 think;1 actual;2 properti;2 valu;2 friend;1 repres;1 relev;1 bought;1 week;1 cargo;1 monster;1 hour;1 wreck;1 upload;1 passeng;1 least;1 reef;1 24;1 vessel;1 ship;1 com;1 within;1 area;1 territori;1 custom;1 water;1 3;1
The word 'friend' is counted twice (friend;2 and friend;1) because it occurs in two different sentences. How can I combine the counts across sentences so that it is listed once, as friend;3 in this case?
You could just concatenate everything into one list and then process it the same way you already do.
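A minimal sketch of that, assuming you currently build stem_list separately for each sentence inside a loop (sentences and stem_words below are illustrative stand-ins for your own file iteration and stemming step):

import nltk

all_stems = []
for sentence in sentences:            # however you read your file
    stem_list = stem_words(sentence)  # your existing stemming step
    all_stems.extend(stem_list)       # concatenate into one flat list

# One FreqDist over the whole file, so counts for 'friend'
# from all sentences are combined into a single entry
fdist = nltk.FreqDist(all_stems)
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))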
Otherwise, you could keep your per-sentence code, but instead of printing, accumulate the counts in a dict: each time you see a new word you create the key, and otherwise you add the sentence's count to the existing value.
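A sketch of that approach, again with sentences and stem_words as illustrative placeholders for your own code:

import nltk

total_counts = {}
for sentence in sentences:
    fdist = nltk.FreqDist(stem_words(sentence))  # per-sentence counts, as now
    for word, frequency in fdist.items():
        # create the key on first sight, otherwise add to the running total
        total_counts[word] = total_counts.get(word, 0) + frequency

# Print the 50 most common words across all sentences
top_words = sorted(total_counts.items(), key=lambda item: item[1], reverse=True)
for word, frequency in top_words[:50]:
    print(u'{};{}'.format(word, frequency))

In recent NLTK versions FreqDist is a collections.Counter subclass, so you could also keep a running Counter and call its update method with each per-sentence fdist instead of the inner loop.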