How to omit the less frequent words from a dictionary in Python?


I have a dictionary of word counts and want to drop the words whose count is 1. How can I do that? I also want to extract a bigram model from the words that remain. How can I do that?

import codecs
file=codecs.open("Pezeshki339.txt",'r','utf8')
txt = file.read()
txt = txt[1:]

token=txt.split()

count={}
for word in token:
    if word not in count:
      count[word]=1
    else:
      count[word]+=1
for k,v in count.items():
    print(k,v)

I edited my code as shown below, but I still have a question: how can I create the bigram matrix and smooth it with the add-one method? I would appreciate any suggestions that fit my code.

import nltk
from collections import Counter
import codecs
with codecs.open("Pezeshki339.txt",'r','utf8') as file:
    token=[]
    for line in file:
        token.extend(line.split())  # collect tokens from every line, not only the last one

spl = 80*len(token)/100
train = token[:int(spl)]
test = token[int(spl):]
print(len(test))
print(len(train))
cn=Counter(train)
known_words=([word for word,v in cn.items() if v>1])# removes the rare words and puts them in a list
print(known_words)
print(len(known_words))
bigram=nltk.bigrams(known_words)
frequency=nltk.FreqDist(bigram)
for f in frequency:
     print(f,frequency[f])

There are 3 answers

Padraic Cunningham (best answer)

Use a Counter to count the words, then filter its .items(), dropping the keys whose value is 1:

from collections import Counter

import codecs
with codecs.open("Pezeshki339.txt",'r','utf8') as f:

    cn = Counter(word for line in f for word in line.split())

    print(dict((word,v )for word,v in cn.items() if v > 1 ))

If you just want the words, use a list comprehension:

print([word for word,v in cn.items() if v > 1 ])

You don't need to call read(); you can split each line as you go. If you also want to remove punctuation, strip it from each word:

from string import punctuation

# goes inside the with block, where f is the open file
cn = Counter(word.strip(punctuation) for line in f for word in line.split())
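
One detail worth checking when stripping punctuation: a token that consists only of punctuation becomes an empty string and would be counted as a word. The sketch below shows one way to filter those out. The sample lines are made up and stand in for the lines of the file. Note also that string.punctuation only covers ASCII punctuation, so other marks would need to be added to the strip set by hand.

```python
from collections import Counter
from string import punctuation

# Hypothetical sample lines standing in for the lines of the text file.
lines = [
    "the cat sat, the dog sat.",
    "the bird flew - away!",
]

# Strip leading/trailing punctuation from each token. A token made only of
# punctuation (such as "-") becomes an empty string, so those are dropped.
stripped = (word.strip(punctuation) for line in lines for word in line.split())
cn = Counter(word for word in stripped if word)

# Keep only the words seen more than once.
frequent = {word: n for word, n in cn.items() if n > 1}
print(frequent)
```

With this sample input, only "the" and "sat" survive the filter.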

Ami Tavory

import collections

c = collections.Counter(['a', 'a', 'b']) # Just an example - use your words

# items() works on both Python 2 and 3; iteritems() exists only on Python 2
[w for (w, n) in c.items() if n > 1]

jotakah

Padraic's solution works great, but here is one that can go directly underneath your code instead of rewriting it completely:

newdictionary = {}
for k,v in count.items():
    if v != 1:
        newdictionary[k] = v
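
For the follow-up about a bigram matrix with add-one smoothing, here is a minimal sketch using only the standard library. It rests on several assumptions that are not in the question: the token list is a made-up sample, rare words are replaced by a placeholder "<UNK>" rather than deleted (so neighbouring words stay neighbours), and the matrix is stored as nested dicts. Bigrams are taken over the token sequence, because nltk.bigrams(known_words) would pair up entries of the unique-word list rather than adjacent words in the text.

```python
from collections import Counter

# Hypothetical token list standing in for the training tokens.
train = "a b a b c a b d".split()

# Map words seen only once to a placeholder so the sequence stays intact.
# "<UNK>" is an illustrative choice, not something from the question.
counts = Counter(train)
tokens = [w if counts[w] > 1 else "<UNK>" for w in train]

vocab = sorted(set(tokens))
V = len(vocab)  # vocabulary size used in the add-one denominator

# Count how often each word starts a bigram, and how often each pair occurs.
unigram = Counter(tokens[:-1])
bigram = Counter(zip(tokens, tokens[1:]))

def smoothed_prob(prev, word):
    """Add-one (Laplace) estimate of P(word | prev)."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + V)

# Bigram matrix as nested dicts: matrix[prev][word] = P(word | prev)
matrix = {p: {w: smoothed_prob(p, w) for w in vocab} for p in vocab}

for prev in vocab:
    print(prev, matrix[prev])
```

Each row of the matrix sums to 1, and pairs that never occurred still get a small non-zero probability, which is the point of the add-one method.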