I have a dictionary of word counts. I want to remove the words whose count is 1 from the dictionary. How can I do that? I also want to extract the bigram model of the remaining words. How can I do it?
import codecs
file=codecs.open("Pezeshki339.txt",'r','utf8')
txt = file.read()
txt = txt[1:]
token=txt.split()
count={}
for word in token:
    if word not in count:
        count[word]=1
    else:
        count[word]+=1
for k,v in count.items():
    print(k,v)
I was able to edit my code as follows. But I have a question about it: how can I create the bigram matrix and smooth it using the add-one method? I would appreciate any suggestions that match my code.
import nltk
from collections import Counter
import codecs
with codecs.open("Pezeshki339.txt",'r','utf8') as file:
    token=[]
    for line in file:
        token.extend(line.split())# collect the tokens of every line, not only the last one
spl = 80*len(token)/100
train = token[:int(spl)]
test = token[int(spl):]
print(len(test))
print(len(train))
cn=Counter(train)
known_words=([word for word,v in cn.items() if v>1])# removes the rare words and puts them in a list
print(known_words)
print(len(known_words))
bigram=nltk.bigrams(known_words)
frequency=nltk.FreqDist(bigram)
for f in frequency:
    print(f,frequency[f])
Use a Counter dict to count the words, then filter its .items(), removing the keys that have a value of 1.
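Something along these lines should work. This is only a sketch: the sample sentence is made up and stands in for the result of txt.split() on Pezeshki339.txt, and the name filtered is my own choice.

```python
from collections import Counter

# Hypothetical sample tokens; in the real script these would come from the file.
tokens = "the cat sat on the mat and the cat slept".split()

# Counter does the counting that the manual if/else loop did.
counts = Counter(tokens)

# Keep only the (word, count) pairs whose count is greater than 1.
filtered = {word: n for word, n in counts.items() if n > 1}

print(filtered)
```

The result is a plain dict, so the rest of the code can keep looping over .items() as before.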
If you just want the words, use a list comprehension.
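A rough sketch of the list version, again with made-up tokens in place of the real file contents (known_words is the name used in the question's second snippet):

```python
from collections import Counter

# Hypothetical sample tokens standing in for the words read from the file.
tokens = "a b a c b a d".split()
counts = Counter(tokens)

# Only the words are kept; the counts are used for filtering and then dropped.
known_words = [word for word, n in counts.items() if n > 1]

print(known_words)
```

Note that this list contains each surviving word once, so it is a vocabulary rather than the original word sequence.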
You don't need to call read; you can split each line as you go. Also, if you want to remove punctuation, you need to strip it from each word.
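Roughly like this. To keep the sketch runnable without the file, an io.StringIO object with made-up text stands in for the opened Pezeshki339.txt; iterating over it yields lines the same way a real file object does. Be aware that string.punctuation only covers ASCII punctuation, so punctuation marks from other scripts would need to be added to the strip set yourself.

```python
import io
from collections import Counter
from string import punctuation

# Stand-in for the opened file (an assumption, so the example is self-contained).
sample = io.StringIO("Hello, world!\nHello again, world.\nGoodbye.\n")

counts = Counter()
for line in sample:
    # strip(punctuation) removes leading and trailing ASCII punctuation only;
    # punctuation inside a word is left untouched.
    counts.update(word.strip(punctuation) for word in line.split())

# Drop the words that occur only once.
known = {word: n for word, n in counts.items() if n > 1}

print(known)
```

A token made up entirely of punctuation would be reduced to an empty string by strip, so you may want to skip empty strings before counting.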