I want to build a matrix for a bigram model. How can I do it? Any suggestions that fit my code, please?
import nltk
from collections import Counter
import codecs

with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
    for line in file:
        token = line.split()

spl = 80*len(token)/100
train = token[:int(spl)]
test = token[int(spl):]
print(len(test))
print(len(train))
cn = Counter(train)
known_words = [word for word, v in cn.items() if v > 1]  # removes the rare words and puts them in a list
bigram = nltk.bigrams(known_words)
frequency = nltk.FreqDist(bigram)
for f in frequency:
    print(f, frequency[f])
I need something like:
      w1       w2       w3       ....  wn
w1    n(w1w1)  n(w1w2)  n(w1w3)  ....  n(w1wn)
w2    n(w2w1)  n(w2w2)  n(w2w3)  ....  n(w2wn)
w3    .
.
.
.
wn
The same for all rows and columns.
Since you need a "matrix" of words, you'll use a dictionary-like class. You want a dictionary of all first words in bigrams. To make a two-dimensional matrix, it will be a dictionary of dictionaries: Each value is another dictionary, whose keys are the second words of the bigrams and values are whatever you're tracking (probably number of occurrences).
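The dictionary-of-dictionaries idea could be sketched roughly as follows. This is only an illustration, not the one right way to do it: the helper names `bigram_table` and `format_matrix` are made up here, the toy `tokens` list stands in for your `train` list, and it assumes the bigrams are counted over the tokens in their original order.

```python
from collections import Counter, defaultdict

def bigram_table(tokens):
    """Return {first_word: Counter({second_word: count})} for adjacent token pairs."""
    table = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        table[first][second] += 1
    return table

def format_matrix(table, vocab):
    """Render the table as text: one row and one column per word in vocab."""
    lines = ["\t".join([""] + vocab)]          # header row: w1 w2 ... wn
    for first in vocab:
        row = table.get(first, {})             # words never seen first get an empty row
        counts = [str(row.get(second, 0)) for second in vocab]
        lines.append("\t".join([first] + counts))
    return "\n".join(lines)

# Toy data standing in for the `train` list from the question (an assumption).
tokens = "a b a b c a".split()
table = bigram_table(tokens)
vocab = sorted(set(tokens))
print(format_matrix(table, vocab))
```

Here `table["a"]["b"]` is the number of times "b" directly follows "a". Using a `Counter` for the inner dictionary means a pair that never occurred simply reads as 0 instead of raising a `KeyError`.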
In NLTK you can do it quickly with a ConditionalFreqDist(). But I recommend you build your bigram table step by step. You'll understand it better, and you need to understand it before you can use it.
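For reference, the ConditionalFreqDist() shortcut mentioned above might look something like this. It is a minimal sketch: the toy `tokens` list is an assumption standing in for your `train` list, and the bigrams are taken over the tokens in their original order.

```python
import nltk

# Toy data standing in for the `train` list from the question (an assumption).
tokens = "a b a b c a".split()

# Each bigram is treated as a (condition, sample) pair:
# the first word is the condition, the second word is the sample.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

print(cfd["a"]["b"])   # count of the bigram ("a", "b")
cfd.tabulate()         # prints the counts as a table, one row per first word
```

Each `cfd[w1]` is a `FreqDist` over the words that follow `w1`, so it is the same dictionary-of-dictionaries shape described above, with the counting done for you.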