How to create a bigram matrix?

I want to build a matrix of bigram counts for my bigram model. How can I do it? Suggestions that fit my code below would be appreciated.

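 import codecs
 from collections import Counter

 import nltk

 # Read the whole file into one token list (extend, so earlier lines are not overwritten).
 tokens = []
 with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
     for line in file:
         tokens.extend(line.split())

 # 80/20 train/test split.
 spl = 80 * len(tokens) // 100
 train = tokens[:spl]
 test = tokens[spl:]
 print(len(test))
 print(len(train))

 # Words seen more than once in the training data; rare words are dropped.
 cn = Counter(train)
 known_words = {word for word, v in cn.items() if v > 1}

 # Count bigrams over the training tokens, keeping only pairs of known words.
 bigrams = [(w1, w2) for w1, w2 in nltk.bigrams(train)
            if w1 in known_words and w2 in known_words]
 frequency = nltk.FreqDist(bigrams)
 for f in frequency:
     print(f, frequency[f])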

I need something like:

        w1        w2        w3        ...  wn
 w1     n(w1w1)   n(w1w2)   n(w1w3)   ...  n(w1wn)
 w2     n(w2w1)   n(w2w2)   n(w2w3)   ...  n(w2wn)
 w3     ...
 ...
 wn     ...

The same for all rows and columns.

There is 1 answer

alexis (best answer)

Since you need a "matrix" of words, you'll use a dictionary-like class. You want a dictionary of all first words in bigrams. To make a two-dimensional matrix, it will be a dictionary of dictionaries: Each value is another dictionary, whose keys are the second words of the bigrams and values are whatever you're tracking (probably number of occurrences).
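Here is a minimal sketch of that dictionary-of-dictionaries layout, using only the standard library. The toy token list and the name `bigram_counts` are made up for illustration; substitute your own training tokens.

```python
from collections import Counter, defaultdict

# Toy token list, invented for illustration (not from the question's data).
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Outer key: first word of the bigram.
# Inner Counter: second word -> number of times the pair occurred.
bigram_counts = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigram_counts[first][second] += 1

print(bigram_counts["the"]["cat"])  # 2
print(bigram_counts["cat"]["the"])  # 0, a Counter returns 0 for unseen keys
```

Only the pairs that actually occur are stored, so the table stays small even when the vocabulary is large.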

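With NLTK you can do it quickly using a ConditionalFreqDist():

from nltk.corpus import brown  # example corpus; needs nltk.download('brown')
mybigrams = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

Here mybigrams[w1][w2] is the number of times w2 follows w1.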

But I recommend building your bigram table step by step first. You'll understand it better, and you need to understand it before you can use it.
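One way the step-by-step version could look, as a sketch rather than the only way to do it. The helper name `bigram_matrix`, its `min_count` parameter, and the toy tokens are assumptions for illustration. It keeps words seen at least twice (matching the `v > 1` filter in the question) and counts a pair only when both words are kept, so dropping a rare word does not glue its neighbours together into a pair that never occurred.

```python
from collections import Counter

def bigram_matrix(tokens, min_count=2):
    """Return (vocab, matrix) with matrix[i][j] = n(vocab[i] vocab[j])."""
    counts = Counter(tokens)
    # Keep words seen at least min_count times (the question keeps v > 1).
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    # Walk adjacent pairs of the original tokens; skip pairs with a rare word.
    for first, second in zip(tokens, tokens[1:]):
        if first in index and second in index:
            matrix[index[first]][index[second]] += 1
    return vocab, matrix

# Toy data, invented for illustration.
tokens = "a b a b c a b a".split()
vocab, matrix = bigram_matrix(tokens)

# Print the table with a header row: rows are first words, columns second words.
print("\t" + "\t".join(vocab))
for word, row in zip(vocab, matrix):
    print(word + "\t" + "\t".join(str(n) for n in row))
```

A dense list of lists grows with the square of the vocabulary size, so for a real corpus the dictionary-of-dictionaries form is usually the more economical one; the dense form is mainly useful for printing or for handing to numeric code.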