I am trying to implement the Maxent Classifier, but I am facing a problem while using the IIS algorithm. The following code works fine with the GIS algorithm:
```python
import nltk
from nltk.classify import MaxentClassifier, accuracy
from featx import split_label_feats, label_feats_from_corpus
from nltk.corpus import movie_reviews
from nltk.classify import megam
from openpyxl import load_workbook
from featx import bag_of_non_words
from nltk.tokenize import word_tokenize

movie_reviews.categories()
lfeats = label_feats_from_corpus(movie_reviews)
lfeats.keys()
train_feats, test_feats = split_label_feats(lfeats)
me_classifier = nltk.MaxentClassifier.train(train_feats, algorithm='iis', trace=0, max_iter=3)
print accuracy(me_classifier, test_feats)
```
I am working on a Win32 machine, and the above code is from the NLTK book by Jacob Perkins. The warnings it throws are:
```
C:\Python27\lib\site-packages\nltk\classify\maxent.py:1308: RuntimeWarning: invalid value encountered in multiply
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
C:\Python27\lib\site-packages\nltk\classify\maxent.py:1309: RuntimeWarning: invalid value encountered in multiply
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
C:\Python27\lib\site-packages\nltk\classify\maxent.py:1315: RuntimeWarning: invalid value encountered in divide
  deltas -= (ffreq_empirical - sum1) / -sum2
```
And then the computer hangs, so I have to stop the execution.
Firstly, the way you're importing your libraries, unsorted, is too confusing, and there are a lot of unused imports. So let's cut down the imports and stick with this:
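Something like the following should cover everything the script actually uses (featx is Perkins' example module; more on that below):

```python
from nltk.classify import MaxentClassifier, accuracy
from nltk.corpus import movie_reviews
from featx import split_label_feats, label_feats_from_corpus
```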
After some googling, I found that featx is an example module that Jacob Perkins was using for his book, and that this is a better source for it: https://github.com/sophist114/Python/blob/master/EmotionAnalysis.py. So here's a documented version, with some explanation of what the functions are doing:
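The following is a sketch reconstructed from that source (the function names match what your script imports; treat the details as illustrative rather than as Perkins' exact code):

```python
import collections

def bag_of_words(words):
    # Mark every word in the document as a present (True) feature.
    return dict([(word, True) for word in words])

def label_feats_from_corpus(corp, feature_detector=bag_of_words):
    # Build {label: [featureset, featureset, ...]} from a categorized corpus.
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats

def split_label_feats(lfeats, split=0.75):
    # Flatten {label: featuresets} into (featureset, label) pairs,
    # keeping the first 75% of each label for training, the rest for testing.
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats
```

Now let's go through the process of training the model and testing it. First, the feature extraction:

```python
lfeats = label_feats_from_corpus(movie_reviews)
```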
Let's see what we get after calling label_feats_from_corpus:
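One way to peek at the result (the printed order will vary, since the featuresets are plain Python 2.7 dicts; the words shown are from the first neg review in the stock corpus):

```python
print(lfeats.keys())                       # ['neg', 'pos']
print(list(lfeats['neg'][0].items())[:5])  # e.g. [('plot', True), (':', True), ('two', True), ...]
```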
So we get documents with the neg label, and for each word in a document we see that the value is True, i.e. ALL of a document's words show up as True features. For now, each document only contains features for the words it actually has. Let's move on:
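Next, the train/test split, using the split_label_feats sketched above:

```python
train_feats, test_feats = split_label_feats(lfeats)
```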
Now we see that split_label_feats changes the key-value structure, such that each item of train_feats is one document as a (features, label) tuple:
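Peeking at the split (the counts assume the stock movie_reviews corpus of 1000 pos and 1000 neg reviews with the default 75/25 split):

```python
first_feats, first_label = train_feats[0]
print(first_label)                        # 'neg'
print(len(train_feats), len(test_feats))  # (1500, 500) on Python 2.7
```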
So it seems like the error can only be caused by your last two lines of code. When you run this line:
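i.e. the training call from your script (using the trimmed imports above):

```python
me_classifier = MaxentClassifier.train(train_feats, algorithm='iis', trace=0, max_iter=3)
```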
You get these warnings, but do note that the code is still building the model! They're just warnings due to numerical underflow; see What are arithmetic underflow and overflow in C?
It takes a while to build the classifier, but fear not: just wait till it finishes, and don't Ctrl+C to end the Python process (if you kill it, all you'll see is a KeyboardInterrupt traceback, and the training done so far is lost). So let's understand why the warnings occur.
All of them point to the same function used to calculate delta in NLTK's maxent implementation, i.e. https://github.com/nltk/nltk/blob/develop/nltk/classify/maxent.py#L1208 , and it turns out that this delta calculation is specific to the IIS (Improved Iterative Scaling) algorithm.
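To get a feel for how numpy ends up emitting "invalid value encountered in multiply", here's a toy reproduction of my own (not NLTK's actual matrices, just the same failure mode): once an exp() overflows to inf, multiplying by a zero entry yields nan, which triggers exactly this warning.

```python
import numpy

nf_delta = numpy.array([1000.0, 1.0])
exp_nf_delta = numpy.exp(nf_delta)          # overflows: [inf, 2.718...], with a RuntimeWarning
A = numpy.array([0.0, 0.5])
sum1 = numpy.sum(exp_nf_delta * A, axis=0)  # inf * 0.0 -> nan: "invalid value encountered in multiply"
print(sum1)                                 # nan
```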
At this point, you need to learn about machine learning and supervised learning: https://en.wikipedia.org/wiki/Supervised_learning
To answer your question: the warning is merely an indication that delta is hard to calculate at some point, but it's still reasonable to deal with, possibly because of some extremely small values cropping up while calculating delta. The algorithm IS working. It's not hanging, it's training.
In order to appreciate the neat implementation of MaxEnt in NLTK, I suggest you go through this course, https://www.youtube.com/playlist?list=PL6397E4B26D00A269 , or for a more hardcore machine learning course, go to https://www.coursera.org/course/ml
Training a classifier takes time and computing juice, and after you wait long enough, you should see that it does finish and print a score:
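i.e. the final line from your script runs; the exact number can vary from run to run, but it lands around the chance baseline:

```python
print(accuracy(me_classifier, test_feats))
```

[out]:

```
0.5
```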
You can see that the accuracy is bad, as expected, since the delta calculation is going off the rails; 0.5 is your baseline. Go through the courses listed above and you should be able to produce better classifiers once you know how they come about and how to tune them.
BTW, remember to pickle your classifier so that you don't have to retrain it the next time; see Save Naive Bayes Trained Classifier in NLTK and Pickling a trained classifier yields different results from the results obtained directly from a newly but identically trained classifier.
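A minimal sketch using the standard library's pickle (cPickle also works on Python 2):

```python
import pickle

# Save the trained classifier to disk...
with open('me_classifier.pickle', 'wb') as fout:
    pickle.dump(me_classifier, fout)

# ...and load it back later without retraining.
with open('me_classifier.pickle', 'rb') as fin:
    me_classifier = pickle.load(fin)
```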
Here's the full code:
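(Assembled from the pieces above; assumes the featx sketch is saved as featx.py next to this script.)

```python
from nltk.classify import MaxentClassifier, accuracy
from nltk.corpus import movie_reviews
from featx import split_label_feats, label_feats_from_corpus

# Extract {label: [featureset, ...]} from the categorized corpus.
lfeats = label_feats_from_corpus(movie_reviews)
# Flatten into (featureset, label) pairs, split 75/25 into train/test.
train_feats, test_feats = split_label_feats(lfeats)
# Train with IIS; the RuntimeWarnings are expected, just let it run.
me_classifier = MaxentClassifier.train(train_feats, algorithm='iis', trace=0, max_iter=3)
print(accuracy(me_classifier, test_feats))
```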