I'm writing a naive bayes classifier for a class project and I just got it working... sort of. While I do get an error-free output, the winning output label had an output probability of 3.89*10^-85.
Wow.
I have a couple of ideas of what I might be doing wrong. Firstly, I am not normalizing the output percentages for the classes, so all of the percentages are effectively zero. While that would give me numbers that look nice, I don't know if that's the correct thing to do.
My second idea was to reduce the number of features. Our input data is a list of pseudo-images in the form of a very long text file. Currently, our features are just the binary value of every pixel of the image, and with a 28x28 image that's a lot of features. If I instead chopped the image into blocks of size, say, 7x7, how much would that actually improve the output percentages?
tl;dr Here's the general things I'm trying to understand about naive bayes:
1) Do you need to normalize the output percentages from testing each class?
2) How much of an effect does having too many features have on the results?
Thanks in advance for any help you can give me.
It could be normal. The output of a naive bayes is not meant to be a real probability. What it is meant to do is order a score among competing classes.
The reason why the probability is so low is that many Naive Bayes implementations are the product of the probabilities of all the observed features of the instance that is being classified. If you are classifying text, each feature may have a low conditional probability for each class (example: lower than 0.01). If you multiply 1000s of feature probabilities, you quickly end up with numbers such as you have reported.
Also, the probabilities returned are not the probabilities of each class given the instance, but an estimate of the probabilities of observing this set of features, given the class. Thus, the more you have features, the less likely it is to observe these exact features. A bayesian theorem is used to change
argmax_c P(class_c|features)
toargmax_c P(class_c)*P(features|class_c)
, and then theP(features|class_c)
is further simplified by making independence assumption, which allows changing that to a product of the probabilities of observing each individual feature given the class. These assumptions don't change the argmax (the winning class).If I were you, I would not really care about the probability output, focus instead on the accuracy of your classifier and take action to improve the accuracy, not the calculated probabilities.