I am learning about spam detection using machine learning techniques, and a post I found on Stack suggests that I start with a Naive Bayesian Classifier.
My question is this: if an attribute I am measuring is discrete rather than continuous, how should it be incorporated? In this example in Wikipedia, they are training a classifier to detect male vs female based on height, weight, and foot size. What if there was a fourth attribute, "Favorite Sport"? In my hypothetical sample, say you had "Football, Football, Swimming, Ice Skating". These values are discrete/enumerated, not continuous. Could you still use a naive Bayesian classifier? I could map these values to integers (Football = 1, Swimming = 2), but differences in things like height and weight carry an implied meaning (5 ft is very unlike 10 ft), whereas differences between values of an enumeration carry none (Football - Swimming = -1, so what?).
Basically, could I still use a Naive Bayesian Classifier if the values I had were height, weight, foot size, and favorite sport?
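Yes. In Bayesian classification you just need to determine the class-specific distribution of each attribute on its support, which you can easily estimate from the data. For a discrete attribute such as favorite sport, that distribution is simply a table of probabilities, one per value, estimated from how often each value occurs within each class, so there is no need to map the values to integers. The continuous attributes keep their own class-conditional distributions (Gaussian in the Wikipedia example), and under the naive independence assumption the per-attribute likelihoods are multiplied together. You can then compute the posterior probability of each class and pick the class with the highest posterior (the MAP estimate). This is in fact how it works for documents: the distribution is defined for each word of a dictionary, given the document class (spam or not spam). For details, refer to Andrew Ng's notes on introductory machine learning.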
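To make this concrete, here is a minimal sketch in plain Python of a naive Bayes classifier that mixes Gaussian likelihoods for height, weight, and foot size with a frequency table for favorite sport. The function names (`fit`, `predict`, `log_gauss`), the add-one smoothing parameter `alpha`, and the toy data are all made up for illustration; they are not taken from the Wikipedia example or from any library.

```python
import math
from collections import Counter

def fit(rows, labels, sport_values, alpha=1.0):
    """rows are (height, weight, foot_size, sport) tuples; alpha is add-one smoothing."""
    model = {}
    for c in set(labels):
        group = [r for r, y in zip(rows, labels) if y == c]
        # Continuous attributes: per-class mean and sample variance (Gaussian assumption).
        gauss = []
        for j in range(3):
            vals = [r[j] for r in group]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)
            gauss.append((mean, var))
        # Discrete attribute: smoothed relative frequency of each sport within the class.
        counts = Counter(r[3] for r in group)
        denom = len(group) + alpha * len(sport_values)
        sport = {s: (counts[s] + alpha) / denom for s in sport_values}
        model[c] = {"prior": len(group) / len(labels), "gauss": gauss, "sport": sport}
    return model

def log_gauss(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Return the class with the highest log posterior (the MAP class)."""
    scores = {}
    for c, p in model.items():
        score = math.log(p["prior"])
        for x, (mean, var) in zip(row[:3], p["gauss"]):
            score += log_gauss(x, mean, var)
        score += math.log(p["sport"][row[3]])  # table lookup, no integer encoding
        scores[c] = score
    return max(scores, key=scores.get)

# Made-up toy data: (height ft, weight lbs, foot size, favorite sport).
SPORTS = ["Football", "Swimming", "Ice Skating"]
rows = [
    (6.0, 180, 12, "Football"), (5.9, 190, 11, "Football"),
    (5.6, 170, 12, "Swimming"), (5.9, 165, 10, "Football"),
    (5.0, 100, 6, "Ice Skating"), (5.5, 150, 8, "Swimming"),
    (5.4, 130, 7, "Ice Skating"), (5.75, 150, 9, "Swimming"),
]
labels = ["male"] * 4 + ["female"] * 4

model = fit(rows, labels, SPORTS)
print(predict(model, (6.0, 185, 12, "Football")))
print(predict(model, (5.2, 110, 6, "Ice Skating")))
```

The smoothing matters because a sport never seen within a class would otherwise get probability zero and wipe out the whole product for that class. The sport only ever enters through a lookup in the per-class table, so the ordering of the labels plays no role.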