Orange Bayes algorithm with continuous features


I have a two-class Bayes classification problem with four continuous features. I'm trying to partially reproduce the Bayes algorithm that Orange uses for calculating probabilities, but I haven't managed to obtain the same values that Orange outputs.

Data set size : 150 (class0 : 88 and class1 : 62)

I use the following algorithm

p(class0 | X1, X2, X3, X4) = L0 / (L0 + L1)
p(class1 | X1, X2, X3, X4) = L1 / (L0 + L1)

where L0 and L1 are likelihoods

L0 = prior_class0 * product( p(Xi|class0) )
L1 = prior_class1 * product( p(Xi|class1) )

prior_class0 and prior_class1 are Laplace estimates

prior_class0 = (88 + 1) / (150 + 2)
prior_class1 = (62 + 1) / (150 + 2)
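In Python, the computation described above looks roughly like the sketch below. The function names and the conditional probability values are made up for illustration; only the class counts come from the data set.

```python
from math import prod

def laplace_priors(class_counts):
    """Laplace estimate per class: (n_c + 1) / (N + number_of_classes)."""
    total = sum(class_counts)
    k = len(class_counts)
    return [(n + 1) / (total + k) for n in class_counts]

def posteriors(priors, conditionals):
    """conditionals[c] holds p(Xi | class c) for every feature i."""
    scores = [p * prod(cond) for p, cond in zip(priors, conditionals)]
    norm = sum(scores)
    return [s / norm for s in scores]

priors = laplace_priors([88, 62])  # (88 + 1) / 152 and (62 + 1) / 152

# Hypothetical values of p(Xi | class) for the four features.
conditionals = [
    [0.30, 0.20, 0.50, 0.40],  # class0
    [0.10, 0.25, 0.20, 0.60],  # class1
]
post = posteriors(priors, conditionals)
```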

Orange uses LOESS for calculating conditional probabilities (I guess it's not necessary to reproduce that). For this dataset it outputs 49 points for both classes, available in the Python object classifier.conditional_distributions. By linearly interpolating between the points surrounding Xi, I can calculate p(Xi|class0) and p(Xi|class1).
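The interpolation step can be sketched as follows. This assumes the points have already been extracted into a list of (x, value) pairs sorted by x; the layout of classifier.conditional_distributions itself is not shown here, and clamping outside the sampled range is a choice made for this sketch, not something taken from Orange.

```python
from bisect import bisect_right

def interpolate(points, x):
    """Linearly interpolate the value at x from (x, value) pairs sorted by x.

    Outside the sampled range the nearest end point is returned
    (an assumption of this sketch).
    """
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    hi = bisect_right(xs, x)  # index of the first point to the right of x
    (x1, y1), (x2, y2) = points[hi - 1], points[hi]
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

# Hypothetical curve with three points instead of 49.
curve = [(1.0, 0.10), (2.0, 0.30), (4.0, 0.20)]
value = interpolate(curve, 3.0)  # halfway between 0.30 and 0.20
```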

1) Can anyone comment on Orange's Bayes algorithm with continuous features?

2) Or is there any technical advice on how to set up a compiler/IDE so that I can debug the Orange C++ code and inspect intermediate results from functions in orange/source/orange/bayes.cpp?

There is 1 answer

JanezD (best answer)

Orange uses a slightly different formula that, according to Kononenko, gives the same result but allows for better interpretability and for m-estimates of probabilities. Instead of product( p(Xi|class0) ) it computes product( p(class0|Xi) / p(class0) ). I don't think this should affect your computation, but you can check. The code that computes those probabilities is at https://github.com/biolab/orange/blob/master/source/orange/bayes.cpp#L169. Note that it does this for all classes in parallel.
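The two forms agree because, by Bayes' rule, p(Xi|class) = p(class|Xi) * p(Xi) / p(class), and the factor p(Xi) is the same for every class, so it cancels in the normalization. The sketch below checks this on made-up counts for two discrete features, using plain relative frequencies. With smoothed estimates (Laplace or m-estimate) the two forms need not match exactly, which is one thing to check when comparing against Orange's output.

```python
from math import prod

# Hypothetical counts: counts[feature][value] = [count in class0, count in class1].
counts = [
    {"a": [30, 10], "b": [20, 40]},
    {"u": [35, 5], "v": [15, 45]},
]
class_totals = [50, 50]
n = sum(class_totals)
prior = [c / n for c in class_totals]

def normalize(scores):
    total = sum(scores)
    return [s / total for s in scores]

def likelihood_form(example):
    # prior * product( p(Xi | class) )
    return normalize([
        prior[c] * prod(counts[i][v][c] / class_totals[c]
                        for i, v in enumerate(example))
        for c in range(2)
    ])

def ratio_form(example):
    # prior * product( p(class | Xi) / p(class) )
    return normalize([
        prior[c] * prod(counts[i][v][c] / sum(counts[i][v]) / prior[c]
                        for i, v in enumerate(example))
        for c in range(2)
    ])

first = likelihood_form(("a", "v"))
second = ratio_form(("a", "v"))
```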

The other piece of the code you're interested in is the computation of probabilities from LOESS density estimates. It's at https://github.com/biolab/orange/blob/master/source/orange/estimateprob.cpp#L307. Note that most operations there are on vectors, e.g. all variables in *result *= (x-x1)/(x2-x1); are actually vectors.
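To illustrate what "operations on vectors" means here, the sketch below interpolates a whole vector of values (say, one per class) between two neighboring points in a single step. It is not a transcription of estimateprob.cpp; the function name and the numbers are made up.

```python
def interpolate_vectors(x, x1, y1, x2, y2):
    """Interpolate every component of y between (x1, y1) and (x2, y2)."""
    t = (x - x1) / (x2 - x1)  # the same scalar factor applies to all components
    return [a + (b - a) * t for a, b in zip(y1, y2)]

# Hypothetical per-class values at x1 = 2.0 and x2 = 3.0.
per_class = interpolate_vectors(2.5, 2.0, [0.6, 0.4], 3.0, [0.2, 0.8])
```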

As for debugging, I wrote this code (many years ago, and somewhat ashamed to admit -- seeing the terrible coding style I used) with Visual Studio. I forgot the version and can't check it since I no longer use Windows. But I never really debugged Orange on any other OS.

If you load the project and build a debug version, you'll also have to build a debug version of Python. This is actually simple (see the instructions in the Python source code); the problem is that you'd also have to build debug versions of any other binary libraries you use (e.g. numpy). A simpler way is to build a release version of Orange but switch the debug info flags on. This way you can use the standard Python and libraries.