I have to run three different kinds of comparisons between different data mining algorithms.
The only type of comparison that is problematic for me is the most basic one: two algorithms on a single data set.
I am aware of the Dietterich (1998) paper, which recommends McNemar's test and 5x2cv as the options of choice and states that the resampled t-test is infeasible. However, as the analysis forms part of a larger setup using subsamples, 60:40 training:test splits, and total cost as the performance measure, I cannot use those.
Which other options are there to evaluate the performance in this case?
Sign test: simply count the number of cases in which each of the two algorithms performs better, then obtain the p-value from the binomial distribution. Problematic because it has very low power.
Wilcoxon signed-rank test: as the non-parametric alternative to the t-test, this was the first one I thought of, but I have not seen it mentioned in any paper for this kind of comparison, only for comparing two algorithms on several data sets using the average performance over several iterations. Is it infeasible here and, if so, why?
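To make the two candidates concrete, here is a minimal sketch of both tests applied to paired total costs, one pair per subsample. The function names `sign_test` and `signed_rank_test` and the cost values are hypothetical, made up purely for illustration. The signed-rank p-value is obtained by enumerating all sign flips, which is only practical for a small number of pairs. Both tests also assume the pairs are independent, which overlapping subsamples of a single data set may not satisfy.

```python
from itertools import product
from math import comb


def sign_test(a, b):
    """Two-sided exact sign test on paired values; ties are dropped."""
    wins_a = sum(x < y for x, y in zip(a, b))  # lower cost counts as a win
    wins_b = sum(x > y for x, y in zip(a, b))
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # P(X <= k) under Binomial(n, 0.5), doubled for two sides, capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def signed_rank_test(a, b):
    """Two-sided exact Wilcoxon signed-rank test by enumerating sign flips."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank(v):
        # mid-rank: average position of all absolute differences equal to v
        first = abs_sorted.index(v) + 1
        count = abs_sorted.count(v)
        return first + (count - 1) / 2

    ranks = [rank(abs(d)) for d in diffs]
    observed = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    # Under the null hypothesis every sign pattern is equally likely
    extreme = 0
    for signs in product((1, -1), repeat=len(ranks)):
        stat = sum(s * r for s, r in zip(signs, ranks))
        if abs(stat) >= abs(observed) - 1e-9:
            extreme += 1
    return extreme / 2 ** len(ranks)


# Hypothetical total costs of two algorithms on eight 60:40 splits
cost_a = [102.0, 98.5, 110.0, 95.0, 101.0, 99.0, 97.5, 104.0]
cost_b = [108.0, 101.0, 109.0, 103.0, 107.5, 104.0, 99.0, 111.0]

print("sign test p-value:       ", sign_test(cost_a, cost_b))
print("signed-rank test p-value:", signed_rank_test(cost_a, cost_b))
```

On these made-up numbers the signed-rank test yields the smaller p-value, because it uses the size of each cost difference and not only its sign.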
One obvious difference between the two is that the Wilcoxon signed-rank test requires you to compute the difference between the two members of each pair and then rank these differences. If the only information you have for each member of a pair is whether the data-mining procedure classified it correctly, then the difference can take only three values (-1, 0, and 1), all non-zero differences tie in rank, and the Wilcoxon signed-rank test becomes equivalent to the McNemar test, which is in fact simply a way of calculating an approximate tail value of the sign test. If it makes sense to compare the results from the two members of a pair, but not to subtract them and get a number, then you are again back with the sign test.
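The following sketch illustrates that point for 0/1 correctness data. The helper names and the two correctness vectors are hypothetical. The exact version is just the sign test applied to the discordant cases, and the approximate version is the usual continuity-corrected chi-square form of McNemar's test with one degree of freedom, whose tail probability can be written with `erfc`.

```python
from math import comb, erfc, sqrt


def discordant_counts(correct_a, correct_b):
    """Count the cases that only A, or only B, classified correctly."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a == 1 and b == 0)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if a == 0 and b == 1)
    return only_a, only_b


def exact_mcnemar(only_a, only_b):
    """Two-sided exact binomial (sign) test on the discordant cases."""
    n = only_a + only_b
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def approx_mcnemar(only_a, only_b):
    """Continuity-corrected chi-square approximation (1 degree of freedom)."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / n
    # For a chi-square variable with 1 d.f., P(X > stat) = erfc(sqrt(stat / 2))
    return erfc(sqrt(stat / 2))


# Hypothetical results: 1 = test case classified correctly, 0 = misclassified
correct_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
correct_b = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1]

# The paired differences take only the values -1, 0 and 1
differences = sorted({a - b for a, b in zip(correct_a, correct_b)})

only_a, only_b = discordant_counts(correct_a, correct_b)
print("distinct differences:", differences)
print("discordant counts:   ", only_a, only_b)
print("exact p-value:       ", exact_mcnemar(only_a, only_b))
print("approximate p-value: ", approx_mcnemar(only_a, only_b))
```

With so few discordant cases the exact binomial form is the safer choice; the chi-square form is only an approximation to the same tail probability.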
This sounds like an exercise to get you to do a number of statistical tests, but if this were something in real life, my first thought would be to work out why you really cared about running a data mining exercise, perhaps reduce this to a value in terms of money, and then look for the test that represented that best.