Although it wasn't originally conceived this way, the standard AdaBoost algorithm is equivalent to forward stagewise additive modeling with an exponential loss function. That is, given weak classifiers c_1, ..., c_M and sample points x_1, ..., x_N with labels y_1, ..., y_N, the weights come out of the following algorithm:
- Set F_0(x) = 0
- For m = 1 to M: set (w_m, f_m) = arg min over (w, c) of sum_i L(y_i, F_{m-1}(x_i) + w * c(x_i))
- Set F_m(x) = F_{m-1}(x) + w_m * f_m(x)
The strong classifier is the output, F_M(x). The loss function that makes this strong learner identical to the AdaBoost output is
L(y,f(x)) = exp(-y*f(x))
for classifiers taking values in {-1, 1}. This is all explained in Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Section 10.4.
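To make the equivalence concrete, here is a minimal numpy sketch of the stagewise procedure above over a small pool of decision stumps. The function names and the toy data are my own, but the update it performs (pick the stump with the lowest weighted error, then set w_m = 0.5 * log((1 - err) / err)) is the closed-form solution of the exponential-loss minimization at each stage, i.e. exactly the AdaBoost step derived in ESL 10.4.

```python
import numpy as np

def stump_predictions(X):
    """Enumerate simple decision stumps sign(x_j - t) and their negations,
    with thresholds t at the observed feature values."""
    preds, descs = [], []
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            p = np.where(X[:, j] > t, 1.0, -1.0)
            preds.append(p);  descs.append((j, t, +1))
            preds.append(-p); descs.append((j, t, -1))
    return np.array(preds), descs          # preds has shape (n_stumps, N)

def stagewise_exponential_boost(X, y, M=25):
    """Forward stagewise additive modeling with exponential loss.
    Each stage minimizes sum_i exp(-y_i * (F_{m-1}(x_i) + w * c(x_i))) over
    (w, c); for +/-1 classifiers this reduces to picking the stump with the
    lowest weighted error and setting w = 0.5 * log((1 - err) / err)."""
    preds, descs = stump_predictions(X)
    F = np.zeros(len(y))                   # F_0(x_i) = 0
    model = []                             # list of (w_m, stump description)
    for m in range(M):
        u = np.exp(-y * F)                 # exponential-loss sample weights
        u /= u.sum()
        errs = (u * (preds != y)).sum(axis=1)   # weighted error of every stump
        best = int(np.argmin(errs))
        err = float(np.clip(errs[best], 1e-12, 1 - 1e-12))
        w = 0.5 * np.log((1 - err) / err)       # closed-form arg min over w
        F += w * preds[best]                    # F_m = F_{m-1} + w_m * f_m
        model.append((w, descs[best]))
    return model, np.sign(F)                    # ensemble and its fitted signs

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
model, fitted = stagewise_exponential_boost(X, y)
print("training accuracy:", (fitted == y).mean())
```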
My question has to do with forward stagewise regression. It is a greedy algorithm: once w_i is estimated it is fixed, then w_{i+1} is found, and so on. This seems to be really designed for "independent" weak classifiers, like stump classifiers or tree classifiers restricted to mutually exclusive independent variables (features), so that the residual left after fitting a classifier is no longer explained by that classifier.
In other words, to fit a set of functions to a given target function, I wouldn't fit the first one, fix its coefficient, then find the optimal coefficient for the second while holding the first constant, and so on... unless I knew the functions were independent. But this is, more or less, what the algorithm does.
Does this explain the success of AdaBoost with stump learners or decision trees compared with (in my experience) AdaBoost with more comprehensive classifiers such as an SVM or a linear model? Can someone give a reference? I have not seen this aspect discussed in the literature. Thanks.
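To make the concern concrete, here is a small toy sketch (my own illustration, not from any reference): two non-orthogonal basis functions are fitted to a target greedily, as the stagewise algorithm would (estimate the first coefficient, freeze it, then estimate the second), versus jointly by ordinary least squares.

```python
import numpy as np

x = np.linspace(-1, 1, 200)
y = 2.0 * x + 1.5 * x**2            # target to be fitted

# two NON-orthogonal basis functions (they share the linear component)
h1 = x
h2 = x + x**2

def ls_coef(target, basis):
    """Least-squares coefficient for a single basis function."""
    return basis @ target / (basis @ basis)

# --- greedy stagewise fit: estimate a, freeze it, then estimate b ---
a = ls_coef(y, h1)
b = ls_coef(y - a * h1, h2)
greedy_resid = np.sum((y - a * h1 - b * h2) ** 2)

# --- joint fit: estimate both coefficients simultaneously ---
H = np.column_stack([h1, h2])
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
joint_resid = np.sum((y - H @ coef) ** 2)

print(f"greedy residual SS: {greedy_resid:.4f}")
print(f"joint  residual SS: {joint_resid:.4f}")   # ~0: joint fit recovers y exactly
```

If h1 and h2 were orthogonal over the sample, the two fits would coincide; the gap comes entirely from the correlation between the basis functions, which is the situation I have in mind with "non-independent" weak learners.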
Think about the simplest example, fitting a 1-D curve with a linear model. Instead of approximating the curve in one shot, you learn it: each time you pick two data points and learn the line passing through them. After many such rounds, the final line is obtained by averaging all the parameters (weights) you learned, and that line achieves the lowest in-sample error. This is equivalent to the learning process in which you update the previous parameters given each new training set.
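If I read the procedure right (assuming "two data sets" means two data points, since two points determine a line), a literal sketch of it looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# noisy 1-D data roughly on a line
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

slopes, intercepts = [], []
for _ in range(2000):
    i, j = rng.choice(len(x), size=2, replace=False)   # pick two points
    slope = (y[j] - y[i]) / (x[j] - x[i])              # line through them
    slopes.append(slope)
    intercepts.append(y[i] - slope * x[i])

# average all the learned parameters
print("averaged line: y = %.2f * x + %.2f" % (np.mean(slopes), np.mean(intercepts)))
```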
I am not sure whether I understand your question correctly, but if you are trying to fit the example above with different models (linear, quadratic, cubic, exponential, ...), the number of weights differs from model to model, so the greedy approach people use in classification problems may not fit well. One way to resolve this may be to put a weight on each model and use boosting to determine which model best fits the training data.
An alternative way to do this regression is to use a neural network as the weak learner. One study applied back-propagation networks this way: each time, a subset of the training set was randomly picked for the learning and boosting process, and stagewise additive modeling was used to update the weights. The error calculation and the weight factor are slightly different from, but similar in form to, those used in classification. The results indicate that the boosted neural network is more stable than plain back-propagation for regression.
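I don't have the exact scheme of that study, but a rough sketch of "neural network as the weak learner" can be put together with scikit-learn's AdaBoost.R2 regressor, which also re-draws the training set according to the boosting weights each round. This assumes scikit-learn >= 1.2, where the weak learner is passed via the estimator argument; the error and weight calculations will differ in detail from the study's.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# a deliberately small network as the weak learner
weak_net = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)

# AdaBoost.R2: each round resamples the training set by the boosting weights
ada_net = AdaBoostRegressor(estimator=weak_net, n_estimators=20,
                            learning_rate=0.5, random_state=0)
ada_net.fit(X_tr, y_tr)
print("boosted-MLP R^2 on held-out data:", ada_net.score(X_te, y_te))
```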
On the classification problem: I am also trying to understand why AdaBoost with stump learners is better than AdaBoost with an SVM. Since AdaBoost is a greedy feature selector, given the same feature set an SVM would be expected to outperform AdaBoost, wouldn't it? In practice it is feasible to use AdaBoost to select important features and an SVM to classify the examples. You can also build an AdaBoost tree that sends the examples falling inside the SVM margin to the child nodes and re-trains them with an SVM until they are correctly classified.
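As a sketch of the "AdaBoost selects features, SVM classifies" idea (my own illustration with scikit-learn, not the margin-based tree re-training scheme mentioned at the end): boost stumps, rank features by their importance in the ensemble, then train an SVM on only the top-ranked features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=30, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: boost stumps (the default base learner) and rank features
# by their importance in the boosted ensemble.
booster = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
top = np.argsort(booster.feature_importances_)[::-1][:5]   # keep 5 "best" features

# Stage 2: train an SVM only on the AdaBoost-selected features.
svm = SVC(kernel="rbf", gamma="scale").fit(X_tr[:, top], y_tr)
print("SVM on AdaBoost-selected features:", svm.score(X_te[:, top], y_te))
print("SVM on all features:              ",
      SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr).score(X_te, y_te))
```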