number of trees in h2o.gbm


In traditional gbm, we can use predict.gbm(model, newdata=..., n.trees=...), so I can compare results on the test data for different numbers of trees.

With h2o.gbm, passing n.tree to h2o.predict seems to have no effect on the result; it is always the same as with the default model:

> h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
> R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
> h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
> R2(h2o.test.pred, test.mat$y)
[1] -0.00714109

Does anybody have a similar problem? How can I solve it? h2o.gbm is much faster than gbm, so it would be great if it could give detailed results for each tree.


There are 2 answers

1
Darren Cook On

I don't think H2O supports what you are describing.

But if what you are after is the performance against the number of trees used, that can be obtained at model-building time.

library(h2o)
h2o.init()

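iris <- as.h2o(iris)
# 80% train, 10% validation; the remaining 10% becomes the test set
parts <- h2o.splitFrame(iris, c(0.8, 0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
# Columns 1:4 are the predictors, column 5 (Species) is the response
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,             # maximum number of trees
             score_tree_interval = 1)  # score after every tree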

h2o.scoreHistory(m)
plot(m)

The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!

BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
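The early-stopping suggestion above could look roughly like this in H2O's Python API. This is only a sketch: the numeric values and the choice of logloss are illustrative assumptions, and the names EARLY_STOPPING_PARAMS and train_gbm_with_early_stopping are made up for this example. The arguments x, y, train and valid stand for your own column names and H2O frames.

```python
# Illustrative settings; the numeric values are assumptions, not recommendations.
EARLY_STOPPING_PARAMS = {
    "ntrees": 1000,             # upper bound; early stopping should end sooner
    "score_tree_interval": 1,   # score after every tree
    "stopping_rounds": 5,       # scoring rounds without improvement before stopping
    "stopping_metric": "logloss",
    "stopping_tolerance": 1e-3, # minimum relative improvement that counts
}


def train_gbm_with_early_stopping(x, y, train, valid):
    """Fit a GBM that stops adding trees once the validation metric stalls."""
    # Imported here so the settings above can be inspected without H2O installed.
    from h2o.estimators import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(**EARLY_STOPPING_PARAMS)
    model.train(x=x, y=y, training_frame=train, validation_frame=valid)
    return model
```

Because a validation frame is supplied, the stopping decision is based on validation data, which leaves the test frame untouched for a final check.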

0
nirvana-msu On

As of version 3.20.0.6, H2O does support this. The method you are looking for is staged_predict_proba. For classification models, it produces the predicted class probabilities after each iteration (tree) for every observation in your test frame. For regression models (i.e. when the response is numeric), it produces the actual prediction for every observation in your test frame, although this is not really documented.

From these predictions it is also easy to compute various performance metrics (AUC, R2, etc.), assuming that's what you're after.

Python API:

staged_predict_proba = model.staged_predict_proba(test)

R API:

staged_predict_proba <- h2o.staged_predict_proba(model, test)
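For the regression case in the question, a per-tree R2 can then be computed locally once the staged predictions have been pulled out of H2O (for example with as_data_frame() in Python). The sketch below uses plain Python and made-up numbers. It assumes the predictions have already been arranged as one list per stage, which is not necessarily how the H2O frame is laid out, and it assumes R2 means 1 - SS_res / SS_tot. The function names are hypothetical.

```python
def r_squared(predicted, actual):
    """R^2 = 1 - SS_res / SS_tot for two equal-length sequences of numbers."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def r_squared_per_stage(staged_predictions, actual):
    """staged_predictions[k] holds the predictions made with the first k + 1 trees."""
    return [r_squared(stage, actual) for stage in staged_predictions]


# Made-up numbers standing in for three stages of predictions on four test rows.
actual = [1.0, 2.0, 3.0, 4.0]
staged = [
    [2.5, 2.5, 2.5, 2.5],  # after 1 tree
    [2.0, 2.0, 3.0, 3.0],  # after 2 trees
    [1.0, 2.0, 3.0, 4.0],  # after 3 trees
]
scores = r_squared_per_stage(staged, actual)  # roughly [0.0, 0.6, 1.0]
```

Plotting scores against the stage index gives the same kind of "metric versus number of trees" curve that predict.gbm with different n.trees values would produce.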