If I understand catboost correctly, we need to tune nrounds, just like in xgboost, using CV. I see the following code in the official tutorial (In [8]):
params_with_od <- list(iterations = 500,
loss_function = 'Logloss',
train_dir = 'train_dir',
od_type = 'Iter',
od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)
This results in best iterations = 211.
My questions are:
- Is it correct that this command uses test_pool to choose the best iterations instead of using cross-validation?
- If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
Catboost is validating against a held-out data set to determine the optimum number of iterations. Strictly speaking this is a single train/validation split, not k-fold cross validation. Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write
When you execute catboost.train(train_pool, test_pool, params_with_od), train_pool is used for training and test_pool is used to determine the optimum number of iterations: the loss is evaluated on test_pool after each iteration, and the overfitting detector decides when to stop.
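To make the role of test_pool concrete, here is a minimal sketch of an Iter-style stopping rule in base R. It assumes the rule "stop once od_wait iterations have passed since the iteration with the best validation loss", which is how the documentation describes the Iter detector. The function iter_detector and the loss values are hypothetical and made up for illustration; this is not CatBoost's actual implementation.

```r
# Hypothetical helper: applies an Iter-style rule to a vector of
# per-iteration validation losses (lower is better).
iter_detector <- function(val_loss, od_wait) {
  best_iter <- 1
  best_loss <- val_loss[1]
  stopped_at <- length(val_loss)  # default: ran through all iterations
  for (i in seq_along(val_loss)) {
    if (val_loss[i] < best_loss) {
      # New best validation loss: remember where it occurred.
      best_loss <- val_loss[i]
      best_iter <- i
    } else if (i - best_iter >= od_wait) {
      # No improvement for od_wait iterations: stop training here.
      stopped_at <- i
      break
    }
  }
  list(best_iter = best_iter, stopped_at = stopped_at)
}

# Made-up validation loss curve: improves until iteration 5, then worsens.
val_loss <- c(0.70, 0.60, 0.55, 0.52, 0.50, 0.51, 0.53, 0.54, 0.56, 0.58)
res <- iter_detector(val_loss, od_wait = 3)
print(res$best_iter)   # iteration with the lowest validation loss
print(res$stopped_at)  # iteration at which the rule triggers
```

The point of the sketch is that the chosen iteration depends directly on the losses measured on the validation data, so that data has already influenced the model.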
Now you are right to be confused, since later on in the tutorial they again use test_pool and the fitted model to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector IncToDec):
This might be bad practice. They might get away with it with their IncToDec overfitting detector (I am not familiar with the mathematics behind it), but for the Iter type overfitting detector you would need separate train, validation and test data sets, because the number of iterations has already been chosen using test_pool, so predictions on it are no longer an unbiased check. If you want to be on the safe side, do the same for the IncToDec overfitting detector. However, it is only a tutorial showing the functionality, so I wouldn't be too pedantic about which data they have already used for what.
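A three-way split could be sketched in base R as follows. The row count, the seed and the 60/20/20 proportions are assumptions chosen for illustration; the index vectors would then be used to build the training pool, the pool passed as test_pool to catboost.train, and a final pool that is only touched for the last evaluation.

```r
set.seed(42)       # arbitrary seed so the split is reproducible
n <- 1000          # hypothetical number of rows in the data set
idx <- sample(n)   # random permutation of the row indices 1..n

# Assumed 60/20/20 proportions; adjust as needed.
n_train <- floor(0.6 * n)
n_valid <- floor(0.2 * n)

train_idx <- idx[1:n_train]                          # fit the model
valid_idx <- idx[(n_train + 1):(n_train + n_valid)]  # choose the number of iterations
test_idx  <- idx[(n_train + n_valid + 1):n]          # final, untouched evaluation

print(c(length(train_idx), length(valid_idx), length(test_idx)))
```

Since the three index sets do not overlap, the rows used to pick the number of iterations never appear in the final evaluation.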
Here is a link with a little more detail on the overfitting detectors: https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/