How to choose the nrounds using `catboost`?


If I understand catboost correctly, we need to tune nrounds just like in xgboost, using CV. I see the following code in the official tutorial (In [8]):

params_with_od <- list(iterations = 500,
                       loss_function = 'Logloss',
                       train_dir = 'train_dir',
                       od_type = 'Iter',
                       od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)

This results in best iterations = 211.

My questions are:

  • Is it correct that this command uses the test_pool to choose the best iterations instead of using cross-validation?
  • If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?

There are 3 answers

ftiaronsem

Catboost uses held-out validation data to determine the optimum number of iterations. Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write:

train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'

train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])

When you execute catboost.train(train_pool, test_pool, params_with_od), train_pool is used for training and test_pool is used as the evaluation set to determine the optimum number of iterations via the overfitting detector.

Now you are right to be confused, since later on in the tutorial they again use test_pool together with the fitted model to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector, IncToDec):

prediction_best <- catboost.predict(model_best, test_pool, type = 'Probability')

This might be bad practice. They might get away with it with the IncToDec overfitting detector - I am not familiar with the mathematics behind it - but for the Iter type overfitting detector you would need separate train, validation and test data sets (and if you want to be on the safe side, do the same for the IncToDec overfitting detector). However, it is only a tutorial showing the functionality, so I wouldn't be too pedantic about which data they have already used and how.
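For illustration, here is a minimal sketch of such a three-way split with the Iter detector. The 70/15/15 proportions, the variable names and the reuse of params_with_od are assumptions made for this sketch; only the catboost calls themselves mirror the tutorial.

# Split the original training data into train / validation / holdout test sets
set.seed(42)
n <- nrow(train)
idx <- sample(seq_len(n))
train_idx <- idx[1:floor(0.7 * n)]
valid_idx <- idx[(floor(0.7 * n) + 1):floor(0.85 * n)]
hold_idx  <- idx[(floor(0.85 * n) + 1):n]

train_pool   <- catboost.from_data_frame(data = train[train_idx, -target], target = train[train_idx, target])
valid_pool   <- catboost.from_data_frame(data = train[valid_idx, -target], target = train[valid_idx, target])
holdout_pool <- catboost.from_data_frame(data = train[hold_idx, -target],  target = train[hold_idx, target])

# The validation pool drives the Iter overfitting detector ...
model <- catboost.train(train_pool, valid_pool, params_with_od)
# ... and the untouched holdout pool is used only for the final evaluation
prediction <- catboost.predict(model, holdout_pool, type = 'Probability')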

Here a link to a little more detail on the overfitting detectors: https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/

Lucas
  1. It is a very poor decision to base your number of iterations on a single test_pool and the best iteration reported by catboost.train(). In doing so, you are tuning your parameters to one specific test set and your model will not generalize well to new data. You are therefore correct in presuming that, like with XGBoost, you need to apply CV to find the optimal number of iterations.
  2. There is indeed a CV function in catboost. What you should do is specify a large number of iterations and stop the training after a certain number of rounds without improvement by using the early_stopping_rounds parameter. Unlike LightGBM, unfortunately, catboost doesn't seem to have an option for automatically returning the optimal number of boosting rounds after CV to apply in catboost.train(). Therefore, it requires a bit of a workaround. Here is an example which should work:
    library(catboost)
    library(data.table)

    n_cores <- parallel::detectCores()  # number of threads to use

    parameter <- list(
      thread_count = n_cores,
      loss_function = "RMSE",
      eval_metric = "RMSE",             # eval_metric takes a single metric
      custom_loss = c("MAE", "R2"),     # additional metrics to track
      iterations = 10^5,                # train up to 10^5 rounds
      early_stopping_rounds = 100       # stop after 100 rounds of no improvement
    )

    # Apply 6-fold CV
    cv_result <- catboost.cv(
      pool = train_pool,
      fold_count = 6,
      params = parameter
    )

    # Transform output to a data.table
    setDT(cv_result)
    cv_result[, iterations := .I]
    # Order from lowest to highest test RMSE
    setorder(cv_result, test.RMSE.mean)
    # Select the number of iterations with the lowest RMSE
    parameter$iterations <- cv_result[1, iterations]

    # Train the final model with the optimal number of iterations
    model <- catboost.train(
      learn_pool = train_pool,
      test_pool = test_pool,
      params = parameter
    )
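As a small usage note (not part of the original answer), the retrained model can then be scored on the held-out test_pool in the usual way:

    prediction <- catboost.predict(model, test_pool)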

Jiaxiang

I think this is a general question for xgboost and catboost. The choice of nrounds goes hand in hand with the choice of learning rate. Thus, I recommend a high number of rounds (1000+) and a low learning rate. After you find the best hyperparameters, retry with a lower learning rate to check that the hyperparameters you chose are stable.
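A minimal sketch of that strategy; the learning rate, iteration count and od_wait values below are assumptions chosen only for illustration:

params <- list(iterations = 5000,        # deliberately generous budget
               learning_rate = 0.03,     # low learning rate
               loss_function = 'Logloss',
               od_type = 'Iter',
               od_wait = 100)            # stop after 100 rounds without improvement
model <- catboost.train(train_pool, test_pool, params)
# once the other hyperparameters look good, lower learning_rate further
# and re-run to check that the chosen settings remain stable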

And I find @nikitxskv's answer misleading.

  1. In the R tutorial, In [12] just chooses learning_rate = 0.1 without trying multiple values. Thus, there is no hint about nrounds tuning.
  2. Actually, In [12] just uses the function expand.grid to find the best hyperparameters. It operates on the selection of depth, gamma and so on.
  3. And in practice, we don't use this approach to find a proper nrounds (it takes too long).

And now for the two questions.

Is it correct that this command uses the test_pool to choose the best iterations instead of using cross-validation?

Yes, but you can use CV.

If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?

That depends on you. If you are strongly averse to boosting overfitting, I recommend you try it. There are a lot of packages that address this problem; I recommend the tidymodels packages.