How to understand nfold and nrounds in R's xgboost package

I am trying to use R's xgboost package, but one thing confuses me. In the xgboost manual, under the xgb.cv function, it says:

The original sample is randomly partitioned into nfold equal size subsamples.

Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.

The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data.

And this is the code in the manual:

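library(xgboost)
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
# 5-fold cross-validation, 3 boosting rounds, reporting RMSE and AUC
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse", "auc"),
             max_depth = 3, eta = 1, objective = "binary:logistic")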
print(cv, verbose=TRUE)

And the result is:

##### xgb.cv 5-folds
call: xgb.cv(data = dtrain, nrounds = 3, nfold = 5, metrics = list("rmse", 
    "auc"), nthread = 2, max_depth = 3, eta = 1, objective = "binary:logistic")
params (as set within xgb.cv):
  nthread = "2", max_depth = "3", eta = "1", objective = "binary:logistic", 
eval_metric = "rmse", eval_metric = "auc", silent = "1"
  cb.print.evaluation(period = print_every_n, showsd = showsd)
niter: 3
 iter train_rmse_mean train_rmse_std train_auc_mean train_auc_std test_rmse_mean test_rmse_std test_auc_mean test_auc_std
1       0.1623756    0.002693092      0.9871108  1.123550e-03      0.1625222   0.009134128     0.9870954 0.0045008818
2       0.0784902    0.002413883      0.9998370  1.317346e-04      0.0791366   0.004566554     0.9997756 0.0003538184
3       0.0464588    0.005172930      0.9998942  7.315846e-05      0.0478028   0.007763252     0.9998902 0.0001347032

Let's say nfold=5 and nrounds=2. It means the data is split into 5 parts of equal size, and the algorithm will generate 2 trees.

My understanding is: each subsample has to be the validation set once. While one subsample is the validation set, 2 trees are generated. So we will have 5 sets of trees (each set has 2 trees because nrounds=2). Then we check whether the evaluation metric varies a lot or not.

But the result does not look like that. Each nrounds value has one line of evaluation metrics, which looks like it already includes the 'cross-validation' part. So, if 'The cross-validation process is then repeated nrounds times', then how can it be 'with each of the nfold subsamples used exactly once as the validation data'? Could anyone please explain this to me?
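One way to reconcile the two readings is sketched below in base R, without calling xgboost. It assumes that each fold trains its own model for nrounds rounds, and that each printed line is the metric after round i, summarised across the nfold models. Everything in it is made up for illustration: the data `y`, the names `fold_id` and `test_rmse`, and the toy "booster", which adds a constant each round instead of fitting a real tree. It only mimics the shape of the output, not xgboost's internals.

```r
# Toy illustration only: the "booster" adds a constant each round
# (think of a depth-0 tree). This is NOT what xgboost actually fits.
set.seed(1)
n       <- 100
y       <- rnorm(n, mean = 3)   # hypothetical regression target
nfold   <- 5
nrounds <- 2
eta     <- 0.5                  # learning rate for the toy booster

# Randomly assign each row to one of nfold equal-size folds.
fold_id <- sample(rep(seq_len(nfold), length.out = n))

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# One row per boosting round, one column per fold.
test_rmse <- matrix(NA_real_, nrow = nrounds, ncol = nfold)

for (k in seq_len(nfold)) {
  train_y <- y[fold_id != k]    # the other nfold - 1 folds are training data
  test_y  <- y[fold_id == k]    # fold k is held out exactly once
  pred <- 0                     # the model for fold k starts from scratch
  for (i in seq_len(nrounds)) {
    # "Tree" i: a constant fitted to the current training residuals.
    pred <- pred + eta * mean(train_y - pred)
    test_rmse[i, k] <- rmse(test_y, pred)
  }
}

# One line per round, summarised across the nfold models.
evaluation_log <- data.frame(
  iter           = seq_len(nrounds),
  test_rmse_mean = rowMeans(test_rmse),
  test_rmse_std  = apply(test_rmse, 1, sd)
)
print(evaluation_log)
```

Under this reading there are nfold models of nrounds trees each (5 sets of 2 trees in the example above), and the table has nrounds rows because every row averages the same round over all folds. The spread statistic here is R's sample `sd`, which may not match how xgboost computes its `_std` columns.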
