I am trying to run `dask_ml.xgboost` using `eval_set` to allow for early stopping, in an attempt to avoid overfitting. Currently, I have a sample dataset, shown in the example below:
```python
from dask.distributed import Client
from dask_ml.datasets import make_classification_df
from dask_ml.xgboost import XGBClassifier

if __name__ == "__main__":
    n_train_rows = 4_000
    n_val_rows = 1_000

    client = Client()
    print(client)

    # Generate balanced data for binary classification
    X_train, y_train = make_classification_df(
        n_samples=n_train_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )
    X_val, y_val = make_classification_df(
        n_samples=n_val_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )

    clf = XGBClassifier(objective="binary:logistic")

    # train
    clf.fit(
        X_train,
        y_train,
        eval_metric="error",
        eval_set=[
            (X_train.compute(), y_train.compute()),
            (X_val.compute(), y_val.compute()),
        ],
        early_stopping_rounds=5,
    )

    # Make predictions
    y_pred = clf.predict(X_val).compute()
    assert len(y_pred) == len(y_val)

    client.close()
```
All of `X_train`, `y_train`, `X_val` and `y_val` are Dask `DataFrame`s. I cannot pass `eval_set` as a nested list of Dask `DataFrame`s; the entries need to be pandas `DataFrame`s, which is why I call `.compute()` on each of them, i.e. `eval_set=[(X_train.compute(), y_train.compute()), (X_val.compute(), y_val.compute())]`.
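For context, a minimal sketch of what those `.compute()` calls do (the toy frame and names below are illustrative, not from the example above): each call materializes a lazy Dask `DataFrame` into a concrete pandas object in the client process, and those in-memory pandas objects are what end up inside `eval_set`.

```python
# Illustrative sketch with a tiny toy frame (not the question's data):
# .compute() turns a lazy, partitioned Dask DataFrame into an in-memory
# pandas object on the client, which is what fit() receives via eval_set.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"f0": range(10), "target": [0, 1] * 5})
ddf = dd.from_pandas(pdf, npartitions=2)     # lazy, partitioned

X_dd, y_dd = ddf[["f0"]], ddf["target"]
X_pd, y_pd = X_dd.compute(), y_dd.compute()  # pandas.DataFrame / pandas.Series

eval_set = [(X_pd, y_pd)]                    # plain pandas, no longer lazy
print(type(X_dd).__module__, "->", type(X_pd).__module__)
```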
However, when I run the above code (with the computed pandas `DataFrame`s in `eval_set`), I get this warning:
```
<Client: 'tcp://127.0.0.1:12345' processes=4 threads=12, memory=16.49 GB>
/home/username/.../distributed/worker.py:3373: UserWarning: Large object of size 2.16 MB detected in task graph:
  {'dmatrix_kwargs': {}, 'num_boost_round': 100, 'ev ... ing_rounds': 5}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good

  warnings.warn(
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL got new rank 0
task NULL got new rank 1
task NULL got new rank 2
task NULL got new rank 3
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
```
This code runs through to completion and predictions are generated. However, the `clf.fit(...)` call is producing this `UserWarning`.
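For reference, the pattern the warning recommends is the generic `client.scatter` idiom from the warning text itself; the sketch below only illustrates that idiom with a stand-in object, and is not something `dask_ml.xgboost`'s `fit` is documented to accept for `eval_set`.

```python
# Generic illustration of the client.scatter idiom from the warning text;
# how (or whether) this applies to eval_set is exactly what the question asks.
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()

    big_data = list(range(1_000_000))      # stand-in for a large local object

    # bad: the large object gets serialized into the task graph itself
    future = client.submit(len, big_data)

    # good: ship the object to the workers once, then pass the future around
    big_future = client.scatter(big_data)
    future = client.submit(len, big_future)

    print(future.result())                 # 1000000
    client.close()
```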
Additional notes
- In my use case, the number of rows in the training and validation splits used in the example here reflects sizes after sampling from the overall data. Unfortunately, the overall data splits required for training (+ hyperparameter tuning) `dask_ml.xgboost` are a few orders of magnitude larger (in number of rows, based on training and validation learning curves, per `dask_ml` recommendations, generated using standard `XGBoost` (using `from xgboost import XGBClassifier`) without `dask_ml`'s version of `XGBoost` (1, 2)), so I can't compute those and bring them into memory as pandas `DataFrame`s for distributed `XGBoost` training.
- The number of features used in the example here is 50. In the real use case, I arrived at this number after dropping as many features as possible.
- Code is run on a local machine.
Question
Is there a correct/recommended approach to run `dask_ml`'s `xgboost` with `eval_set` consisting of Dask `DataFrame`s?
EDIT
Note that the training split is also being passed in `eval_set` (in addition to the validation split), with the intention of generating learning curves using the output of model training (see here).
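For context on that intention, this is roughly how the learning-curve data is read back when using the plain (non-Dask) `xgboost.XGBClassifier`, which the notes above mention was used to generate the earlier curves. A minimal sketch with toy data; where `eval_metric`/`early_stopping_rounds` go depends on the xgboost version (constructor in newer releases, `fit()` in older ones):

```python
# Minimal sketch of learning curves with plain xgboost (not dask_ml), using
# toy in-memory data; parameter placement depends on the xgboost version.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, n_features=50, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=2)

clf = XGBClassifier(objective="binary:logistic", eval_metric="error",
                    early_stopping_rounds=5)
clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_val, y_val)])  # train + validation

history = clf.evals_result()
train_error = history["validation_0"]["error"]  # curve for the training split
val_error = history["validation_1"]["error"]    # curve for the validation split
print(len(train_error), len(val_error))         # one value per boosting round
```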