How to reduce the `dask_ml.xgboost` worker memory consumption?


I've been testing the dask_ml.xgboost regressor on a synthetic 10 GB dataset. During training, the memory usage of the workers exceeds what is available on my local laptop. I'm aware that I could run on a hosted Dask cluster with more memory, or sample the data (and ignore the rest) before training, as in the sketch below. But is there a different solution? I tried limiting the number and depth of the generated trees, subsampling rows and columns, and changing the tree construction algorithm, but the workers still run out of memory.
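For context, the sampling fallback I'd like to avoid would look roughly like this (a minimal sketch; the 10% fraction is arbitrary and just illustrates the idea):

import dask.dataframe as dd

# Hypothetical fallback: train on a random subset of the rows and ignore the rest.
# The 10% fraction is arbitrary, only for illustration.
ddf = dd.read_csv('10GB_float.csv')
sampled = ddf.sample(frac=0.1, random_state=0)
X = sampled[sampled.columns.difference(['float_1'])]
y = sampled['float_1']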

Given a fixed memory allocation, is there a way to reduce the memory consumption of each worker when training dask_ml.xgboost?

Here is a code snippet:

import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.xgboost import XGBRegressor

# Local cluster with each worker capped at 7 GB
client = Client(memory_limit='7GB')

# Lazily read the 10 GB CSV; 'float_1' is the target, the rest are features
ddf = dd.read_csv('10GB_float.csv')
X = ddf[ddf.columns.difference(['float_1'])].persist()
y = ddf['float_1'].persist()

# Deliberately small trees and aggressive subsampling, yet the workers still run out of memory
reg = XGBRegressor(
    objective='reg:squarederror', n_estimators=10, max_depth=2, tree_method='hist',
    subsample=0.001, colsample_bytree=0.5, colsample_bylevel=0.5,
    colsample_bynode=0.5, n_jobs=-1)

reg.fit(X, y)
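For reference, 10GB_float.csv is purely synthetic (as float64, 26,758,707 × 50 × 8 bytes is about 10.7 GB in memory). A rough sketch of how a file like it could be generated; the column names, chunk size, and to_csv options are illustrative assumptions, not my exact generation script:

import dask.array as da
import dask.dataframe as dd

# Sketch: ~26.7 million rows x 50 columns of uniform float64 values in [0, 1).
# Column names and chunk size are placeholders.
arr = da.random.random((26_758_707, 50), chunks=(1_000_000, 50))
columns = [f'float_{i}' for i in range(1, 51)]
ddf = dd.from_dask_array(arr, columns=columns)
ddf.to_csv('10GB_float.csv', single_file=True, index=False)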

The synthetic dataset 10GB_float.csv has 50 columns and 26,758,707 rows containing random floats (float64) ranging from 0 to 1. Below are the cluster details:

Cluster

    Workers: 4
    Cores: 12
    Memory: 28.00 GB
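If it helps, I believe the implicit cluster created by Client(memory_limit='7GB') is roughly equivalent to the following explicit setup (the 4 workers × 3 threads split is my assumption, based on the dashboard numbers above):

from dask.distributed import Client, LocalCluster

# Explicit version of the local cluster: 4 workers x 3 threads = 12 cores,
# 7 GB per worker = 28 GB total (matching the dashboard readout above)
cluster = LocalCluster(n_workers=4, threads_per_worker=3, memory_limit='7GB')
client = Client(cluster)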

And some information about my local laptop:

Memory: 31.1 GiB
Processor: Intel® Core™ i7-8750H CPU @ 2.20GHz × 12 

Additionally, here are the parameters of XGBRegressor (using .get_params()):

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 0.5,
 'colsample_bynode': 0.5,
 'colsample_bytree': 0.5,
 'gamma': 0,
 'importance_type': 'gain',
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 2,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 10,
 'n_jobs': -1,
 'nthread': None,
 'objective': 'reg:squarederror',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 0.001,
 'verbosity': 1,
 'tree_method': 'hist'}

Thank you very much for your time!
