Background: the training set has ~100 million rows and about 50 columns, and I have already downcast every column to the smallest dtype that fits. Even so, the DataFrame is around 8-10 GB once loaded.
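For reference, this is roughly what I mean by downcasting and how I measure the in-memory size (a minimal sketch; the parquet path is the same one used in the code further down):

import pandas as pd

train = pd.read_parquet('train_latest', engine='pyarrow')

# Downcast numeric columns to the smallest dtype that still holds the values
for col in train.select_dtypes(include='float').columns:
    train[col] = pd.to_numeric(train[col], downcast='float')
for col in train.select_dtypes(include='integer').columns:
    train[col] = pd.to_numeric(train[col], downcast='integer')

# In-memory size after downcasting -- this is where the 8-10 GB figure comes from
print(train.memory_usage(deep=True).sum() / 1024**3, 'GB')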
I run training on AWS EC2 instances (one with 36 CPUs + 72 GB RAM, another with 16 CPUs + 128 GB RAM).
Problems:
1. Loading the data into a pandas DataFrame and training XGBoost with the default config: memory blows up almost immediately.
2. Using a Dask DataFrame with a distributed client and dask_ml.xgboost (cluster setup sketched just after this list): it runs a bit longer, but I get worker-failed warnings and progress stalls.
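The distributed setup for attempt 2 looks roughly like this; the worker count and per-worker memory limit below are illustrative values for the 36-CPU / 72 GB box, not my exact config:

from dask.distributed import Client, LocalCluster

# Local cluster on a single EC2 instance; numbers are examples, not the exact values I used
cluster = LocalCluster(n_workers=6, threads_per_worker=6, memory_limit='12GB')
client = Client(cluster)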
So, is there a way to estimate how much RAM I need to make sure it is enough?
Here is the code:
import pandas as pd
import dask.dataframe as ddf
import dask_ml.xgboost as dxgb

# Load the full parquet into pandas, then split it into Dask partitions
train = pd.read_parquet('train_latest', engine='pyarrow')
train = ddf.from_pandas(train, npartitions=72)

X, y = train[feats], train[label]                          # feats / label defined elsewhere
X_train, y_train, X_test, y_test = make_train_test(X, y)   # custom function to split train/test

model = dxgb.XGBClassifier(n_estimators=1000,
                           verbosity=1,
                           n_jobs=-1,
                           max_depth=10,
                           learning_rate=0.1)
model.fit(X_train, y_train)
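One thing I notice about the snippet above: the whole file is read into a single pandas DataFrame before from_pandas splits it into 72 partitions, so the full ~8-10 GB is resident before Dask does anything. I don't know whether letting Dask read the parquet lazily would actually lower the peak, but this is the variant I mean (same path, everything else unchanged):

import dask.dataframe as ddf

# Read the parquet lazily with Dask instead of materializing it in pandas first
train = ddf.read_parquet('train_latest', engine='pyarrow')
# ...the rest (feature selection, make_train_test, dxgb.XGBClassifier) stays the same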