Use of validation_frame in H2O AutoML

Just started with H2O AutoML so apologies in advance if I have missed something basic.

I have a binary classification problem where the data are observations from K years. I want to train on the first K-1 years, then tune the models and select the best one explicitly based on the remaining year K.

If I switch off cross-validation (with nfolds=0) to avoid randomly blending years across the N folds, and define the year-K data as the validation_frame, then no ensemble is created (as expected according to the documentation), and the ensemble is something I do need.

If I train with cross-validation (default nfolds) and define the validation frame to be the year-K data:

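from h2o.automl import H2OAutoML

# k_minus_1_years and k_year are H2OFrames (k-1_years is not a valid Python name)
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)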

then, according to http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html, the validation_frame is ignored: "...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."

Is there a way to have the models tuned and the best one (ensemble or not) selected based on the year-K data only, while still getting the ensemble of models in the output?

Thanks a lot!

There is 1 answer

Neema Mashayekhi On BEST ANSWER

You don't want cross-validation (CV) if you are dealing with time-series (non-IID) data, since you don't want folds from the future to predict the past.

I would explicitly add nfolds=0 so that CV is disabled in AutoML:

# k_minus_1_years and k_year are H2OFrames (k-1_years is not a valid Python name)
aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)

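To still get an ensemble, pass a blending_frame to train(). The stacked ensembles are then trained by blending on that holdout frame instead of on cross-validation predictions, which also works for time-series data. See more info here.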
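A minimal sketch of how the pieces could fit together, not a definitive recipe. The year layout is my own assumption rather than something stated above: the latest year is used for validation, the second-latest for blending, and everything earlier for training, so the blending holdout does not overlap the training rows. The helper names split_rows_by_year and run_automl are hypothetical. Passing the same year-K frame as leaderboard_frame is also an assumption, made so that the leaderboard ranking is based on year K.

```python
from typing import Dict, List, Sequence


def split_rows_by_year(years: Sequence[int]) -> Dict[str, List[int]]:
    """Return row indices for a train / blend / validation split by year.

    Assumed layout (hypothetical): latest year -> validation,
    second-latest year -> blending, all earlier years -> training.
    """
    distinct = sorted(set(years))
    if len(distinct) < 3:
        raise ValueError("need at least three distinct years")
    valid_year, blend_year = distinct[-1], distinct[-2]
    split: Dict[str, List[int]] = {"train": [], "blend": [], "valid": []}
    for i, yr in enumerate(years):
        if yr == valid_year:
            split["valid"].append(i)
        elif yr == blend_year:
            split["blend"].append(i)
        else:
            split["train"].append(i)
    return split


def run_automl(train, blend, valid, x, y, max_runtime_secs=3600):
    """Run AutoML with CV disabled; train/blend/valid are H2OFrames built by the caller."""
    # Imported lazily: this needs the h2o package and a running H2O cluster.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=1, nfolds=0)
    aml.train(
        x=x,
        y=y,
        training_frame=train,
        validation_frame=valid,    # used for early stopping because nfolds=0
        blending_frame=blend,      # holdout used to train the stacked ensembles
        leaderboard_frame=valid,   # assumption: rank models on year K
    )
    return aml
```

Since year K then drives both early stopping and the leaderboard ranking, its metrics will be somewhat optimistic; that seems acceptable here only because selecting on year K is exactly what the question asks for.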

Additionally, since you are dealing with time-series data, I would recommend adding time-series transformations (e.g. lags), so that your model gets information from previous years and their aggregates (e.g. a weighted moving average).
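As a rough illustration of those transformations, here is a plain-Python sketch of a lag feature and a weighted moving average over a yearly series. The function names are hypothetical, and the choice to exclude the current observation from the average (so a row only sees earlier years) is my assumption. In practice you would compute these per entity, for example with pandas, before building the H2OFrame.

```python
from typing import List, Optional, Sequence


def lag(values: Sequence[float], k: int) -> List[Optional[float]]:
    """Shift a series back by k steps; the first k entries have no history."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return [None] * min(k, len(values)) + list(values[:-k])


def weighted_moving_average(
    values: Sequence[float], weights: Sequence[float]
) -> List[Optional[float]]:
    """Weighted average of the len(weights) previous observations.

    The current observation is excluded, and weights[-1] applies to the
    most recent previous value. Entries without enough history are None.
    """
    w = len(weights)
    total = float(sum(weights))
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i < w:
            out.append(None)
        else:
            window = values[i - w:i]  # oldest first
            out.append(sum(v * wt for v, wt in zip(window, weights)) / total)
    return out


yearly = [10, 20, 30, 40]
lag_1 = lag(yearly, 1)                            # previous year's value
wma_2 = weighted_moving_average(yearly, [1, 2])   # recent year weighted double
```

The leading None entries would need to be handled, for instance by dropping those rows or leaving them as missing values, before training.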