Half a month ago, I was able to use Optuna without a problem for a 48-hour study with around 150+ trials. Yesterday I tried Optuna again on the same model, same dataset, same batch size and same device (A100 40GB or V100 32GB), but I always get torch.cuda.OutOfMemoryError: CUDA out of memory after around 16 trials.
Following some of the answers (here and here) on SO, I tried enabling GC with study.optimize(objective, timeout=172800, gc_after_trial=True). I also tried adding gc.collect() and cuda.empty_cache() after each epoch, but these didn't help. I even tried significantly reducing the sizes of the hidden layers (e.g. from 256 to 128) and the size of the dataset (e.g. from 50000 to 20000), but these didn't help either. What I didn't try is reducing the batch sizes in the Optuna study, but I guess that wouldn't resolve the core issue; sooner or later I would still get an OOM error.
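For context, the study setup looks roughly like the sketch below (a minimal toy example, not my actual model or data; it only shows where gc_after_trial=True, gc.collect() and cuda.empty_cache() were applied):

    import gc

    import optuna
    import torch
    import torch.nn as nn


    def objective(trial):
        hidden = trial.suggest_int("hidden", 128, 256)
        lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
        model = nn.Sequential(nn.Linear(32, hidden), nn.ReLU(), nn.Linear(hidden, 1)).cuda()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        x = torch.randn(2048, 32, device="cuda")
        y = torch.randn(2048, 1, device="cuda")

        for epoch in range(10):
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
            gc.collect()               # tried: collect garbage after each epoch
            torch.cuda.empty_cache()   # tried: release PyTorch's cached CUDA blocks

        return loss.item()


    study = optuna.create_study(direction="minimize")
    # tried: let Optuna run the garbage collector between trials
    study.optimize(objective, timeout=172800, gc_after_trial=True)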
I can train (300 epochs and more) and evaluate the model using the same parameters and dataset; I have only recently started encountering this out-of-VRAM problem, and only when doing Optuna studies.
I would like to know if there are any other general approaches to avoid OOM during an Optuna study. It is really weird because my first use of Optuna was just fine.
I had the same problem. The thing with gc.collect() and cuda.empty_cache() is that these methods don't remove the model from your GPU; they just clean the cache.
So you need to delete your model from CUDA memory after each trial, and probably clean the cache as well; without doing this, every trial leaves another model sitting on your CUDA device.
So I put these lines at the end of the objective function:
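Something along these lines (a sketch, assuming gc and torch are already imported and the network is held in a local variable called model inside objective):

    # at the end of objective(), after computing the value to return:
    del model                  # drop the last reference to the model's GPU tensors
    gc.collect()               # collect the now-unreferenced Python objects
    torch.cuda.empty_cache()   # release PyTorch's cached CUDA memory back to the driver

If the optimizer or any batches of data are also held on the GPU in variables that outlive the trial, deleting those references in the same way may help too.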