How can I retry FAIL trials in Optuna in a second run?


I am running a grid search with Optuna, but trials that ended in the FAIL state are not retried in a second run. Instead, trials that are already COMPLETE are needlessly repeated.

Here I describe the two problems separately:

  1. When a trial fails (e.g. for lack of computational resources), it is not retried when I launch the grid search (the Python file) a second time. This can be tested with the following self-contained code, in which I simulate a failure by raising an exception. Comment out the marked lines and run the file a second time to see that the combination x=2, y=2 is not retried.
import time
import optuna
from optuna.storages import RetryFailedTrialCallback
import numpy as np


def objective(trial):
    # get value
    params = {
                'x': trial.suggest_categorical('x', [0, 1, 2, 3]),
                'y': trial.suggest_categorical('y', [0, 1, 2, 3])
            }
    # print it
    print('Testing with x=' + str(params['x']), 'y=' + str(params['y']))

    ########################################
    # COMMENT THIS SECTION AFTER FIRST RUN #
    ########################################
    if params['x'] == 2 and params['y'] == 2:
        raise ValueError("x==2, y==2")
    ########################################

    # return
    return params['x'] ** 2 - params['y']



def optuna_search_space():
    # define search space
    return {
        'x': range(3),
        'y': range(3),
    }



def optuna_grid():
    # define URL
    URL = 'mysql://<USER>:<PASSWORD>@<IP>:<PORT>'
    # get search space
    search_space = optuna_search_space()
    # define storage
    storage = optuna.storages.RDBStorage(
        url=f"{URL}/prove_optuna",
        failed_trial_callback=RetryFailedTrialCallback(max_retry=3),
    )
    # define study
    study = optuna.load_study(
        study_name="test1",
        sampler = optuna.samplers.GridSampler(search_space),
        storage = storage,
    )
    # run
    study.optimize(objective)
    # print
    print(study.best_trial)



if __name__ == "__main__":
    # run
    optuna_grid()
  2. When I re-run the code, however, it repeats one or more trials that have already completed. I don't want this, as it wastes computational resources.

The Optuna Dashboard shows that after several re-runs the combination (x=2, y=2) is never retried (even though it failed the first time), while the combination (x=0, y=1) has been tested several times (needlessly).
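For illustration, the desired behaviour can be sketched in plain Python, without Optuna: given the grid and a trial history, only the combinations that have no COMPLETE trial should be run again. This is a minimal sketch and not Optuna's own logic. The helper `combinations_to_run` and the `(params, state)` history format are hypothetical stand-ins for whatever the study's trial list provides.

```python
from itertools import product


def combinations_to_run(search_space, history):
    """Return the grid points that have no COMPLETE trial yet.

    search_space: dict mapping parameter name -> iterable of values.
    history: iterable of (params_dict, state_string) pairs; a hypothetical
             stand-in for the trial records stored by a study.
    """
    names = sorted(search_space)
    # Grid points that already finished successfully.
    done = {
        tuple(params[n] for n in names)
        for params, state in history
        if state == "COMPLETE"
    }
    # Every grid point, in a deterministic order.
    grid = product(*(search_space[n] for n in names))
    return [dict(zip(names, point)) for point in grid if point not in done]


# Same 3x3 grid as in the question; only (x=2, y=2) failed.
search_space = {"x": range(3), "y": range(3)}
history = [
    ({"x": x, "y": y}, "FAIL" if (x, y) == (2, 2) else "COMPLETE")
    for x in range(3)
    for y in range(3)
]

todo = combinations_to_run(search_space, history)
print(todo)  # [{'x': 2, 'y': 2}]
```

With Optuna, the history could presumably be built from `study.get_trials()` (each trial exposes `params` and `state`), and the remaining combinations could be queued with `study.enqueue_trial`. How that interacts with `GridSampler`'s own bookkeeping of visited grid points is an assumption and has not been verified here.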

Optuna Dashboard

How can I solve these problems?

Thank you
