TF2 object detection API issue with resuming training from saved checkpoint


I'm facing an issue with the TF2 Object Detection API that seems to have appeared overnight. I'm trying to resume training from a saved checkpoint. As usual, before resuming I changed the checkpoint path in the config file to point to where my checkpoints are saved, which has always worked.

Today it's throwing the error below. Apparently the checkpoint dir and the model dir can no longer be the same. The big problem is that if I change the model dir, training restarts from zero instead of from the last epoch, so I'm stuck. This only happens in TF2; I also tried with TF1 and it works fine.

    File "/usr/local/lib/python3.7/dist-packages/object_detection/utils/variables_helper.py", line 230, in ensure_checkpoint_supported
      (' Please set model_dir to a different path.')))
    RuntimeError: Checkpoint dir (/content/drive/MyDrive/Object_detection/training) and model_dir (/content/drive/MyDrive/Object_detection/training) cannot be same. Please set model_dir to a different path.


There are 2 answers

Armin Ghanbarzadeh On

I faced the same problem. It said that model_dir and the checkpoint dir could not be the same; however, if they are different, training just starts from the beginning.

It was due to a recent addition (May 7) of a check at the end of the file "research/object_detection/utils/variables_helper.py":

 if model_dir == checkpoint_path_dir:
    raise RuntimeError(
        ('Checkpoint dir ({}) and model_dir ({}) cannot be same.'.format(
            checkpoint_path_dir, model_dir) +
         (' Please set model_dir to a different path.')))
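
To see when this check fires, here is a minimal sketch. It assumes that checkpoint_path_dir is the directory containing the checkpoint named in the config and that both paths are made absolute before being compared; the excerpt above only shows the comparison itself. The helper name dirs_collide is hypothetical and is not part of the object_detection package.

```python
import os


def dirs_collide(fine_tune_checkpoint: str, model_dir: str) -> bool:
    """Hypothetical helper: True if the checkpoint sits directly in model_dir.

    Assumption: the directory of the checkpoint path is what gets compared
    with model_dir, after both are turned into absolute paths.
    """
    checkpoint_path_dir = os.path.abspath(os.path.dirname(fine_tune_checkpoint))
    return checkpoint_path_dir == os.path.abspath(model_dir)


# Pointing the config at a checkpoint inside model_dir triggers the condition.
print(dirs_collide("/content/training/ckpt-5", "/content/training"))
# A checkpoint kept in a separate pre-trained folder does not.
print(dirs_collide("/content/pre_trained_model/checkpoint/ckpt-0", "/content/training"))
```

Under these assumptions, the error in the question appears as soon as the checkpoint path in the config is edited to point at a checkpoint stored in the training directory itself.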

I managed to fix it by changing the check to something like:

 if model_dir == checkpoint_path_dir:
    pass
    # raise RuntimeError(
        # ('Checkpoint dir ({}) and model_dir ({}) cannot be same.'.format(
            # checkpoint_path_dir, model_dir) +
         # (' Please set model_dir to a different path.')))

I made this change after cloning the GitHub repository and before installing the object_detection package.

I believe you could also check out an earlier version of the repository, from before the check was added, with something like this (might need some editing to get it working):

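import os
import pathlib

# Clone the tensorflow models repository if it doesn't already exist
if "models" in pathlib.Path.cwd().parts:
  # Already inside the repository: move up until we are outside it
  while "models" in pathlib.Path.cwd().parts:
    os.chdir('..')
elif not pathlib.Path('models').exists():
  # Full clone (no --depth 1): a shallow clone has no older commits to check out
  !git clone https://github.com/tensorflow/models
  # Check out the last commit on master made before 2021-05-06
  !git -C models checkout $(git -C models rev-list -n 1 --before="2021-05-06 00:00:00" master)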

Jotunheim On
  • 'fine_tune_checkpoint' should point to the checkpoints in the 'pre_trained_model' folder;
  • 'model_dir' instead is the directory where YOU are saving your new checkpoints.

There is no need to manually change the folder. If there are any checkpoints in 'model_dir', training will resume from the latest one. If there are none, training will start from the checkpoint given by 'fine_tune_checkpoint' in the 'pre_trained_model' folder.
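
The rule above can be sketched in plain Python. This is only an illustration of the decision, not the API's own code: the real training loop relies on TensorFlow's checkpoint machinery, while this sketch just looks for files named ckpt-N.index. The function name pick_start_checkpoint is hypothetical.

```python
import re
from pathlib import Path


def pick_start_checkpoint(model_dir: str, fine_tune_checkpoint: str) -> str:
    """Illustrative only: prefer the newest checkpoint found in model_dir.

    Assumption: checkpoints in model_dir are saved as ckpt-<N>.index files.
    """
    pattern = re.compile(r"^ckpt-(\d+)\.index$")
    numbered = []
    directory = Path(model_dir)
    if directory.is_dir():
        for path in directory.iterdir():
            match = pattern.match(path.name)
            if match:
                numbered.append((int(match.group(1)), path))
    if numbered:
        # Resume: take the highest-numbered checkpoint and drop ".index".
        _, newest = max(numbered)
        return str(newest.with_suffix(""))
    # Nothing saved yet: start from the pre-trained checkpoint.
    return fine_tune_checkpoint
```

So on the first run the function returns the 'fine_tune_checkpoint' path, and on every later run with the same 'model_dir' it returns the most recent checkpoint saved there, without any change to the config.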