How to properly restore a checkpoint in TensorFlow object detection API with TF1?


I am fine-tuning SSD-MobileNetV3 Large and SSD-MobileDet-CPU on the COCO 2017 dataset, restricted to only the book class. I created a new dataset for this, inspected it, and it looks good. I have also modified the config file to my needs.

When I start training, the 'fine_tune_checkpoint' provided in the config file is simply ignored and training starts from scratch. If I instead place the checkpoint in the 'model_dir' directory, the trainer does try to restore it, but since I have a different number of classes it fails with an error. How can I make the training process restore the checkpoint properly?

I also tried the normal COCO dataset with all 90 classes: the 'fine_tune_checkpoint' is still ignored when I start training, but if I put the checkpoint in 'model_dir', it is restored properly.
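For context, I launch training with model_main.py along these lines (the paths are placeholders for my setup):

python object_detection/model_main.py \
    --pipeline_config_path=./pipeline.config \
    --model_dir=./training \
    --alsologtostderr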

My config file is as follows.

# SSDLite with MobileDet-CPU feature extractor.
# Reference: Xiong & Liu et al., https://arxiv.org/abs/2004.14525
# Trained on COCO, initialized from scratch.
#
# 0.45B MulAdds, 4.21M Parameters. Latency is 113ms on Pixel 1 CPU.
# Achieves 24.0 mAP on COCO14 minival dataset.
# Achieves 23.5 mAP on COCO17 val dataset.
#
# This config is TPU compatible.

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
        use_depthwise: true
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.97,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobiledet_cpu'
      min_depth: 16
      depth_multiplier: 1.0
      use_depthwise: true
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.97,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: false
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.75,
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: true
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 64
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 1
  num_steps: 800000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 0.8
          total_steps: 800000
          warmup_learning_rate: 0.13333
          warmup_steps: 100
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "./checkpoints/model.ckpt-400000"
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V1
  load_all_detection_checkpoint_vars: true
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "./tf_record_coco_books/coco_train.record"
  }
  label_map_path: "./tf_record_coco_books/label_map.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "./tf_record_coco_books/coco_val.record"
  }
  label_map_path: "./tf_record_coco_books/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
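For reference, my single-class label map at ./tf_record_coco_books/label_map.pbtxt follows the standard format:

item {
  id: 1
  name: 'book'
}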

There are 2 answers

Roman (Best Answer)

The issue arises from line 446 in model_lib.py:

load_pretrained = hparams.load_pretrained if hparams else False

because one of the previous commits changed hparams to None, so load_pretrained is always False. Setting it to True and reinstalling the object_detection library fixes the problem.
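A minimal sketch of the patch and the restore logic it gates (paraphrased from model_lib.py; the exact line number varies between commits):

# object_detection/model_lib.py, inside model_fn (TF1 training path).
# Recent commits pass hparams=None, so the fallback must be True for
# fine_tune_checkpoint to be honored:
load_pretrained = hparams.load_pretrained if hparams else True  # was: False

if train_config.fine_tune_checkpoint and load_pretrained:
  # Map checkpoint variables onto the model and initialize from them.
  asg_map = detection_model.restore_map(
      fine_tune_checkpoint_type=train_config.fine_tune_checkpoint_type,
      load_all_detection_checkpoint_vars=(
          train_config.load_all_detection_checkpoint_vars))
  # Keep only variables that exist in the checkpoint with matching shapes.
  available_var_map = (
      variables_helper.get_variables_available_in_checkpoint(
          asg_map,
          train_config.fine_tune_checkpoint,
          include_global_step=False))
  tf.train.init_from_checkpoint(train_config.fine_tune_checkpoint,
                                available_var_map)

After editing, reinstall the package (for example, pip install . from models/research) so the installed copy of object_detection picks up the change.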

I've mentioned this in the related GitHub issue: https://github.com/tensorflow/models/issues/9284

dnl_anoj

You have to specify a model_dir that is different from the directory from which you are loading the previously trained checkpoint.

At the very beginning of training, the TensorFlow Object Detection API training script (either the current model_main or the legacy train script) creates a new checkpoint corresponding to your new config in model_dir and then trains from that checkpoint. If that directory already contains the pre-trained checkpoint, it will indeed raise an error about the mismatched number of classes.
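For example (directory names are placeholders), keep the pre-trained weights outside of model_dir:

./checkpoints/model.ckpt-400000.*   # referenced by fine_tune_checkpoint in the config
./training/                         # fresh, empty model_dir passed to the training script

The training script then writes its own checkpoints to ./training, initializing them from the fine_tune_checkpoint weights where variable shapes match.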

If that doesn't work, you could also change the following field in your config file:

fine_tune_checkpoint_type: "detection"

to:

fine_tune_checkpoint_type: "fine_tune"

as noted in a current issue on the Object Detection API: https://github.com/tensorflow/models/issues/8892#issuecomment-680207038