Tensorflow object detection -- Increasing batch size leads to failure


I have been trying to train an object detection model using the tensorflow object detection API.

The network trains well when batch_size is 1. However, increasing the batch_size leads to the following error after some steps.

Network: Faster R-CNN

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 25000
            learning_rate: .00002
          }
          schedule {
            step: 50000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
}

Error:

INFO:tensorflow:Error reported to Coordinator: , ConcatOp : Dimensions of inputs should match: shape[0] = [1,841,600,3] vs. shape[3] = [1,776,600,3]
[[node concat (defined at /home/<>/.virtualenvs/dl4cv/lib/python3.6/site-packages/object_detection-0.1-py3.6.egg/object_detection/legacy/trainer.py:190) ]]
Errors may have originated from an input operation.
Input Source operations connected to node concat:
Preprocessor_3/sub (defined at /home/<>/.virtualenvs/dl4cv/lib/python3.6/site-packages/object_detection-0.1-py3.6.egg/object_detection/models/faster_rcnn_inception_v2_feature_extractor.py:100)

Training with an increased batch_size works on SSD MobileNet, however.
While I have solved the issue for my use case for the moment, I am posting this question on SO to understand the reason for this behavior.

There are 4 answers

teusbenschop On

When increasing the batch size, the images loaded into the tensors should all be of the same size.

This is how you can get the images to all be the same size:

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 896
    max_dimension: 896
    pad_to_max_dimension: true
  }
}

Padding the images to the maximum dimension, by setting pad_to_max_dimension to "true", causes the images to all be the same size. This enables you to use a batch size larger than one.
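
As a rough sketch (not the API's actual implementation), the effect of keep_aspect_ratio_resizer with pad_to_max_dimension can be imitated in plain TensorFlow 2 like this, using the 896 dimensions from the snippet above and the image shapes from the error:

import tensorflow as tf

def resize_and_pad(image, min_dim=896, max_dim=896):
    # Scale while keeping the aspect ratio: the short side should reach
    # min_dim unless that would push the long side past max_dim.
    shape = tf.cast(tf.shape(image)[:2], tf.float32)
    scale = tf.minimum(min_dim / tf.reduce_min(shape),
                       max_dim / tf.reduce_max(shape))
    new_size = tf.minimum(tf.cast(tf.round(shape * scale), tf.int32), max_dim)
    resized = tf.image.resize(image, new_size)
    # Zero-pad to a fixed max_dim x max_dim canvas so every image ends up
    # with exactly the same shape.
    return tf.image.pad_to_bounding_box(resized, 0, 0, max_dim, max_dim)

# Two images with the shapes from the error log now batch cleanly.
a = resize_and_pad(tf.random.uniform([841, 600, 3]))
b = resize_and_pad(tf.random.uniform([776, 600, 3]))
batch = tf.stack([a, b])  # shape (2, 896, 896, 3)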

Hasan Salim Kanmaz On

You don't need to resize every image in your dataset yourself. TensorFlow can handle it if you specify it in your config file.

The default frcnn and ssd configs are:

## frcnn
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}

## ssd
image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}

If you change the image_resizer of frcnn to a fixed_shape_resizer like in ssd, you can increase the batch size.
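
For illustration, a frcnn pipeline could be changed roughly like this (the 600x600 shape is only an example value; the rest of the pipeline config is assumed to stay unchanged):

## frcnn with a fixed-size resizer
image_resizer {
  fixed_shape_resizer {
    height: 600
    width: 600
  }
}

With every image resized to the same fixed shape, batch_size in train_config can then be raised above 1.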

I implemented it and training went well. Unfortunately, my loss didn't decrease as I expected. Then, I switched back to batch size 4 with 4 workers (which means batch size 1 for each worker). The latter is better in my case, but it may be different for yours.

Nopileos On

Just from the error it seems like your individual inputs have different sizes. I suppose it tries to concatenate (ConcatOp) 4 single inputs into one tensor to build a minibatch as the input.

While trying to concatenate, it has one input of 841x600x3 and one input of 776x600x3 (ignoring the batch dimension). Obviously 841 and 776 are not equal, but they should be. With a batch size of 1 the concat function is probably not called, since you don't need to concatenate inputs to get a minibatch. There also seems to be no other component that relies on a predefined input size, so the network trains normally, or at least doesn't crash.
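
The same shape check can be reproduced outside the object detection API in a few lines; this is only an illustration in plain TensorFlow, using the two shapes from the error log:

import tensorflow as tf

a = tf.zeros([1, 841, 600, 3])
b = tf.zeros([1, 776, 600, 3])
# Fails with "ConcatOp : Dimensions of inputs should match": every dimension
# except the concatenation axis has to agree, and 841 != 776.
batch = tf.concat([a, b], axis=0)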

I would check the dataset you are using to see whether this is expected or whether you have some faulty data samples. If the dataset is fine and this can in fact happen, you need to resize all inputs to some predefined resolution so they can be properly combined into a minibatch.

Amruta Muthal On

The reason you get an error is that you cannot technically train Faster R-CNN in batch mode on a single GPU. This is due to its two-stage architecture. SSD is single-stage and hence can be parallelized to give larger batch sizes.

If you still want to train Faster R-CNN with batch size > 1, you can do so with multiple GPUs. There is a --num_clones parameter that you need to set to the number of GPUs available to you; set num_clones and the batch size to the same value (it should be equal to the number of GPUs you have available). I have used batch sizes of 4, 8 and 16 in my application, for example with --num_clones=2 --ps_tasks=1. Check this link for more details: https://github.com/tensorflow/models/issues/1744
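
For illustration only (not a verified recipe), a multi-GPU run with the legacy train.py entry point that produced the trace above would look roughly like this; the paths are placeholders:

python object_detection/legacy/train.py \
    --logtostderr \
    --pipeline_config_path=path/to/faster_rcnn_pipeline.config \
    --train_dir=path/to/train_dir \
    --num_clones=2 \
    --ps_tasks=1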