I see that the TensorFlow Object Detection API allows one to customise the image sizes that are fed in. My question is how this works with pretrained weights, which are usually trained on 224*224 or sometimes 300*300 images.
In other frameworks I have used, such as Caffe R-FCN, YOLO, and Keras SSD, the images are downscaled to the standard size that the pretrained weights were trained on.
Are the pretrained weights used by TF for the 300*300 input size? And if so, how can these weights be used on custom image sizes? Does TF downscale the images to the size the weights expect?
To my understanding, the input size only affects the input layer of the network. But please correct me if that is wrong; I'm still quite new to the whole deep learning paradigm.
I have used three models from the TensorFlow Object Detection API: Faster R-CNN and R-FCN, both with a ResNet-101 feature extractor, and an SSD model with Inception V2.

The SSD model reshapes the images to a fixed `M x M` size. This is also mentioned in the paper "Speed/accuracy trade-offs for modern convolutional object detectors" by Huang et al. The Faster R-CNN and R-FCN models, on the other hand, are trained on images scaled to M pixels on the shorter edge, which keeps the aspect ratio; this is what the `keep_aspect_ratio_resizer` in the config proto does. In both cases the resizing is located in the preprocessing stage of the model.

Another method would be to keep the aspect ratio and crop a fixed size from the image; one can then crop at different positions (center, top-left, top-right, bottom-left, bottom-right, etc.) to make the model robust. More sophisticated approaches resize the image to several scales before cropping, or use different aspect ratios in the convolutional layers with an adaptive pooling size afterwards to produce the same feature dimension, like SPP (see "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition" by He et al. for more detail).

To my understanding, this makes the architectures resilient to different image sizes, so the internal weights of the hidden layers are not affected by the input size of the image.
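For reference, the resizing behaviour is controlled by the `image_resizer` block in the pipeline config. Below is a minimal sketch of the two variants; the dimensions are the ones I have seen in the sample configs shipped with the API (300*300 for SSD with Inception V2, 600/1024 for Faster R-CNN and R-FCN with ResNet-101), so adjust them to your own pipeline:

```
# SSD-style: every image is stretched to a fixed shape.
image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}

# Faster R-CNN / R-FCN style: keep the aspect ratio and scale the image
# so the shorter edge is at least min_dimension and the longer edge is
# at most max_dimension.
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
```

Either way, the pretrained weights themselves are reused unchanged; only this preprocessing step decides what spatial size the feature extractor actually sees.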