I'm implementing a U-Net-based architecture in PyTorch. At train time I have patches of size 256x256, which don't cause any problem. However, at test time I have full-HD images (1920x1080), and this causes a problem with the skip connections.

Downsampling 1920x1080 three times gives 240x135. If I downsample one more time, the resolution becomes 120x68, which when upsampled gives 240x136. Now I cannot concatenate these two feature maps. How can I solve this?
PS: I thought this was a fairly common problem, but I couldn't find any solution, or even a mention of it, anywhere on the web. Am I missing something?
This is a very common problem in segmentation networks, where skip connections are involved in the decoding process. Networks usually (depending on the actual architecture) require an input whose side lengths are integer multiples of the largest stride (8, 16, 32, etc.).
There are two main ways:

1. Resize (interpolate) the input to the nearest valid size before the forward pass, then resize the output back.
2. Pad the input to the nearest valid size, then crop the output back.
I prefer (2), because (1) slightly changes every pixel value through interpolation, leading to unnecessary blurriness. Note that with both methods we usually need to recover the original shape afterwards.
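For illustration, option (1) might look like this. This is a minimal sketch; the stride of 16 (four stride-2 downsamplings) and the bilinear interpolation mode are assumptions, not part of any specific architecture:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 1080, 1920)  # full-HD input, NCHW
d = 16                             # largest stride of the network (assumed)

# round each side length up to the nearest multiple of d
new_h = (1080 + d - 1) // d * d    # 1088
new_w = (1920 + d - 1) // d * d    # 1920 (already valid)

# option (1): resize to the nearest valid size; interpolation touches every pixel
resized = F.interpolate(x, size=(new_h, new_w), mode='bilinear', align_corners=False)
# ... run the network on `resized` ...
# then resize the output back to the original resolution
restored = F.interpolate(resized, size=(1080, 1920), mode='bilinear', align_corners=False)
```

Because both resizes interpolate, pixel values are altered everywhere, which is exactly the blurriness mentioned above.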
My favorite helper for this task symmetrically pads the height and width to the next valid size, and the padding amounts it returns make it easy to crop the network output back to the original resolution (see the reference below for the version I use).
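A sketch of such a padding helper, plus a small test. The name `pad_divide_by` follows the linked reference, but this is my reconstruction, not the verbatim linked code:

```python
import torch
import torch.nn.functional as F

def pad_divide_by(x, d):
    """Symmetrically pad a (N, C, H, W) tensor so that H and W become
    multiples of d; returns the padded tensor and the pad amounts."""
    h, w = x.shape[-2:]
    new_h = (h + d - 1) // d * d
    new_w = (w + d - 1) // d * d
    lh, lw = (new_h - h) // 2, (new_w - w) // 2
    uh, uw = new_h - h - lh, new_w - w - lw
    pad = (lw, uw, lh, uh)          # F.pad order: left, right, top, bottom
    return F.pad(x, pad), pad

def unpad(x, pad):
    """Undo pad_divide_by: crop back to the original resolution."""
    lw, uw, lh, uh = pad
    h, w = x.shape[-2:]
    return x[..., lh:h - uh, lw:w - uw]

# test: a full-HD input, padded for a network with largest stride 16
x = torch.randn(1, 3, 1080, 1920)
padded, pad = pad_divide_by(x, 16)
print(padded.shape)              # torch.Size([1, 3, 1088, 1920])
restored = unpad(padded, pad)
print(restored.shape)            # torch.Size([1, 3, 1080, 1920])
print(torch.equal(restored, x))  # True
```

Run the network on `padded`, then call `unpad` on the output with the same `pad` tuple. Since padding only adds border pixels and cropping removes exactly those, the interior values are untouched, unlike with resizing.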
Reference: https://github.com/seoungwugoh/STM/blob/905f11492a6692dd0d0fa395881a8ec09b211a36/helpers.py#L33