I am looking at image embeddings and wondering why flipping images changes the output. Consider resnet18 with the head removed for example:
import torch
import torch.nn as nn
import torchvision.models as models
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = models.resnet18(pretrained=True)
model.fc = nn.Identity()
model = model.to(device)
model.eval()
x = torch.randn(20, 3, 128, 128).to(device)
with torch.no_grad():
    y1 = model(x)            # original batch
    y2 = model(x.flip(-1))   # horizontally flipped
    y3 = model(x.flip(-2))   # vertically flipped
The tail of the model looks like this; most importantly it ends with an AdaptiveAvgPool2d, where the spatial features are pooled down to a single pixel per channel:
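(abridged output of print(model), showing just the relevant tail)

  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Identity()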
The way I am thinking about it: since we just have convolutions stacked on convolutions before the pooling, flipping the image should simply flip the final feature map the same way. The average pooling then just averages that feature map (per channel) and is invariant to its orientation; AdaptiveMaxPool should behave the same way.
The key difference from 'normal' convnets is that here we pool/average all the way down to a single pixel.
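For example, the pooling step on its own really is flip-invariant; a quick check on a random stand-in for the pre-pooling feature map (512 channels, 4x4 spatial size for a 128x128 input):

import torch
fmap = torch.randn(20, 512, 4, 4)                        # stand-in for the last feature map
pool = torch.nn.AdaptiveAvgPool2d(1)
print(torch.allclose(pool(fmap), pool(fmap.flip(-1))))   # True
print(torch.allclose(pool(fmap), pool(fmap.flip(-2))))   # True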
However, when I look at y1 - y2, y1 - y3, and y2 - y3, the values are significantly different from zero (see the check below). What am I thinking wrong about?
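This is roughly how I am inspecting the differences (the exact numbers vary with the random input):

print((y1 - y2).abs().max().item())
print((y1 - y3).abs().max().item())
print((y2 - y3).abs().max().item())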
The pooling output changes because the input that reaches the pooling layer is not the flipped feature map you are expecting.
Short answer: the input is flipped, but the Conv2d kernels are not. The kernels would also have to be flipped along the same axis for the feature map, and hence the pooled output, to match.
Long answer: as per the tail of the model, the output of Conv2d is passed to AdaptiveAvgPool2d. Let's ignore BatchNorm for now for the sake of understanding.

For simplicity, consider an input tensor x = [1, 3, 5, 4, 7] and a kernel k = [0.3, 0.5, 0.8]. As the kernel slides over the input with stride=1, the output at position [0, 0] is 0.3*1 + 0.5*3 + 0.8*5 = 5.8 and at position [0, 2] it is 0.3*5 + 0.5*4 + 0.8*7 = 9.1, so the full output is [5.8, 6.6, 9.1].

Now if the input is flipped, x_flip = [7, 4, 5, 3, 1], the output at position [0, 0] is 0.3*7 + 0.5*4 + 0.8*5 = 8.1 and at position [0, 2] it is 0.3*5 + 0.5*3 + 0.8*1 = 3.8, giving [8.1, 6.1, 3.8].

If the feature map simply flipped along with the input, we would get [9.1, 6.6, 5.8] instead. The flipped-input output contains genuinely different values, not just reordered ones, so the average also changes (about 7.17 vs. 6.0), and the final output after pooling is different.
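You can sanity-check these numbers with a plain 1-D convolution (a small sketch using F.conv1d with stride 1 and no padding, not the actual ResNet layers):

import torch
import torch.nn.functional as F

x = torch.tensor([[[1., 3., 5., 4., 7.]]])
k = torch.tensor([[[0.3, 0.5, 0.8]]])

print(F.conv1d(x, k))           # tensor([[[5.8000, 6.6000, 9.1000]]])
print(F.conv1d(x.flip(-1), k))  # tensor([[[8.1000, 6.1000, 3.8000]]])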
So, to get the exact flipped feature map (and therefore the same pooled embedding), you would need to flip the kernels along the same axis as the input, as the sketch below shows.
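A minimal sketch of this (plain F.conv2d, stride 1, no padding, arbitrary toy shapes): flipping both the input and the kernels along the same axis mirrors the feature map exactly, so the global average is unchanged:

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 8)   # toy input
w = torch.randn(2, 1, 3, 3)   # toy conv kernels

y      = F.conv2d(x, w)                    # original feature map
y_flip = F.conv2d(x.flip(-1), w.flip(-1))  # flip input AND kernels along the width

print(torch.allclose(y.flip(-1), y_flip, atol=1e-6))                                # True: map is just mirrored
print(torch.allclose(y.mean(dim=(-2, -1)), y_flip.mean(dim=(-2, -1)), atol=1e-6))   # True: pooled values match

Of course, in the pretrained network the learned kernels stay fixed, which is exactly why flipping only the input changes the embedding.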