I am reading through the residual learning paper, and I have a question: what is the "linear projection" mentioned in section 3.2? It probably looks simple once you get it, but I could not grasp the idea...
Can someone provide a simple example?
A linear projection is one where each new feature is simply a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. If x is the vector of N input features and W is an M-by-N matrix, then the matrix product Wx yields M new features, where each one is a linear projection of x. Each row of W is a set of weights that defines one of the M linear projections (i.e., each row of W contains the coefficients for one of the weighted sums of x).
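A minimal NumPy sketch of the idea (the sizes N = 3 and M = 5 are arbitrary, just for illustration):

```python
import numpy as np

# Hypothetical sizes: N = 3 input features, M = 5 projected features.
N, M = 3, 5
rng = np.random.default_rng(0)

x = rng.standard_normal(N)       # input feature vector
W = rng.standard_normal((M, N))  # each row holds the weights of one projection

y = W @ x                        # M new features, each a weighted sum of x

# Row i of W defines the i-th projection: y[i] is the dot product W[i] . x
assert y.shape == (M,)
assert np.allclose(y[0], np.dot(W[0], x))
```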
In PyTorch (in particular torchvision/models/resnet.py), at the end of a Bottleneck you will have one of two scenarios:

1. The input vector x's number of channels, say x_c (not its spatial resolution, but its channels), is less than or equal to the number of output channels after the conv3 layer of the Bottleneck, say d. This can then be alleviated by a 1-by-1 convolution with in_planes = x_c and out_planes = d, with stride 1, followed by batch normalization, and then the addition F(x) + x occurs, assuming x and F(x) have the same spatial resolution.

2. Neither the spatial resolution of x nor its number of channels matches the output of the Bottleneck layer, in which case the 1-by-1 convolution mentioned above needs stride 2 in order for both the spatial resolution and the number of channels to match for the element-wise addition (again with batch normalization of x before the addition).
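A 1-by-1 convolution with stride 2 is just the linear projection Wx applied to the channel vector at every other spatial position. Here is a minimal NumPy sketch of that shortcut projection (the shapes and the function name are illustrative, not torchvision's actual code, which uses nn.Conv2d plus nn.BatchNorm2d):

```python
import numpy as np

def shortcut_projection(x, W, stride=2):
    """Sketch of a 1x1 convolution as a per-position linear projection.

    x: input of shape (batch, in_channels, H, W_sp), NCHW as in PyTorch
    W: weights of shape (out_channels, in_channels)

    A 1x1 conv with stride s applies W to the channel vector at every
    s-th spatial position; there is no spatial mixing at all.
    """
    x_strided = x[:, :, ::stride, ::stride]        # (B, C_in, H/s, W/s)
    # Project the channel axis independently at each (batch, h, w) position.
    return np.einsum('oc,bchw->bohw', W, x_strided)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64, 56, 56))   # e.g. x_c = 64 input channels
W = rng.standard_normal((256, 64))         # project to d = 256 channels

y = shortcut_projection(x, W, stride=2)
assert y.shape == (2, 256, 28, 28)         # channels and resolution now match
```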
First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.

x is the input data (called a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv+relu+batchnorm in this paper), and y combines the two together (forming the output channel). The result of F is also of rank 4, and most of its dimensions will be the same as in x, except for one. That's exactly what the transformation should patch.

For example, the x shape might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. F(x) might be (64, 32, 32, 16): the batch size never changes and, for simplicity, the ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters, e.g. 16.

So, in order for y = F(x) + x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).

I'd like to stress here that "reshaping" is not what numpy.reshape does. Instead, the last (channel) dimension of x is padded with 13 zeros. If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions: we start to think that our vector is the same, but there are 13 more dimensions out there. None of the other x dimensions are changed.

Here's the link to the code in Tensorflow that does this.
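Such zero-padding of the channel dimension can be sketched in NumPy (the shapes follow the example above; an actual framework implementation may differ):

```python
import numpy as np

x = np.ones((64, 32, 32, 3))   # (batch, height, width, channels)

# Pad only the last (channel) axis: 3 -> 16, i.e. append 13 zeros.
x_padded = np.pad(x, [(0, 0), (0, 0), (0, 0), (0, 13)])

assert x_padded.shape == (64, 32, 32, 16)
# The original 3 channels are untouched; the 13 new ones are all zero.
assert np.array_equal(x_padded[..., :3], x)
assert not x_padded[..., 3:].any()
```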