ViT model reconstruction confusion while trying to insert layers into the old model


I ran into a problem while trying to reconstruct a model from an old one by replicating it layer by layer. The problem is that the output tensor of the reconstructed model (new) does not have the same shape as the original one (old): new: [4, 196, 10] vs. old: [4, 10]. The code I use and the details of the problem are below. My code:


import torch
import torch.nn as nn
import timm

#### the definition of the ViT classifier: a pretrained timm backbone
#### with its classification head replaced for num_classes outputs
class ViTImageClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = timm.create_model('vit_base_patch16_224', pretrained=True)
        self.backbone.head = nn.Linear(self.backbone.head.in_features, num_classes)

    def forward(self, x):
        x = self.backbone(x)
        return x

model = ViTImageClassifier(num_classes=10)

inputs = torch.randn(4, 3, 224, 224)  # batch of 4 RGB 224x224 images
output = model(inputs)
print(output.shape)  # torch.Size([4, 10])
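For reference, the top-level modules of the backbone can be listed like this (the names below are what a recent timm version registers; they match the layers I copy in my replication code):

for name, _ in model.backbone.named_children():
    print(name)
# patch_embed, pos_drop, patch_drop, norm_pre, blocks,
# norm, fc_norm, head_drop, head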

Here's the output of the original model: its shape is [4, 10]. In the printed structure, the first transformer block is followed by 11 identical ones (12 blocks in total, inside blocks), and after them come the bottom layers (norm, fc_norm, head_drop, head). Following this structure, I copied the model layer by layer. Here's my code:

# rebuild the model by collecting its submodules in order,
# then wrapping them in a single nn.Sequential
layers = [
    model.backbone.patch_embed,
    model.backbone.pos_drop,
    model.backbone.patch_drop,
    model.backbone.norm_pre,
]
for i in range(12):  # the 12 transformer blocks
    layers.append(model.backbone.blocks[i])
layers += [
    model.backbone.norm,
    model.backbone.fc_norm,
    model.backbone.head_drop,
    model.backbone.head,
]
new_model = nn.Sequential(*layers)

out = new_model(inputs)
print(out.shape)  # torch.Size([4, 196, 10]) instead of [4, 10]

Here's my confusion: the original model outputs a tensor of shape [4, 10], while the replicated model outputs [4, 196, 10], and the values don't match either. Can anybody explain why and suggest a solution? Looking forward to your reply!

What I've tried so far: printing the model's output and checking its shape, summing the values along the middle dimension of the replicated output, and applying a sigmoid to it. None of these match the original output. I'm a newbie to ViT and torch, so please bear with me.


1 Answer

Sandro

You copy all the layers of the model, but that's not all that happens inside it. The connections between the layers are not necessarily straightforward.

In the case of the Transformer, after all the "blocks" it has generated a context-dependent embedding for every 16x16 square of the picture; for a 224x224 input there are 196 such embeddings. In the (JAX) source code of the original implementation, you can see that there are 3 major ways to deal with this:

if self.classifier == 'token':
    x = x[:, 0]
elif self.classifier == 'gap':
    x = jnp.mean(x, axis=list(range(1, x.ndim - 1)))  # (1,) or (1,2)
elif self.classifier in ['unpooled', 'token_unpooled']:
    pass
else:
    raise ValueError(f'Invalid classifier={self.classifier}')
1. Take the representation of the first square (the class token) to represent the entirety.
2. Take the mean of all representations.
3. Skip this step and continue with all representations (this is what happens in your code).

Because this is not done in a named layer, you would have to recreate that part yourself.
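For the timm model in your question, a minimal sketch of that recreation could look like the following (reusing model and inputs from your code). It assumes a recent timm version, where the class token and positional embeddings are also added outside any named layer (in the backbone's _pos_embed step), which your nn.Sequential skips as well; that is why your replica produces 196 tokens instead of 197. AddClsTokenAndPosEmbed and TakeClsToken are made-up helper names, not timm API:

class AddClsTokenAndPosEmbed(nn.Module):
    # recreates timm's unnamed _pos_embed step for this model:
    # prepend the class token, then add positional embeddings
    def __init__(self, cls_token, pos_embed):
        super().__init__()
        self.cls_token = cls_token      # (1, 1, 768) parameter
        self.pos_embed = pos_embed      # (1, 197, 768) parameter

    def forward(self, x):
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat((cls, x), dim=1)  # (B, 196, 768) -> (B, 197, 768)
        return x + self.pos_embed

class TakeClsToken(nn.Module):
    # pooling option 1 from the snippet above: keep only the first token
    def forward(self, x):
        return x[:, 0]                  # (B, 197, 768) -> (B, 768)

new_model = nn.Sequential(
    model.backbone.patch_embed,
    AddClsTokenAndPosEmbed(model.backbone.cls_token, model.backbone.pos_embed),
    model.backbone.pos_drop,
    model.backbone.patch_drop,
    model.backbone.norm_pre,
    *model.backbone.blocks,
    model.backbone.norm,
    TakeClsToken(),                     # the pooling that was missing
    model.backbone.fc_norm,
    model.backbone.head_drop,
    model.backbone.head,
)

out = new_model(inputs)
print(out.shape)  # torch.Size([4, 10])

With both unnamed steps recreated, the replica's output should again have shape [4, 10] and, with the model in eval mode (so dropout is disabled), should match the original model's output.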