How does padding work when using a pytorch TransformerEncoder?


I'm trying to make a TransformerEncoder work with variable length sequences. I understand I can pass a src_key_padding_mask to the forward method.

Here's some example code.

import torch
import torch.nn as nn

embedding_dim = 4
num_heads = 1
ff_dim = 16

encoder = nn.TransformerEncoderLayer(
    d_model=embedding_dim,
    nhead=num_heads,
    dim_feedforward=ff_dim,
    batch_first=True,
)

# Batch of 3 sequences of length 6; zero out the trailing positions that represent padding.
input_tensor = torch.randn(3, 6, embedding_dim)
input_tensor[0, 5, :] = 0
input_tensor[0, 4, :] = 0
input_tensor[1, 5, :] = 0
print(f"input\n{input_tensor}")

print(f"no mask\n{encoder(input_tensor)}")

# True marks a padded position that attention should ignore.
bool_src_key_padding_mask = torch.tensor(
        [[False, False, False, False, True, True],
         [False, False, False, False, False, True],
         [False, False, False, False, False, False]])

print(f"mask\n{encoder(input_tensor, src_key_padding_mask=bool_src_key_padding_mask)}")

I would expect the result of the last line to be a tensor with zeros at the padded positions (0 being the padding value here), but that's not what I get. What am I doing wrong?


1 Answer

Answer by Karl

src_key_padding_mask makes the masked positions contribute nothing to the attention computed for the other positions in the sequence. It does not stop computation at the masked positions themselves; the layer still produces an output vector for every position, because batched GPU computation runs over the full padded tensor regardless. So you should not expect zeros there; just ignore (or zero out) those outputs downstream.
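
A minimal sketch (not part of the original answer) that illustrates this: with the mask applied, changing the values stored at the padded positions does not change the outputs at the unmasked positions, yet the layer still produces outputs at the padded positions, and those are not required to be zero. The layer configuration mirrors the question's; eval() is used so dropout does not make the two forward passes incomparable.

import torch
import torch.nn as nn

torch.manual_seed(0)

embedding_dim = 4
encoder = nn.TransformerEncoderLayer(
    d_model=embedding_dim, nhead=1, dim_feedforward=16, batch_first=True
)
encoder.eval()  # turn off dropout so the two forward passes are comparable

# One sequence of length 6 whose last two positions are padding.
x = torch.randn(1, 6, embedding_dim)
mask = torch.tensor([[False, False, False, False, True, True]])

out = encoder(x, src_key_padding_mask=mask)

# Overwrite the padded positions with different random values and run again.
x_altered = x.clone()
x_altered[:, 4:, :] = torch.randn(1, 2, embedding_dim)
out_altered = encoder(x_altered, src_key_padding_mask=mask)

# The unmasked positions are unaffected by whatever sits in the padded slots...
print(torch.allclose(out[:, :4], out_altered[:, :4], atol=1e-6))  # True
# ...but the layer still produces output rows at the padded positions, and they
# are not guaranteed to be zero; ignore (or zero out) those rows downstream.
print(out[:, 4:])

If zeros at the padded positions are actually needed, for example before sum- or mean-pooling over the sequence, one option is to zero them out after the encoder, e.g. out * (~mask).unsqueeze(-1).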