I'm trying to make a TransformerEncoder work with variable-length sequences. I understand I can pass a src_key_padding_mask to the forward method. Here's some example code.
import torch
import torch.nn as nn

embedding_dim = 4
num_heads = 1
ff_dim = 16

encoder = nn.TransformerEncoderLayer(
    d_model=embedding_dim,
    nhead=num_heads,
    dim_feedforward=ff_dim,
    batch_first=True,
)

# Batch of 3 sequences with max length 6; zero out the "padding" positions.
input_tensor = torch.randn(3, 6, embedding_dim)
input_tensor[0, 5, :] = 0  # sequence 0: last two positions are padding
input_tensor[0, 4, :] = 0
input_tensor[1, 5, :] = 0  # sequence 1: last position is padding

print(f"input\n{input_tensor}")
print(f"no mask\n{encoder(input_tensor)}")

# True marks a position that attention should ignore.
bool_src_key_padding_mask = torch.tensor(
    [[False, False, False, False, True,  True],
     [False, False, False, False, False, True],
     [False, False, False, False, False, False]])

print(f"mask\n{encoder(input_tensor, src_key_padding_mask=bool_src_key_padding_mask)}")
I would expect the result of the last line to print a tensor that still contains the padding tokens (0 in this case) at the masked positions, but it doesn't. What am I doing wrong?
src_key_padding_mask causes the masked positions to contribute nothing to the attention calculation for the other positions in the sequence. It does not stop computation at the masked positions themselves: the whole padded batch still flows through the same dense tensor operations, so the encoder produces (meaningless) outputs at the padded positions, which you should simply ignore or zero out afterwards. Such is the nature of batched GPU computation.
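Here is a minimal sketch, reusing encoder, input_tensor, and bool_src_key_padding_mask from your snippet (the names out_masked, noisy_input, and keep are just for illustration). It demonstrates both points: whatever sits in the padded slots has no effect on the outputs at the unmasked positions, and if you want zeros at the padded positions you have to apply the mask to the output yourself.

# Disable dropout so the two forward passes below are comparable.
encoder.eval()

out_masked = encoder(input_tensor, src_key_padding_mask=bool_src_key_padding_mask)

# Put garbage into the padded slots of sequence 0 and run the encoder again.
noisy_input = input_tensor.clone()
noisy_input[0, 4:, :] = torch.randn(2, embedding_dim)
out_noisy = encoder(noisy_input, src_key_padding_mask=bool_src_key_padding_mask)

# The unmasked positions of sequence 0 are unchanged: the padded positions
# contributed nothing to their attention, garbage or not.
print(torch.allclose(out_masked[0, :4], out_noisy[0, :4]))  # expected: True

# If you want zeros at the padded positions, mask the output explicitly.
keep = (~bool_src_key_padding_mask).unsqueeze(-1)  # shape (3, 6, 1)
print(out_masked * keep)

In practice you usually just ignore the outputs at padded positions downstream, e.g. by masking them out before a pooling step or a loss computation, since the encoder will always produce some value there.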