# Importing necessary modules
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Loading pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Encoding input text
input_ids = tokenizer.encode("The dog is running", return_tensors='pt')
# Generating model output with attention information
output = model.generate(
    input_ids,
    max_length=6,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    output_attentions=True,
    return_dict_in_generate=True,
)
# Extracting attention tensors
attn = output.attentions
My observations are as follows.
- The attn variable is a tuple with two items, one per newly generated token (6 - 4 = 2).
- Each item is itself a tuple of 12 tensors, one per transformer layer (block) in the GPT-2 model.
- The tensors in the first item have shape [1, 12, 4, 4], and the tensors in the second item have shape [1, 12, 1, 5].
- When visualized, a tensor of shape [1, 12, 4, 4] shows a lower-triangular pattern, which looks like masked (causal) attention.
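To make the shape pattern explicit, here is a minimal sketch of the shapes I would expect if generation reuses cached keys and values, so that only the first step processes the whole prompt. The caching behaviour is my assumption about what generate does, and the helper names expected_attention_shapes and causal_pattern are hypothetical, not part of transformers.

```python
def expected_attention_shapes(prompt_len, max_length, num_heads=12, batch_size=1):
    """Return one attention shape per generation step.

    Assumption: keys/values are cached, so step 0 attends over the whole
    prompt and every later step has a single query position.
    """
    shapes = []
    num_new_tokens = max_length - prompt_len
    for step in range(num_new_tokens):
        # Step 0 feeds the full prompt; later steps feed only the newest token.
        query_len = prompt_len if step == 0 else 1
        # Keys cover the prompt plus every token generated before this step.
        key_len = prompt_len + step
        shapes.append((batch_size, num_heads, query_len, key_len))
    return shapes


def causal_pattern(n):
    """Lower-triangular 0/1 pattern: query q may attend to keys k <= q."""
    return [[1 if k <= q else 0 for k in range(n)] for q in range(n)]


shapes = expected_attention_shapes(prompt_len=4, max_length=6)
print(shapes)  # [(1, 12, 4, 4), (1, 12, 1, 5)]
for row in causal_pattern(4):
    print(row)
```

Under that assumption the sketch reproduces both shapes I observed: (query length, key length) is (4, 4) for the first step and (1, 5) for the second.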
Here are my questions.
- What do tensors with shapes [1, 12, 4, 4] and [1, 12, 1, 5] represent? How are they different?
- Which decoding step does each of these tensors come from?