I am new to the transformer decoder and confused about the mask in attention. It seems to prevent each word from attending to the words after it, so that only earlier words are visible. If that is what it does, is it already making the network position-aware, so that positional encoding is no longer needed?
Let's assume there is no positional encoding and consider the inputs "I am good" and "am I good". Say that after the first decoder layer, "I" becomes the vector x, "am" becomes y, and "good" becomes z. The z of these two input sequences will be exactly the same, because the last position attends to the same set of words in both cases. But x and y will be completely different because of the mask in the decoder: each of them only attends to the words up to its own position. Then, when x, y, z become the inputs of the second decoder layer, the output at z's position will differ between the two sequences, because x and y differ. So the whole network is actually position-aware.
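Here is a quick numerical check of this reasoning: a toy single-head causal attention in PyTorch, with no positional encoding and identity Q/K/V projections (the embeddings are random, purely for illustration):

```python
import torch

torch.manual_seed(0)

d = 8  # toy embedding dimension
emb = {"I": torch.randn(d), "am": torch.randn(d), "good": torch.randn(d)}

def causal_self_attention(tokens):
    """One head of self-attention with a causal mask and NO positional encoding.
    Q, K, V projections are the identity, purely for illustration."""
    X = torch.stack([emb[t] for t in tokens])              # (T, d)
    T = len(tokens)
    scores = X @ X.T / d ** 0.5                            # (T, T)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # hide positions after t
    return torch.softmax(scores, dim=-1) @ X               # (T, d)

out1 = causal_self_attention(["I", "am", "good"])
out2 = causal_self_attention(["am", "I", "good"])

print(torch.allclose(out1[-1], out2[-1]))  # True: "good" attends to the same unordered set
print(torch.allclose(out1[0], out2[0]))    # False: position 0 sees only "I" vs only "am"
```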
Am I missing something?
I tried reading the papers but haven't figured it out. Thanks for any help.
The positional encoding is introduced in the Transformer to provide information about the positions of tokens in the input sequence. Since self-attention is permutation-invariant (all tokens are processed in parallel, with no recurrence), the model has no inherent knowledge of token order; positional encoding supplies that ordering information.
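For concreteness, here is a minimal sketch of the sinusoidal positional encoding from the original paper (the sequence length and model dimension below are chosen arbitrarily):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same argument)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The encoding is added to the token embeddings before the first layer,
# so identical tokens at different positions get different representations.
pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # torch.Size([50, 16])
```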
The attention mask in the Transformer decoder ensures that, during self-attention, each position can only attend to itself and the positions before it in the sequence. This is crucial for maintaining the autoregressive property during training: without the mask, the model would have access to information from future positions, which amounts to data leakage and makes the training objective trivial.
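As a rough sketch of what that mask looks like in practice (a toy example, not the implementation of any particular library):

```python
import torch

T = 4  # sequence length
# True marks the entries to hide: position t must not attend to positions > t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])

# The mask is applied to the raw attention scores before the softmax,
# so each row's attention weights are spread only over positions <= t.
scores = torch.randn(T, T)
weights = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)
```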