Masked self-attention in the transformer's decoder


I'm writing my thesis on attention mechanisms. In the paragraph where I explain the transformer's decoder, I wrote this:

The first sub-layer is called masked self-attention, in which the masking operation consists of preventing the decoder from attending to subsequent words. That is, while training a transformer for translation, the target translation is accessible; during inference, that is, when translating new sentences, the target translation is not available. Therefore, when computing the probabilities of the next word in the sequence, the network must not have access to that word. Otherwise, the translation task would be trivial and the network would not learn to predict the translation correctly.

I don't know whether I said something wrong in the previous part as well, but my professor thinks I made mistakes in the following part:

To understand in a simple way how the masked self-attention layer works, let's go back to the example "Orlando Bloom loves Miranda Kerr" (x1 x2 x3 x4 x5). If we consider the inputs as vectors x1, x2, x3, x4, x5 and we want to translate the word x3, corresponding to "loves", we need to make sure that the following words x4 and x5 do not influence the translation y3. To prevent this influence, masking sets the weights of x4 and x5 to zero. Then a normalization of the weights is performed so that the sum of the elements of each column in the matrix is equal to 1. The result is a matrix with normalized weights in each column.
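To make the question more concrete, here is a minimal NumPy sketch of masked self-attention as I understand the common formulation (the scores of future positions are set to -inf before the softmax, so their attention weights come out as zero); the function and variable names are mine, just for illustration:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal (masked) self-attention over a single sequence.

    Q, K, V: arrays of shape (seq_len, d_k). Scores of future positions
    are set to -inf before the softmax, so their attention weights are 0
    and each query's weights still sum to 1.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions j <= i
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Softmax over the keys for each query, so each row sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 5 token vectors for "Orlando Bloom loves Miranda Kerr";
# the output for "loves" (index 2) only depends on itself and earlier tokens
x = np.random.randn(5, 8)
out = masked_self_attention(x, x, x)
```

In this sketch the normalization is done over the keys for each query (i.e., row-wise), and I'm not sure whether that matches the column-wise description I wrote above.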

Can someone please tell me where the mistakes are?
