This question relates to the neural machine translation example shown here: Neural Machine Translation
self.W1 and self.W2 are initialized as dense layers of 10 units each, in lines 4 and 5 of the __init__ function of class BahdanauAttention.
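For reference, here is a minimal sketch of the layer setup being described. The attached image is not reproduced here, so the base class, the constructor argument name units, and the overall skeleton are assumptions; the 10-unit W1/W2 layers and the single-output V layer come from the question itself.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units=10):
        super().__init__()
        # Two dense layers of 10 units each (lines 4 and 5 in the attached image)
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        # Final dense layer with a single output that produces the score
        self.V = tf.keras.layers.Dense(1)
```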
In the code image attached, I am not sure I understand the feed-forward neural network set up in line 17 and line 18, so I broke this formula down into its parts; see line 23 and line 24.
query_with_time_axis is the input tensor to self.W1 and values is the input to self.W2. Each computes the function Z = WX + b, and the Z's are added together. The dimensions of the tensors being added are (64, 1, 10) and (64, 16, 10). I am assuming that random weight initialization for both self.W1 and self.W2 is handled by Keras behind the scenes.
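To make the shape bookkeeping concrete, here is a small sketch of how tensors with those two shapes add; only the shapes (64, 1, 10) and (64, 16, 10) come from the question, and the zero tensors are placeholders:

```python
import tensorflow as tf

z1 = tf.zeros((64, 1, 10))   # shape of self.W1(query_with_time_axis): one query per batch element
z2 = tf.zeros((64, 16, 10))  # shape of self.W2(values): one vector per encoder time step

# The size-1 time axis of z1 is broadcast across the 16 time steps of z2
z_sum = z1 + z2
print(z_sum.shape)  # (64, 16, 10)
```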
Question:
After adding the Z's together, a non-linearity (tanh) is applied to produce an activation, and this activation is the input to the next layer, self.V, which has just one output and gives us the score. For this last step, we don't apply an activation function (tanh, etc.) to the result of self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values))) to get the single output from this last neural network layer.
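Putting the pieces together, the scoring step looks roughly like the sketch below; the 256-dimensional feature size is an arbitrary placeholder, and only the batch size 64, the 16 encoder steps, and the 10 units come from the question:

```python
import tensorflow as tf

W1 = tf.keras.layers.Dense(10)
W2 = tf.keras.layers.Dense(10)
V = tf.keras.layers.Dense(1)

query_with_time_axis = tf.random.normal((64, 1, 256))  # decoder state with an added time axis
values = tf.random.normal((64, 16, 256))               # encoder outputs, one per time step

# Dense layers act on the last axis: (64, 1, 10) + (64, 16, 10) broadcasts to (64, 16, 10),
# tanh keeps that shape, and V maps the last axis to size 1
score = V(tf.nn.tanh(W1(query_with_time_axis) + W2(values)))
print(score.shape)  # (64, 16, 1): one raw, unactivated scalar per encoder time step
```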
Is there a reason why an activation function was not used for this last step?
The outputs of the attention layer form the so-called attention energies, i.e., one scalar for each encoder output. These numbers get stacked into a vector, and this vector is normalized using softmax, yielding the attention distribution.
So, in fact, there is a non-linearity applied in the next step: the softmax. If you applied an activation function before the softmax, you would only restrict the space of distributions that the softmax can produce.
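As a sketch of that normalization step (the random scores below just stand in for the output of self.V; the shapes follow the question):

```python
import tensorflow as tf

score = tf.random.normal((64, 16, 1))              # raw attention energies, one per encoder step
attention_weights = tf.nn.softmax(score, axis=1)   # normalize over the 16 encoder steps
print(tf.reduce_sum(attention_weights, axis=1)[0]) # each batch element sums to 1: a distribution

# If tanh were applied to the scores first, every energy would be squashed into [-1, 1],
# so no attention weight could exceed e**2 (about 7.4) times another and the softmax
# could never produce sharply peaked, near one-hot distributions.
```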