My task was to translate English sentences into German. I first did this with a normal encoder-decoder network, which gave fairly good results. Then I tried the same task with exactly the same model, but with Bahdanau attention added. The model without attention outperformed the one with attention.
The loss of the model without attention went from approximately 8.0 to 1.4 in 5 epochs and to 1.0 in 10 epochs, and it was still decreasing, though at a slower rate.
The loss of the model with attention went from approximately 8.0 to 2.6 in 5 epochs and then barely improved.
Neither model was overfitting, as the validation loss was also decreasing for both.
Each English sentence had 47 words (after padding), and each German sentence had 54 words (after padding). I had 7000 English and 7000 German sentences in the training set and 3000 in the validation set.
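Because every sentence is padded to a fixed length, the reported loss values depend on whether padded positions are counted. The post does not say how padding was handled, so the following is only a sketch of the difference, in plain Python. `PAD_ID`, `masked_cross_entropy`, and `unmasked_cross_entropy` are hypothetical names, and the padding id of 0 is an assumption.

```python
import math

PAD_ID = 0  # assumed padding token id; the post does not state it


def masked_cross_entropy(probs, targets, pad_id=PAD_ID):
    """Mean negative log-likelihood over non-padding target positions.

    probs:   list of T probability distributions (each a list over the vocabulary)
    targets: list of T target token ids, padded with pad_id
    """
    total, count = 0.0, 0
    for dist, tok in zip(probs, targets):
        if tok == pad_id:
            continue  # padded positions contribute nothing to the loss
        total += -math.log(dist[tok])
        count += 1
    return total / max(count, 1)


def unmasked_cross_entropy(probs, targets):
    """Same loss, but averaged over every position, padding included."""
    return sum(-math.log(d[t]) for d, t in zip(probs, targets)) / len(targets)


# Toy example: 2 real tokens followed by 4 padding tokens, vocabulary of size 3.
targets = [1, 2, 0, 0, 0, 0]
probs = [[0.5, 0.25, 0.25]] * 2 + [[0.98, 0.01, 0.01]] * 4

masked = masked_cross_entropy(probs, targets)
unmasked = unmasked_cross_entropy(probs, targets)
```

In this toy case the unmasked average comes out lower than the masked one, because the easy-to-predict padding positions dilute it. Two models are only comparable if their losses are computed the same way.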
I tried almost everything: different learning rates, optimizers, batch sizes, and activation functions in the model, batch and layer normalization, and different numbers of LSTM units for the encoder and decoder. Nothing made much difference except normalization and increasing the data, with which the loss goes down to approximately 1.5 but then stops improving again!
Why did this happen? Why did the model with Bahdanau attention fail while the one without any kind of attention performed well?
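For reference, the additive (Bahdanau) attention step for a single decoder time step can be sketched as below. This is not the code from the model in question; it is a minimal plain-Python illustration, and the names `additive_attention`, `W_q`, `W_k`, `v`, and `mask` are hypothetical. It assumes at least one encoder position is unmasked. Two details worth checking in any implementation are that the softmax runs over the encoder time axis and that padded encoder positions receive no weight.

```python
import math


def additive_attention(query, keys, W_q, W_k, v, mask=None):
    """Bahdanau-style additive attention for one decoder step.

    query: decoder state, list of length d_q
    keys:  encoder outputs, list of T vectors of length d_k
    W_q:   d_a x d_q matrix, W_k: d_a x d_k matrix, v: vector of length d_a
    mask:  optional list of T booleans, False for padded positions
    Returns (context vector of length d_k, attention weights of length T).
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    q_proj = matvec(W_q, query)
    scores = []
    for t, h in enumerate(keys):
        k_proj = matvec(W_k, h)
        # score_t = v^T tanh(W_q s + W_k h_t)
        score = sum(vi * math.tanh(a + b) for vi, a, b in zip(v, q_proj, k_proj))
        if mask is not None and not mask[t]:
            score = -math.inf  # padded positions get zero weight after softmax
        scores.append(score)

    # Softmax over the time axis (the T encoder positions), not the feature axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder outputs.
    d_k = len(keys[0])
    context = [sum(w * h[i] for w, h in zip(weights, keys)) for i in range(d_k)]
    return context, weights
```

The weights returned for one decoder step should sum to 1 across the encoder positions, which makes for a quick sanity check on a real implementation.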
Edit 1 - I tried applying LayerNormalization before the attention, after the attention, and both before and after the attention. The results were approximately the same in each case. This time, the loss went from approximately 8.0 to 2.1 in 5 epochs and then again barely improved. Most of the learning happened early: the loss reached approximately 2.6 at the end of the first epoch, reached 2.1 in the next epoch, and then plateaued.
Still, the model without any attention outperforms the one with both attention and LayerNormalization. What could be the reason for this? Are the results that I got even possible? How can a normal encoder-decoder network without any kind of normalization and without any dropout layer perform better than the model with both attention and LayerNormalization?
Edit 2 - I tried increasing the data (7 times as much as before). This time, the performance of both models improved a lot, but the model without attention still performed better than the model with attention. Why is this happening?
Edit 3 - I tried to debug the model by first training on just one sample from the training dataset. The loss started at approximately 9.0 and decreased, converging to 0. Then I tried 2 samples: the loss again started at approximately 9.0, but this time it wandered between 1.5 and 2.0 for the first 400 epochs and then decreased slowly. This is a plot of how the loss decreased when I trained it with just 2 samples:
This is a plot of how the loss decreased when I trained it with just 1 sample:
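The check used in Edit 3, training on a handful of samples and seeing whether the loss can be driven close to zero, can be sketched as below. A toy linear softmax classifier stands in for the seq2seq model here, so this only illustrates the procedure, not the actual network. `train_tiny` and `can_memorize` are hypothetical names, and the threshold of 0.05 is an arbitrary assumption.

```python
import math


def softmax(z):
    top = max(z)
    e = [math.exp(x - top) for x in z]
    s = sum(e)
    return [x / s for x in e]


def train_tiny(samples, n_classes, n_features, lr=0.5, epochs=200):
    """Fit a linear softmax classifier on a handful of samples with SGD.

    Returns the mean cross-entropy loss recorded in each epoch.
    """
    W = [[0.0] * n_features for _ in range(n_classes)]
    history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in samples:
            logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
            p = softmax(logits)
            epoch_loss += -math.log(p[y])
            # Gradient of cross-entropy w.r.t. the logits is (p - onehot(y)).
            for c in range(n_classes):
                g = p[c] - (1.0 if c == y else 0.0)
                for i in range(n_features):
                    W[c][i] -= lr * g * x[i]
        history.append(epoch_loss / len(samples))
    return history


def can_memorize(history, threshold=0.05):
    """True if the loss fell below the threshold at some point in training."""
    return min(history) < threshold


# Two separable samples: (features, class label).
samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
history = train_tiny(samples, n_classes=2, n_features=2)
memorized = can_memorize(history)
```

A model with enough capacity is generally expected to memorize one or two samples quickly, so a long plateau on 2 samples usually suggests a bug in the model or the data pipeline rather than a tuning problem.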
Thank you, everyone, for the help. It was an implementation issue. Fixing it makes the attention model perform better than the normal encoder-decoder model!