I am training a Transformer. In many of my setups I obtain validation and training loss that look like this:
Then, I understand that I should stop training at around epoch 1. But then the training loss is very high. Is this a problem? Does the value of training loss actually mean anything?
Thanks
You are describing overfitting: Your model's expressive power is too strong and it is memorizing the training data, rather than learning useful representations that can generalize to the validation data.
To mitigate this issue, you should apply stronger regularization to your model to prevent it from memorizing and steer it towards useful representations.
regularization methods include (but are not limited to):