What happens if optimal training loss is too high


I am training a Transformer. In many of my setups I obtain validation and training loss that look like this:

[Figure: Training and validation loss for my dataset]

From this, I understand that I should stop training at around epoch 1, where the validation loss is lowest. But at that point the training loss is still very high. Is this a problem? Does the value of the training loss actually mean anything?

Thanks


There are 2 answers

3
Shai On

You are describing overfitting: your model has more capacity than the data can constrain, so it memorizes the training data rather than learning useful representations that generalize to the validation data.

To mitigate this, apply stronger regularization to your model to keep it from memorizing and to steer it towards useful representations.
Regularization methods include (but are not limited to):

  1. Input augmentations
  2. Dropout
  3. Early stopping
  4. Weight decay
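
As an illustration of the early-stopping item in the list above, here is a minimal sketch in plain Python. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are all made up for this example; they are not taken from the question or from any particular library. In a real training loop you would also save a checkpoint whenever the validation loss improves, so you can restore the weights from the best epoch.

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving.

    Hypothetical helper for illustration only, not a library API.
    """

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest decrease that counts as an improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, shaped roughly like an overfitting curve:
# the minimum is at epoch 1 and the loss rises afterwards.
val_losses = [2.10, 1.85, 1.92, 2.05, 2.30, 2.60]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, "
      f"best val loss: {stopper.best_loss}, "
      f"stopped at epoch: {stopped_at}")
```

The `patience` setting is a trade-off: a small value stops quickly but can react to noise in the validation curve, while a larger value wastes a few epochs but is less likely to stop on a temporary bump.
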
1
Eran H. On

Regarding your first question: a high training loss is not necessarily a problem, since there is no universal threshold for what counts as "high". It depends on your dataset, your actual test metrics and your business goals.

More specifically, these are the problems with reading too much into the value of the training loss:

  1. The number isn't intuitive, because the loss is chosen to suit gradient descent (i.e. it is a differentiable function, usually a log-based one such as cross-entropy) rather than to be easy to interpret. You probably have intuitive business metrics (e.g., precision, recall) oriented towards your end goal, and those are what you should use to decide whether your model is good or not.

  2. Your training loss is calculated on the training dataset, so a low value does not necessarily indicate a good model, as the overfitted model you posted shows. You shouldn't use this number to judge the quality of the model.

  3. It depends on what you are trying to achieve. Is 80% accuracy high or low?
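
To make the first point above concrete, here is a small sketch in plain Python. It assumes a binary classification task, a 0.5 decision threshold, and two sets of made-up predicted probabilities; none of these come from the question, and the function names are just for this example. Both sets of predictions classify every example correctly, so precision and recall are identical, yet their cross-entropy losses differ considerably.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy (natural log) over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def precision_recall(y_true, p_pred, threshold=0.5):
    """Precision and recall for the positive class at a given threshold."""
    preds = [1 if p >= threshold else 0 for p in p_pred]
    tp = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, preds) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0]
confident = [0.95, 0.05, 0.90, 0.85, 0.10, 0.05]  # correct and confident
hesitant = [0.60, 0.40, 0.55, 0.65, 0.45, 0.35]   # correct but close to 0.5

loss_confident = binary_cross_entropy(y_true, confident)
loss_hesitant = binary_cross_entropy(y_true, hesitant)

print("confident:", round(loss_confident, 3), precision_recall(y_true, confident))
print("hesitant: ", round(loss_hesitant, 3), precision_recall(y_true, hesitant))
```

The loss still carries information (it measures how confident the correct predictions are), but on its own it does not tell you whether the model is good enough for your goal.
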

Regarding your second question: technically, the higher the number, the worse the model fits the training data, so you should generally try to lower it (while keeping overfitting in mind). The loss is most useful comparatively: you can say that one model has a higher loss than another, and then try multiple hyperparameters (e.g., dropout, different optimizers) to lower the validation loss at the point where it starts to diverge from the training loss.
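
One way to do that comparison is sketched below in plain Python. The run names and validation losses are invented for illustration, and `best_validation_point` is a hypothetical helper: for each run it finds the epoch with the lowest validation loss (the point just before the curves diverge), and the runs are then ranked by that value.

```python
def best_validation_point(val_losses):
    """Return (epoch, loss) at the minimum of a validation-loss curve."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_epoch, val_losses[best_epoch]


# Made-up validation curves for three hypothetical dropout settings.
runs = {
    "dropout=0.1": [2.10, 1.85, 1.92, 2.05, 2.30],
    "dropout=0.3": [2.20, 1.90, 1.78, 1.80, 1.95],
    "dropout=0.5": [2.40, 2.15, 1.98, 1.90, 1.91],
}

summary = {name: best_validation_point(losses) for name, losses in runs.items()}
best_run = min(summary, key=lambda name: summary[name][1])

for name, (epoch, loss) in summary.items():
    print(f"{name}: best val loss {loss} at epoch {epoch}")
print("lowest validation minimum:", best_run)
```

If the differences between runs are small, it is worth repeating each configuration with a few random seeds before concluding that one setting is better.
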