Detectron2: Log training and validation loss


I want to train a detectron2 model in AzureML, where one can log metrics. By default, detectron2 logs losses (total loss, classifier loss, bounding box loss, etc.), but I do not fully understand which loss this is (training or validation) and how overfitting is prevented (does it keep the weights that achieved the lowest validation loss?). To better understand the training process, I want to log both the training and the validation loss in AzureML, but I am unsure whether I am doing it the correct way. I read that one can create a hook for this (from https://github.com/facebookresearch/detectron2/issues/810), although I'm not sure what that exactly entails (I'm new to detectron2). Currently I have something like this:

# After setting up the cfg

import torch

from detectron2.engine import DefaultTrainer, HookBase
from detectron2.data import build_detection_train_loader
import detectron2.utils.comm as comm

class TrainingLoss(HookBase):
    # After each step, computes the loss on one batch from the training set.
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        # keep the training set as the data source for this hook
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            # average across workers and prefix the keys as training losses
            loss_dict_reduced = {"train_" + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_train_loss=losses_reduced,
                                                 **loss_dict_reduced)

            print(f"Training Loss (Iteration {self.trainer.iter}): {losses_reduced}")

class ValidationLoss(HookBase):
    # After each step, computes the loss on one batch from the validation set.
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        # point the "train" loader at the test/validation set
        self.cfg.DATASETS.TRAIN = cfg.DATASETS.TEST
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            # average across workers and prefix the keys as validation losses
            loss_dict_reduced = {"val_" + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_val_loss=losses_reduced,
                                                 **loss_dict_reduced)

            print(f"Validation Loss (Iteration {self.trainer.iter}): {losses_reduced}")



trainer = DefaultTrainer(cfg)
val_loss = ValidationLoss(cfg)
train_loss = TrainingLoss(cfg)
trainer.register_hooks([val_loss])
trainer.register_hooks([train_loss])
trainer.resume_or_load(resume=False)
trainer.train()

During training, it prints something like the following:

[Screenshot of the console output showing the printed losses during training.]

The numbers in the screenshot may be hard to read, but whatever I print (training or validation loss), it does not match what detectron2 logs by default. For example, the total_loss I calculate for iteration 99 is 1.936, whereas detectron2 logs 2.097. I know that number is my calculated training loss, but the validation loss I calculate is also a bit off.

Does anyone know how one should properly log these metrics? How does detectron2 actually calculate the loss it logs? And does it save the weights that achieved the lowest validation loss, or simply the weights at the end of the final iteration?


1 Answer

Answered by Prayag Pawar:

To log the training and validation losses correctly, you may not need separate hooks for training and validation. Instead, you can merge the existing hooks into a single one that handles both:

import torch

from detectron2.engine import DefaultTrainer, HookBase
from detectron2.data import build_detection_train_loader
import detectron2.utils.comm as comm

class LossHook(HookBase):
    def __init__(self, cfg, is_validation=False):
        super().__init__()
        self.cfg = cfg.clone()
        self.cfg.DATASETS.TRAIN = self.cfg.DATASETS.TEST if is_validation else self.cfg.DATASETS.TRAIN
        self._loader = iter(build_detection_train_loader(self.cfg))
        self.loss_prefix = "val_" if is_validation else "train_"

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            loss_dict_reduced = {self.loss_prefix + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss_dict_reduced.values())
            if comm.is_main_process():
                # store the total under a prefixed key so the train/val totals
                # do not overwrite each other or detectron2's own "total_loss"
                self.trainer.storage.put_scalars(
                    **{self.loss_prefix + "total_loss": losses_reduced},
                    **loss_dict_reduced)

            print(f"{self.loss_prefix.capitalize()}Loss (Iteration {self.trainer.iter}): {losses_reduced}")

# training code
trainer = DefaultTrainer(cfg)
train_loss_hook = LossHook(cfg, is_validation=False)
val_loss_hook = LossHook(cfg, is_validation=True)
trainer.register_hooks([train_loss_hook, val_loss_hook])
trainer.resume_or_load(resume=False)
trainer.train()
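
Since the goal is to see these curves in AzureML: Azure ML jobs support MLflow tracking, so one way to surface the same values as run metrics is to forward them with mlflow.log_metric from inside the hook. This is only a minimal sketch, assuming MLflow is available in the run environment; the MLflowLossHook class below is illustrative, not something detectron2 provides.

import mlflow

class MLflowLossHook(LossHook):
    # Same loss computation as LossHook above, but additionally pushes the
    # stored scalars to MLflow so they appear as metrics on the AzureML run.
    def after_step(self):
        super().after_step()
        if comm.is_main_process():
            # EventStorage.latest() maps metric name -> (latest value, iteration)
            for name, (value, it) in self.trainer.storage.latest().items():
                if name.startswith(self.loss_prefix):
                    mlflow.log_metric(name, value, step=it)

In practice you would probably only log every N iterations to keep the overhead down. Regarding the checkpoint question: DefaultTrainer only saves checkpoints periodically (every cfg.SOLVER.CHECKPOINT_PERIOD iterations) and at the final iteration; it does not track validation loss. If you want to keep the weights with the lowest validation loss, detectron2 also ships a BestCheckpointer hook (detectron2.engine.hooks.BestCheckpointer) that watches a stored metric and saves whenever it improves, e.g. BestCheckpointer(eval_period, trainer.checkpointer, "val_total_loss", mode="min"), where "val_total_loss" is the key written by the validation hook above.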