I have two networks in sequence that perform an expensive computation.
The loss objective for both is the same, except that for the second network's loss I want to apply a mask.
How can I achieve this without using retain_graph=True?
# tenc - network1
# unet - network2
# the workflow is input -> tenc -> hidden_state -> unet -> output
import torch

# one optimizer with a parameter group per network
# (per-group lr/weight_decay override the optimizer defaults)
params = [
    {'params': tenc.parameters(), 'weight_decay': 1e-3, 'lr': 1e-07},
    {'params': unet.parameters(), 'weight_decay': 1e-2, 'lr': 1e-06},
]
optimizer = torch.optim.AdamW(params, lr=1, betas=(0.9, 0.99), eps=1e-07, fused=True, foreach=False)
scheduler = custom_scheduler(optimizer=optimizer, warmup_steps=30, exponent=5, random=False)  # custom_scheduler is my own
scaler = torch.cuda.amp.GradScaler()

# per-element loss so it can be reduced differently for each network
loss = torch.nn.functional.mse_loss(model_pred, target, reduction='none')
loss_tenc = loss.mean()
loss_unet = (loss * mask).mean()

# two backward calls over the same graph, hence retain_graph=True on the first
scaler.scale(loss_tenc).backward(retain_graph=True)
scaler.scale(loss_unet).backward()

scaler.unscale_(optimizer)
scaler.step(optimizer)
scaler.update()
scheduler.step()
optimizer.zero_grad(set_to_none=True)
loss_tenc should only optimize the tenc parameters, and loss_unet only the unet parameters. I could use two separate optimizers if necessary, but I grouped them into one here for simplicity; a rough two-optimizer variant is sketched below.
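For reference, a minimal sketch of that two-optimizer variant, reusing the hyperparameters from the snippet above (opt_tenc and opt_unet are names introduced here for illustration, not part of my current code):

# hypothetical two-optimizer split: each network gets its own AdamW,
# so each loss can be stepped against only its own parameters
opt_tenc = torch.optim.AdamW(tenc.parameters(), lr=1e-07, weight_decay=1e-3,
                             betas=(0.9, 0.99), eps=1e-07)
opt_unet = torch.optim.AdamW(unet.parameters(), lr=1e-06, weight_decay=1e-2,
                             betas=(0.9, 0.99), eps=1e-07)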
Since both loss terms are computed from the same model_pred, you can backpropagate a single time by summing them together:
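A minimal sketch of that single-backward version, assuming the rest of the setup (scaler, optimizer, scheduler, model_pred, target, mask) stays as in your snippet:

# per-element loss, reduced two ways as before
loss = torch.nn.functional.mse_loss(model_pred, target, reduction='none')
loss_tenc = loss.mean()
loss_unet = (loss * mask).mean()

# sum the two objectives into one scalar and backpropagate once;
# the graph is traversed a single time, so retain_graph is not needed
total_loss = loss_tenc + loss_unet
scaler.scale(total_loss).backward()

scaler.unscale_(optimizer)
scaler.step(optimizer)
scaler.update()
scheduler.step()
optimizer.zero_grad(set_to_none=True)

Because the summed loss is a single scalar, autograd frees the graph after this one backward call.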