I am training an FCN model and have two dataloaders, train_loader and val_loader. As you can see in the code below, I made the model train on the validation data. I did this to debug a problem where switching between the two dataloaders caused the iteration time to increase roughly tenfold compared to the first loop. I obviously can't train the model on the validation data, but why does it behave like this?
The dataset is loaded in another class as a ConcatDataset that merges several ImageFolders, and is turned into dataloaders with batch_size=32, num_workers=os.cpu_count(), persistent_workers=True, pin_memory=True.
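Roughly, the loaders are built like this (a simplified sketch of my Datasets class; the paths, transforms, and the exact split logic are placeholders):

```python
import os
from torch.utils.data import ConcatDataset, DataLoader, random_split
from torchvision import datasets, transforms

# Simplified sketch: merge several ImageFolders and split 0.8/0.1/0.1
tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
folders = [datasets.ImageFolder(p, transform=tf) for p in ("<folder_a>", "<folder_b>")]
full = ConcatDataset(folders)

n = len(full)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(full, [n_train, n_val, n - n_train - n_val])

loader_args = dict(
    batch_size=32,
    num_workers=os.cpu_count(),
    persistent_workers=True,
    pin_memory=True,
)
train_loader = DataLoader(train_set, shuffle=True, **loader_args)
val_loader = DataLoader(val_set, shuffle=False, **loader_args)
```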
This is my code:
import torch
# (imports for FCN_resnet50, universal_fake_detect and LEARNING_RATE omitted, as in my original script)

if __name__ == "__main__":
    from multiprocessing import freeze_support
    freeze_support()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = FCN_resnet50().to(device)
    loss_fn = torch.nn.MSELoss().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    datasets = universal_fake_detect.Datasets("<Path to data>", (0.8, 0.1, 0.1))
    train_loader = datasets.training()
    val_loader = datasets.validation()
    # tb_writer = SummaryWriter()

    best_eval_loss = float("inf")
    for epoch_index in range(10):
        # Training loop
        model.train(True)
        running_loss = 0.0
        train_loss = 0.0
        for i, (inp, lab) in enumerate(train_loader):
            print("Training iteration:", i)
            # Expand the scalar label to a 1x224x224 target map for the FCN output
            lab = lab.view(-1, 1, 1, 1).expand(-1, 1, 224, 224).float()
            inputs, labels = inp.to(device), lab.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if i % 1000 == 9:
                print(" batch {} loss: {}".format(i + 1, running_loss / (i + 1)))
                tb_x = epoch_index * len(train_loader) + i + 1
        train_loss = running_loss / len(train_loader)
        print("Training loss", train_loss)

        # Evaluation (intentionally still calling backward()/step() here, i.e.
        # "training" on the validation data, to debug the dataloader slowdown)
        model.eval()
        running_loss = 0.0
        for i, (inp, lab) in enumerate(val_loader):
            print("val iteration:", i)
            lab = lab.view(-1, 1, 1, 1).expand(-1, 1, 224, 224).float()
            inputs, labels = inp.to(device), lab.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        avg_val_loss = running_loss / len(val_loader)
        print("Validation loss:", avg_val_loss)

        if avg_val_loss < best_eval_loss:
            best_eval_loss = avg_val_loss
            torch.save(model.state_dict(), "../../model/FCN_test_model.pth")

    print("finished training")
I have tried a lot of different settings for the dataloader parameters, but nothing seems to help.
I see two causes for your observation:
(1) If you are using image augmentation, it is usually active only for the training data, not for the validation data. If that is the case here, your training loader is likely slower than the validation loader because its workers do the extra augmentation work, and that alone would make "training" on the validation set faster.
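For example, a common setup applies random augmentation only to the training split, which makes each training batch more expensive to produce (torchvision-style sketch; your actual transforms may differ):

```python
from torchvision import transforms

# Training transforms: random augmentation means extra CPU work per image in every worker
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
])

# Validation transforms: deterministic and cheaper, so the loader keeps up more easily
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```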
If possible, increase the number of workers so that your training loader becomes faster. Check your GPU utilization (nvidia-smi command): if utilization is below roughly 90%, the loaders are too slow (or too few) for your GPU once augmentation slows the workers down.

(2) To state the obvious: one epoch takes a tenth of the time if the validation data is only a tenth of the training data(!)
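To check point (1) independently of the GPU, you can time a loader in isolation and compare it against your full iteration time; something like:

```python
import time
from itertools import islice

def time_loader(loader, num_batches=50):
    """Measure how fast a DataLoader alone produces batches (no GPU work involved)."""
    start = time.perf_counter()
    n = 0
    for _ in islice(loader, num_batches):
        n += 1
    elapsed = time.perf_counter() - start
    print(f"{n} batches in {elapsed:.1f}s ({n / elapsed:.2f} batches/s)")

time_loader(train_loader)  # if this is much slower than val_loader, the workers are the bottleneck
time_loader(val_loader)
```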