I am new to many things in deep learning and distributed training. I have defined a function to set up distributed training:
import os
import torch

def distributed_training_init(model, backend='nccl', sync_bn=False):
    if sync_bn:
        # Convert BatchNorm layers to SyncBatchNorm so the statistics
        # are synchronized across processes.
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    gpu = int(os.environ['LOCAL_RANK'])
    print(rank, world_size, gpu)
    torch.distributed.init_process_group(backend, world_size=world_size,
                                         rank=rank, init_method='env://')
    print('gpu', gpu)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[gpu], output_device=gpu,
        find_unused_parameters=True)
    return model
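For context, the RANK, WORLD_SIZE and LOCAL_RANK environment variables that the function reads are set by the launcher; with torchrun on a single machine with two GPUs the launch looks roughly like this (train.py is a placeholder for the actual script name):

torchrun --nproc_per_node=2 train.py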
I followed the wandb example for the function above. For the training and validation datasets I also followed the example:
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
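The sampler is then passed to the DataLoader, roughly along these lines (a sketch of the usual pattern; batch_size and num_workers are placeholder values):

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=32,          # placeholder value
    shuffle=False,          # shuffling is handled by the DistributedSampler
    sampler=train_sampler,
    num_workers=4,          # placeholder value
    pin_memory=True,
)
# At the start of each epoch: train_sampler.set_epoch(epoch)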
and the model initialization is as follows:
gpu = int(os.environ['LOCAL_RANK'])
device = torch.device(f'cuda:{gpu}')
torch.cuda.set_device(device)
model = MYModel(...).to(device) # MYModel() is the class defining the model
model = distributed_training_init(model)
and the wandb initialization is:
run = wandb.init(project="m_project", name="experiment_1", config=config, save_code=True)
- My question is: why does wandb create two different run folders with two models when I train on two GPUs?
- This is not the correct setup for my experiments. Shouldn't there be only one trained model when training is distributed across two GPUs?
Thanks