Why does wandb create two run folders when distributed training is carried out across 2 GPUs?


I am new to many things in deep learning and distributed training. I have defined a function to set up distributed training:

import os
import torch

def distributed_training_init(model, backend='nccl', sync_bn=False):
    if sync_bn:
        # Synchronize BatchNorm statistics across all processes
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # These variables are set in the environment by the launcher
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    gpu = int(os.environ['LOCAL_RANK'])

    print(rank, world_size, gpu)
    torch.distributed.init_process_group(backend, world_size=world_size,
                                         rank=rank, init_method='env://')

    print('gpu', gpu)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[gpu], output_device=gpu, find_unused_parameters=True)
    return model
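
For reference, a launcher such as torchrun sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables read above and spawns one process per GPU. A typical invocation for 2 GPUs looks like this (train.py is a placeholder for the actual script name):

torchrun --nproc_per_node=2 train.py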

For the function above, I followed the example in the wandb documentation, and for the training and validation datasets I also followed that example:

train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
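
The sampler is then wired into the data loader roughly as follows (a sketch, not my exact code; the batch size and num_epochs are placeholders). Each process draws a distinct shard of the dataset, and set_epoch reshuffles the shards every epoch:

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=32, sampler=train_sampler)

for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)  # reshuffle shards for this epoch
    for batch in train_loader:
        ...  # usual forward/backward pass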

and the model initialization is as follows:

# Pin this process to its assigned GPU before building the model
gpu = int(os.environ['LOCAL_RANK'])
device = torch.device(f'cuda:{gpu}')
torch.cuda.set_device(device)
model = MYModel(...).to(device)  # MYModel() is the class defining the model

model = distributed_training_init(model)

and the wandb initialization is:

run = wandb.init(project="m_project", name="experiment_1", config=config, save_code=True)
  • My question is: why does wandb create two different run folders with two models when I train on two GPUs?
  • This is not the correct setting for my experiments. Shouldn't there be only one trained model when training is distributed across two GPUs? (See the sketch below for what I suspect is happening.)
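
My suspicion is that the launcher starts one process per GPU, so wandb.init is executed in each process and each call creates its own run. A pattern I have seen suggested (a sketch, untested in my setup) is to initialize wandb only on the rank-0 process:

import os
import wandb

rank = int(os.environ['RANK'])

if rank == 0:
    # Only rank 0 creates a wandb run; the other process logs nothing
    run = wandb.init(project="m_project", name="experiment_1",
                     config=config, save_code=True)
else:
    run = None

Is this the right way to get a single run (and a single trained model) across both GPUs?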

Thanks


There are 0 answers