I have a single dataloader feeding data to 4 models, each with a different hyperparameter, loaded on a separate GPU. I want to reduce the bottleneck caused by data loading, so I intend to load the same batch prepared by the dataloader onto all GPUs so that each model can compute and perform a backprop step on it. I already cache the data into RAM to avoid disk bottlenecks when the dataloader is instantiated.
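For reference, a minimal placeholder version of the setup (the model, data, and hyperparameters below are stand-ins, not my real code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

N_GPUS = 4
learning_rates = [1e-3, 3e-4, 1e-4, 3e-5]   # one hyperparameter per model

def make_model():
    # Placeholder architecture; the real model is irrelevant to the question.
    return nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

models = [make_model().to(f"cuda:{i}") for i in range(N_GPUS)]
optimizers = [torch.optim.SGD(m.parameters(), lr=lr)
              for m, lr in zip(models, learning_rates)]

# Data is already cached in RAM as tensors, so the loader is not disk-bound.
x_cached = torch.randn(10_000, 128)
y_cached = torch.randint(0, 10, (10_000,))
loader = DataLoader(TensorDataset(x_cached, y_cached),
                    batch_size=256, num_workers=4, pin_memory=True)
```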
I am trying to:
- Send/broadcast the same batch of data to N GPUs. I assume this is only possible if we can sync/wait for all GPUs to finish their ops on one batch before proceeding to the next one (see the sketch after this list).
- Bonus: prefetching the next batch as soon as one batch is ready (up to P batches ahead) could help ensure a continuous flow of data to the GPUs and avoid the wait.
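Concretely, the per-batch loop I have in mind looks something like the sketch below (the per-device copy streams are my assumption about how to overlap the copies; `models`, `optimizers`, and `loader` are the placeholders from the setup above). The same pinned CPU batch is copied to every GPU with `non_blocking=True`, each model trains on its own copy, and `torch.cuda.synchronize` acts as the per-batch barrier:

```python
import torch
import torch.nn.functional as F

# One side stream per device, used only for host-to-device copies.
copy_streams = [torch.cuda.Stream(device=i) for i in range(N_GPUS)]

for x_cpu, y_cpu in loader:          # loader uses pin_memory=True
    # 1. Broadcast: enqueue the same CPU batch onto every GPU.
    batches = []
    for i, stream in enumerate(copy_streams):
        with torch.cuda.stream(stream):
            batches.append((x_cpu.to(f"cuda:{i}", non_blocking=True),
                            y_cpu.to(f"cuda:{i}", non_blocking=True)))

    # 2. Each model trains on its own copy of the batch. Kernels are launched
    #    sequentially from this one Python thread but run asynchronously,
    #    so the GPUs should overlap.
    for i, (model, opt) in enumerate(zip(models, optimizers)):
        torch.cuda.current_stream(i).wait_stream(copy_streams[i])
        x, y = batches[i]
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()

    # 3. Sync: wait for every GPU to finish this batch before the next one.
    for i in range(N_GPUS):
        torch.cuda.synchronize(i)
```

For the bonus point, I assume the DataLoader's own `num_workers`/`prefetch_factor` already keeps a few batches in flight on the CPU side, but I'm not sure that is enough when all the copies and the four training steps are issued from a single process.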
I am not trying to achieve:
- Data Parallelism - split a large batch into N parts and compute each part on one GPU
- Model Parallelism - split the computation of a large model (that won't fit on one GPU) into N (or fewer) parts and place each part on one GPU
Similar questions:
- This one is about making a Conv2D operation span across multiple GPUs
- This one is about executing different GPU computations in parallel, but I don't know if my problem can be solved with torch.cuda.Stream()
- This one is about loading different models, but it does not deal with sharing the same batch.
- This one is exactly about what I'm asking, but it's CUDA/PCIe and from 7 years ago.
Update:
I found a very similar question on the PyTorch discuss forum with a small example at the end that runs the forward prop via multiprocessing, but I'm wondering how to scale this approach to torch dataloaders.
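To make the question more concrete, this is the kind of scaling I am imagining but have not verified: one process owns the DataLoader and pushes every CPU batch into a per-GPU queue (whose `maxsize` would play the role of the prefetch depth P), while each worker process owns one model/GPU and trains on every batch it receives. The queue-based hand-off and all names below are my assumptions, not taken from the linked example:

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def worker(rank, lr, queue):
    device = f"cuda:{rank}"
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    while True:
        item = queue.get()
        if item is None:                      # sentinel: no more batches
            break
        x, y = item
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    n_gpus = 4
    lrs = [1e-3, 3e-4, 1e-4, 3e-5]
    # maxsize acts as the prefetch depth P per GPU.
    queues = [mp.Queue(maxsize=4) for _ in range(n_gpus)]
    procs = [mp.Process(target=worker, args=(i, lrs[i], queues[i]))
             for i in range(n_gpus)]
    for p in procs:
        p.start()

    x_cached = torch.randn(10_000, 128)
    y_cached = torch.randint(0, 10, (10_000,))
    loader = DataLoader(TensorDataset(x_cached, y_cached),
                        batch_size=256, num_workers=2)

    for batch in loader:
        for q in queues:                      # every GPU gets the same batch
            q.put(batch)
    for q in queues:
        q.put(None)
    for p in procs:
        p.join()
```

Is something along these lines the right way to combine a single torch DataLoader with N independent training processes, or is there a more idiomatic mechanism for broadcasting the same batch to several GPUs?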