I am training a Siamese neural network with PyTorch on a very large dataset. Loading the data is the biggest bottleneck, and the dataset doesn't fit in RAM, so I can't simply cache all of it to speed things up.
What I would like to do is cache part of the data and reuse it within the same epoch to speed up training. Would it be possible to have some kind of double-ended queue to sample from, where I append elements as I read them and remove them after they have been included in training a few times?
Unfortunately, none of the standard utilities in torchdata or torch.utils.data.Dataset seem to allow this. It's either caching a complete epoch of data or none at all.
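To make the idea concrete, here is a rough sketch of the behaviour I'm after (just a sketch; the function name, cache size, and reuse count are arbitrary placeholders):

```python
from collections import deque

CACHE_SIZE = 1000   # number of samples to keep in RAM (placeholder)
MAX_REUSES = 3      # total number of times each sample is used (placeholder)

cache = deque()     # entries are (sample, remaining_uses)

def next_sample(reader):
    """Serve from the cache once it is full, otherwise do an expensive read."""
    if len(cache) >= CACHE_SIZE:
        sample, remaining = cache.popleft()
        if remaining > 1:
            cache.append((sample, remaining - 1))  # keep it around for more reuses
        return sample
    sample = next(reader)                   # slow read from the big dataset
    cache.append((sample, MAX_REUSES - 1))  # it is being used once right now
    return sample
```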
I think reusing the same sample multiple times within one epoch can make training messy; it's better to create a data generator that uses each sample only once per epoch.
If you do want to use each sample several times in one epoch, I hope this little example helps (set batch_size = 1):
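Something along these lines: a minimal sketch using a torch.utils.data.IterableDataset that reads each sample once and yields it several times before moving on (the load_sample method and the num_samples/repeats values are placeholders you would replace with your own loading code):

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class ReuseDataset(IterableDataset):
    """Stream samples from a slow source and yield each one `repeats`
    times before dropping it, so each epoch needs fewer disk reads."""

    def __init__(self, num_samples, repeats=3):
        self.num_samples = num_samples
        self.repeats = repeats

    def load_sample(self, idx):
        # Placeholder for the expensive read (disk I/O, decoding, preprocessing)
        return torch.randn(128), torch.randn(128)  # e.g. a pair for a Siamese net

    def __iter__(self):
        for idx in range(self.num_samples):
            sample = self.load_sample(idx)   # read from disk once...
            for _ in range(self.repeats):    # ...train on it several times
                yield sample

loader = DataLoader(ReuseDataset(num_samples=10, repeats=3), batch_size=1)
for x1, x2 in loader:
    pass  # forward/backward pass of the Siamese network goes here
```

Yielding the same sample back to back is the simplest version; in practice you would probably add a small shuffle buffer so the repeats are spread out over the epoch.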
I couldn't test the code, but it should give you the general idea.