I'm trying to train a deep learning model in PyTorch on images that have been bucketed to particular dimensions. I'd like to train my model using mini-batches, but the mini-batch size does not neatly divide the number of examples in each bucket.
One solution I saw in a previous post was to pad the images with additional whitespace (either on the fly or all at once at the beginning of training), but I do not want to do this. Instead, I would like to allow the batch size to be flexible during training.
Specifically, if N is the number of images in a bucket and B is the batch size, then for that bucket I would like to get N // B batches if B divides N, and N // B + 1 batches otherwise. The last batch can have fewer than B examples.
As an example, suppose I have indexes [0, 1, ..., 19], inclusive and I'd like to use a batch size of 3.
The indexes [0, 9] correspond to images in bucket 0 (shape (C, W1, H1))
The indexes [10, 19] correspond to images in bucket 1 (shape (C, W2, H2))
(The channel depth is the same for all images). Then an acceptable partitioning of the indexes would be
batches = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
    [9],
    [10, 11, 12],
    [13, 14, 15],
    [16, 17, 18],
    [19]
]
I would prefer to process the images indexed at 9 and 19 separately because they have different dimensions.
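The partitioning above is just ceil-division chunking within each bucket. As a quick sketch, `partition` here is a hypothetical helper (not part of the question's code) that reproduces the example:

```python
# Hypothetical helper: chunk one bucket's indexes into batches of at
# most `batch_size`, keeping the smaller remainder batch at the end.
def partition(indices, batch_size):
    return [indices[i:i + batch_size]
            for i in range(0, len(indices), batch_size)]

# Bucket 0 holds indexes 0-9, bucket 1 holds indexes 10-19.
batches = partition(list(range(10)), 3) + partition(list(range(10, 20)), 3)
```

Each bucket of 10 indexes yields 10 // 3 + 1 = 4 batches, matching the example.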
Looking through PyTorch's documentation, I found the BatchSampler class, which generates lists of mini-batch indexes. I made a custom Sampler class that emulates the partitioning of indexes described above. If it helps, here's my implementation:
import random
from collections import defaultdict

from torch.utils.data import Sampler

class CustomSampler(Sampler):
    def __init__(self, dataset, batch_size):
        self.batch_size = batch_size
        self.buckets = self._get_buckets(dataset)
        self.num_examples = len(dataset)

    def __iter__(self):
        batch = []
        # Process buckets in random order
        dims = random.sample(list(self.buckets), len(self.buckets))
        for dim in dims:
            # Process images within each bucket in random order
            bucket = self.buckets[dim]
            bucket = random.sample(bucket, len(bucket))
            for idx in bucket:
                batch.append(idx)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
            # Yield the partially filled batch before moving to the next bucket
            if len(batch) > 0:
                yield batch
                batch = []

    def __len__(self):
        return self.num_examples

    def _get_buckets(self, dataset):
        # Group example indexes by image shape
        buckets = defaultdict(list)
        for i in range(len(dataset)):
            img, _ = dataset[i]
            dims = img.shape
            buckets[dims].append(i)
        return buckets
However, when I use my custom Sampler
class I generate the following error:
Traceback (most recent call last):
  File "sampler.py", line 143, in <module>
    for i, batch in enumerate(dataloader):
  File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 263, in __next__
    indices = next(self.sample_iter)  # may raise StopIteration
  File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 139, in __iter__
    batch.append(int(idx))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
The DataLoader class seems to expect to be passed individual indexes, not lists of indexes.
Should I not be using a custom Sampler class for this task? I also considered writing a custom collate_fn to pass to the DataLoader, but with that approach I don't believe I can control which indexes end up in the same mini-batch. Any guidance would be greatly appreciated.
Do you have a separate network for each bucket of samples (a CNN's kernel size has to be fixed)? If so, just pass the above CustomSampler to the batch_sampler argument of the DataLoader class. That will fix the issue.
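To make the fix concrete, here is a minimal end-to-end sketch. It uses a toy dataset of zero tensors standing in for the bucketed images (two buckets of 10 each, as in the question), and a compact version of the sampler above; note that when the sampler is passed as batch_sampler, __len__ should report the number of batches, not examples, since len(loader) reads it:

```python
import random
from collections import defaultdict

import torch
from torch.utils.data import DataLoader, Sampler

class CustomSampler(Sampler):
    def __init__(self, dataset, batch_size):
        self.batch_size = batch_size
        self.buckets = self._get_buckets(dataset)

    def __iter__(self):
        # Visit buckets in random order; shuffle and chunk each bucket,
        # so the leftover batch is yielded before the next bucket starts.
        for dim in random.sample(list(self.buckets), len(self.buckets)):
            bucket = random.sample(self.buckets[dim], len(self.buckets[dim]))
            for i in range(0, len(bucket), self.batch_size):
                yield bucket[i:i + self.batch_size]

    def __len__(self):
        # Number of *batches* (ceil-division per bucket), because
        # DataLoader uses this for len(loader) with a batch_sampler.
        return sum(-(-len(b) // self.batch_size)
                   for b in self.buckets.values())

    def _get_buckets(self, dataset):
        buckets = defaultdict(list)
        for i, (img, _) in enumerate(dataset):
            buckets[img.shape].append(i)
        return buckets

# Toy stand-in for the bucketed image dataset:
# indexes 0-9 have shape (3, 4, 4), indexes 10-19 have shape (3, 8, 8).
data = ([(torch.zeros(3, 4, 4), 0) for _ in range(10)]
        + [(torch.zeros(3, 8, 8), 1) for _ in range(10)])

# The key fix: pass the sampler as batch_sampler, not sampler.
loader = DataLoader(data, batch_sampler=CustomSampler(data, batch_size=3))
batch_sizes = sorted(imgs.shape[0] for imgs, _ in loader)
```

Because a batch never spans two buckets, the default collate_fn can stack each batch into a single tensor; each bucket of 10 yields three batches of 3 and one of 1.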