Training on minibatches of varying size


I'm trying to train a deep learning model in PyTorch on images that have been bucketed to particular dimensions. I'd like to train my model using mini-batches, but the mini-batch size does not neatly divide the number of examples in each bucket.

One solution I saw in a previous post was to pad the images with additional whitespace (either on the fly or all at once at the beginning of training), but I do not want to do this. Instead, I would like to allow the batch size to be flexible during training.

Specifically, if N is the number of images in a bucket and B is the batch size, then for that bucket I would like to get N // B batches if B divides N, and N // B + 1 batches otherwise. The last batch can have fewer than B examples.

As an example, suppose I have indexes [0, 1, ..., 19] (inclusive) and I'd like to use a batch size of 3.

Indexes 0 through 9 correspond to images in bucket 0 (shape (C, W1, H1))
Indexes 10 through 19 correspond to images in bucket 1 (shape (C, W2, H2))

(The channel depth is the same for all images). Then an acceptable partitioning of the indexes would be

batches = [
    [0, 1, 2], 
    [3, 4, 5], 
    [6, 7, 8], 
    [9], 
    [10, 11, 12], 
    [13, 14, 15], 
    [16, 17, 18], 
    [19]
]

I would prefer to process the images indexed at 9 and 19 separately because they have different dimensions.

Looking through PyTorch's documentation, I found the BatchSampler class, which generates lists of mini-batch indexes. I made a custom Sampler class that emulates the partitioning of indexes described above. If it helps, here's my implementation:

import random
from collections import defaultdict

from torch.utils.data import Sampler


class CustomSampler(Sampler):

    def __init__(self, dataset, batch_size):
        self.batch_size = batch_size
        self.buckets = self._get_buckets(dataset)
        self.num_examples = len(dataset)

    def __iter__(self):
        batch = []
        # Process the buckets in random order
        dims = random.sample(list(self.buckets), len(self.buckets))
        for dim in dims:
            # Process the images within each bucket in random order
            bucket = self.buckets[dim]
            bucket = random.sample(bucket, len(bucket))
            for idx in bucket:
                batch.append(idx)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
            # Yield the partially filled batch before moving to the next bucket
            if len(batch) > 0:
                yield batch
                batch = []

    def __len__(self):
        return self.num_examples

    def _get_buckets(self, dataset):
        # Group example indexes by image shape (C, W, H)
        buckets = defaultdict(list)
        for i in range(len(dataset)):
            img, _ = dataset[i]
            dims = img.shape
            buckets[dims].append(i)
        return buckets

However, when I use my custom Sampler class, I get the following error:

Traceback (most recent call last):
    File "sampler.py", line 143, in <module>
        for i, batch in enumerate(dataloader):
    File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 263, in __next__
        indices = next(self.sample_iter)  # may raise StopIteration
    File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 139, in __iter__
        batch.append(int(idx))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The DataLoader class seems to expect to be passed individual indexes, not lists of indexes.
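
For reference, a construction along these lines (an assumption, since the original DataLoader call isn't shown) reproduces this error: passing the sampler through the sampler argument makes DataLoader wrap it in its own BatchSampler, which calls int() on every yielded item and fails when handed a list of indexes.

from torch.utils.data import DataLoader

# Assumed construction (not shown in the original post) that reproduces the
# TypeError: with sampler=..., DataLoader wraps the sampler in a BatchSampler,
# which calls int() on each yielded item and fails on a list of indexes.
dataloader = DataLoader(dataset, batch_size=3,
                        sampler=CustomSampler(dataset, batch_size=3))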

Should I not be using a custom Sampler class for this task? I also considered making a custom collate_fn to pass to the DataLoader, but with that approach I don't believe I can control which indexes are allowed to be in the same mini-batch. Any guidance would be greatly appreciated.


There are 2 answers

Kris (BEST ANSWER)

Do you have two networks, one for each of the sample shapes (a CNN's kernel size has to be fixed)? If yes, just pass the above CustomSampler to the batch_sampler argument of the DataLoader class. That would fix the issue.
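
A minimal sketch of what that fix might look like (the dataset name and batch size are assumptions, not from the original post):

from torch.utils.data import DataLoader

# Passing the sampler as batch_sampler tells DataLoader that each yielded
# item is already a complete list of indexes, so DataLoader does not wrap
# it in its own BatchSampler.
sampler = CustomSampler(dataset, batch_size=3)
dataloader = DataLoader(dataset, batch_sampler=sampler)

for i, (images, labels) in enumerate(dataloader):
    # Every batch comes from a single bucket, so the default collate
    # function can stack the images without any padding.
    ...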

kithri

Hi, since every batch should contain images of the same dimensions, your CustomSampler works just fine; it only needs to be passed to torch.utils.data.DataLoader with the keyword batch_sampler. However, as stated in the docs, do remember this: batch_sampler is mutually exclusive with batch_size, shuffle, sampler, and drop_last.