I am trying to train a retinanet_resnet50_fpn_v2 model in PyTorch, but I am running into a problem with targets of varying size.
I am training on the SKU110K dataset, in which each image contains a different number of bounding boxes: for example, image 1 has 35 boxes, image 2 has 79, and image 3 has 132.
When I create a Dataset and let the default DataLoader collate function batch the images with their corresponding bounding boxes, I get:
stack expects each tensor to be equal size, but got [74, 4] at entry 0 and [128, 4] at entry 1
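For context, each sample from my Dataset looks roughly like this (a simplified sketch; the class name, the single "product" label, and the shapes are just illustrative):

import torch
from torch.utils.data import Dataset

class SKU110KDataset(Dataset):
    def __init__(self, samples):
        # samples: list of (image_tensor, boxes_array) pairs prepared elsewhere
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img, boxes = self.samples[idx]
        return {
            "img": img,  # [3, H, W] float tensor
            "boxes": boxes,  # [N, 4], where N varies from image to image
            "label": torch.ones(len(boxes), dtype=torch.int64),  # SKU110K has a single class
        }

Because "boxes" has a different N for every sample, the default collate function fails when it calls torch.stack on them.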
I created a collate function to pad the bounding boxes so that they were all the same shape, like so:
import torch

def collater(data):
    imgs = [s["img"] for s in data]
    annots = [s["boxes"] for s in data]
    labels = [s["label"] for s in data]

    # Pad every image's boxes with all-zero rows up to the batch maximum
    max_num_annots = max(annot.shape[0] for annot in annots)

    if max_num_annots > 0:
        annot_padded = torch.zeros((len(annots), max_num_annots, 4))
        for idx, annot in enumerate(annots):
            if annot.shape[0] > 0:
                annot_padded[idx, : annot.shape[0], :] = torch.from_numpy(annot)
    else:
        annot_padded = torch.zeros((len(annots), 1, 4))

    return {"img": imgs, "boxes": annot_padded, "labels": labels}
When I feed the resulting batch to the model, I then get: AssertionError: All bounding boxes should have positive height and width. Found invalid box [1.25, 1.25, 1.25, 1.25] for target at index 0. I assume this happens because the all-zero padding rows are degenerate boxes (x1 == x2 and y1 == y2, hence zero width and height, with the coordinates presumably shifted to 1.25 by my transforms), and torchvision's validation rejects them.
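From the torchvision docs, my understanding is that the detection models do not want a padded tensor at all: in training mode they take a list of image tensors plus a list of per-image target dicts, each with its own number of boxes. A minimal sketch with placeholder data:

import torch
from torchvision.models.detection import retinanet_resnet50_fpn_v2

model = retinanet_resnet50_fpn_v2(num_classes=2)  # background + product (assumed class count)
model.train()

# Two images with different numbers of ground-truth boxes (placeholder data)
images = [torch.rand(3, 600, 600), torch.rand(3, 600, 600)]
targets = [
    {
        "boxes": torch.tensor([[10.0, 10.0, 50.0, 50.0]]),  # [N, 4] as (x1, y1, x2, y2)
        "labels": torch.tensor([1], dtype=torch.int64),  # [N]
    },
    {
        "boxes": torch.tensor([[20.0, 20.0, 60.0, 80.0], [30.0, 40.0, 90.0, 100.0]]),
        "labels": torch.tensor([1, 1], dtype=torch.int64),
    },
]

loss_dict = model(images, targets)  # in train mode this returns the loss dict

But I am not sure how to connect this to my Dataset and collate setup.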
What is the correct way to train this network, given that each image can have a different number of bounding boxes (and therefore a different number of targets)?