We need to fine-tune the ViLT model on the UPMC-Food-101 dataset. The pre-trained processor for `ViltForImagesAndTextClassification` has the following call syntax:
```python
encoding = processor([image1, image2], text, return_tensors="pt")
```
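For reference, here is a minimal sketch of a single encoding step; the checkpoint name `dandelin/vilt-b32-mlm` and the image path are assumptions for illustration, not part of my actual setup:

```python
from PIL import Image
from transformers import ViltProcessor

# Assumed checkpoint for illustration; swap in the ViLT checkpoint you fine-tune.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")

image = Image.open("example_food.jpg").convert("RGB")  # hypothetical path
text = "a recipe description for this dish"

# The processor tokenizes the text and preprocesses the image in one call.
encoding = processor(image, text, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, pixel_values, pixel_mask, ...
```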
Initially, I worked with only the image data to fine-tune ViT, and I used the following method:
```python
from datasets import Dataset
from PIL import Image

val_data = {'image': image_file_paths, 'label': multi_hot_labels}
ds_val = Dataset.from_dict(val_data)

def transform(examples):
    # Open each image from its file path; the processor handles resizing and normalization.
    inputs = processor([Image.open(img).convert("RGB") for img in examples["image"]], return_tensors="pt")
    inputs["labels"] = examples["label"]
    return inputs

val_dataset = ds_val.with_transform(transform)
```
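(`with_transform` applies the transform lazily, so the images are only loaded and encoded when a batch is actually accessed.)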
But now I cannot use the `Dataset.from_dict` function, as it doesn't seem to support three lists. Currently, I have a dictionary with the following lists:
```python
val_data = {
    'image': image_file_paths,
    'text': texts_csv_lst,
    'label': labels,
}
```
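What I'm aiming for is something like the sketch below, extending my earlier transform to pass the image and text batches to the processor together. I'm not sure this is the intended usage; `padding=True` and `truncation=True` are my guesses for batching the variable-length texts:

```python
from datasets import Dataset
from PIL import Image

ds_val = Dataset.from_dict(val_data)

def transform(examples):
    images = [Image.open(img).convert("RGB") for img in examples["image"]]
    # Guess: pass parallel lists of images and texts in one call, padding the
    # texts so the batch can be collated into tensors.
    inputs = processor(images, examples["text"], return_tensors="pt",
                       padding=True, truncation=True)
    inputs["labels"] = examples["label"]
    return inputs

val_dataset = ds_val.with_transform(transform)
```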