DINOv2 for classification has wrong number of labels

I am encountering an issue when using the Dinov2ForImageClassification model from the Hugging Face Transformers library, as outlined in the documentation here. Despite following the provided code example and using the latest Transformers version, the resulting model is performing binary classification instead of the expected ImageNet 1000-way classification. Specifically, the length of the logits returned by the model (logits) is 2, whereas it should be 1000 for ImageNet classification.

Here is my code:

from transformers import AutoImageProcessor, Dinov2ForImageClassification
import torch
from datasets import load_dataset

# Load a sample image dataset (in this case, "huggingface/cats-image")
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# Load the image processor and the Dinov2ForImageClassification model
image_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = Dinov2ForImageClassification.from_pretrained("facebook/dinov2-base")

# Prepare the input and obtain logits
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The expected number of labels for ImageNet classification should be 1000
predicted_label = logits.argmax(-1).item()

However, I get the following warning:

Some weights of Dinov2ForImageClassification were not initialized from the model checkpoint at facebook/dinov2-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Additionally, the shape of logits is torch.Size([1, 2]), indicating that the model has only 2 labels instead of the expected 1000 as specified by model.num_labels.

I'm seeking guidance on how to correctly use Dinov2ForImageClassification for ImageNet 1000-way classification as mentioned in the documentation.

There is 1 answer

Timbus Calin On

I tested your code and can reproduce the behavior: the model returns 2 logits. It looks like a problem specific to the DinoV2 checkpoint. I also tried AutoModelForImageClassification instead of the explicit Dinov2ForImageClassification class, but got the same output (2 labels).

I changed the code to load the SwinTransformer model:

from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from datasets import load_dataset

# Load a sample image dataset (in this case, "huggingface/cats-image")
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# Load the image processor and the Swin Transformer classification model
image_processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
model = AutoModelForImageClassification.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# Prepare the input and obtain logits
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits

print(len(outputs[0]))

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
1000

Indeed it prints 1000.

My suggestion is to open a bug report on Hugging Face's official GitHub repository.

However, if you plan to fine-tune the model on your own data, I am pretty sure this will no longer be a problem, since fine-tuning trains the classification head for your own label set. In the meantime, you can opt for another model if it suits your purpose.
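
If a ready-made ImageNet head is what you need, another option may be a DINOv2 checkpoint that ships with a trained linear classifier. The checkpoint name used below, facebook/dinov2-base-imagenet1k-1-layer, is an assumption about the variant that matches dinov2-base, so check the model card on the Hub before relying on it. A minimal sketch, with a hypothetical helper named classify:

```python
# Assumed checkpoint name: verify on the Hugging Face Hub before use.
IMAGENET_CHECKPOINT = "facebook/dinov2-base-imagenet1k-1-layer"


def classify(image, checkpoint=IMAGENET_CHECKPOINT):
    """Return (number of logits, predicted label name) for a PIL image."""
    # Imported inside the function so that defining the helper stays cheap;
    # calling it downloads the checkpoint on first use.
    import torch
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(checkpoint)
    model.eval()

    inputs = processor(image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_index = logits.argmax(-1).item()
    return logits.shape[-1], model.config.id2label[predicted_index]
```

Calling classify(image) with the cat image from the question should report 1000 logits if the checkpoint loads as assumed, and the "newly initialized" warning for the classifier should not appear.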