RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' - PEFT Huggingface trying to run on CPU


I am relatively new to LLMs and trying to catch up. Following an example, I modified the code a bit so that everything runs locally on an EC2 instance. Training went OK on CPU only (27 hours), and I saved the model, tokenizer, and configs to disk. However, I am having trouble with inference. Here is the code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftConfig, PeftModel
from transformers import BitsAndBytesConfig
from accelerate import Accelerator

accelerator=Accelerator(cpu=True)

device = torch.device("cpu")
quantization_config = BitsAndBytesConfig(load_in_8bit_fp32_cpu_offload=True)


# Import the model
config = PeftConfig.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, 
                                             torch_dtype=torch.bfloat16, 
                                             low_cpu_mem_usage=True,
                                             #return_dict=True, 
                                             #quantization_config=quantization_config,
                                             #load_in_8bit=True, 
                                             #device_map=device_map,
                                             #device_map="auto",
                                            )
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
peft_model = PeftModel.from_pretrained(model, model_dir, torch_dtype=torch.bfloat16, 
                                       low_cpu_mem_usage=True,
                                      use_cache=True)
peft_model.to(device)

prompt = "The hobbits were so suprised seeing their friend"

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(torch.long)  # Convert to Long data type
attention_mask = inputs["attention_mask"].to(torch.float32)

tokens = peft_model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=100,
    temperature=1,
    eos_token_id=tokenizer.eos_token_id,
    early_stopping=True,
    use_cache=True
)

Loading model and peft_model seems to work; I get no errors from those parts. The error comes from the peft_model.generate call. The traceback is long and ugly; here is how it ends:

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

I checked a lot of resources and links and tried to modify the code accordingly, but nothing seems to work. I am obviously doing something wrong, or attempting the impossible by running this on the CPU. But let's say that, for now, it has to run on CPU. Any help is much appreciated.


There is 1 answer

Mr. Data

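Why are you using "torch_dtype=torch.bfloat16" on CPU?

The error message "RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'" means that torch.addmm, the matrix multiply-and-add that linear layers typically use, has no CPU implementation for the Half data type in your PyTorch build. Half is the 16-bit floating-point type torch.float16, and PyTorch does not implement all of its operations for this type on the CPU. Note that 'Half' is float16, not bfloat16, so some of the weights are evidently in float16 by the time generate runs. On CPU, the safe choice is to run everything in float32, for example by loading with torch_dtype=torch.float32.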
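
To make this concrete, here is a minimal sketch of checking which dtypes a model's weights actually use and upcasting them to float32 before running on CPU. The helper names param_dtypes and to_cpu_float32 are hypothetical, and a small nn.Linear stands in for the real model so the snippet runs without downloading anything; it is an illustration under those assumptions, not the exact fix for the code in the question.

```python
import torch
from torch import nn


def param_dtypes(model: nn.Module) -> set:
    """Return the set of dtypes used by the model's parameters."""
    return {p.dtype for p in model.parameters()}


def to_cpu_float32(model: nn.Module) -> nn.Module:
    """Move a model to the CPU and upcast its floating-point weights to float32."""
    return model.to(device="cpu", dtype=torch.float32)


# Stand-in (assumption) for a model whose weights ended up in half precision.
layer = nn.Linear(8, 4).half()
dtypes_before = param_dtypes(layer)

layer = to_cpu_float32(layer)
dtypes_after = param_dtypes(layer)

# The float32 input now matches the float32 weights, so the forward pass
# does not need any half-precision CPU kernels.
with torch.no_grad():
    out = layer(torch.randn(2, 8))

print(dtypes_before, dtypes_after, out.dtype)
```

Applied to the question's code, the same idea would presumably mean passing torch_dtype=torch.float32 when loading the base model, and calling param_dtypes(peft_model) before generate to confirm that no float16 weights remain; calling peft_model.float() is another way to upcast. Float32 doubles the memory needed compared with 16-bit weights, so this only helps if the model fits in RAM.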