Can I use LoRA to just reduce the size and run inference?


So, LoRA basically makes fine-tuning a model really easy, right? But I just want to test a language model, in my case Flan-T5. Can I use LoRA to make it small so it fits on my GPU? I've seen tutorials that train the model with HF, but I just want to run it for inference. How can I do that? I was trying this with Hugging Face:

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model_name_or_path = "google/flan-t5-xl"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, device_map="auto")

model = get_peft_model(model, peft_config)

and then just save it, but I'm not sure if this is the right approach. Thanks!


1 Answer

Answered by NLP from scratch:

If you just want to do inference, not training / fine-tuning, LoRA is not what you need: LoRA adds small trainable adapter weights on top of the full base model, so it doesn't shrink the model at all. What you want is model quantization, for example via GPTQ; see the blog post from Hugging Face here: Making LLMs lighter with AutoGPTQ and transformers.
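As a rough illustration, this is a minimal sketch of the quantization path that blog post describes. Note the transformers GPTQ integration targets decoder-only (causal) models, so the model name below is the blog's example, not Flan-T5, and the setup assumes `pip install auto-gptq optimum` with a recent transformers:

# Minimal sketch of the AutoGPTQ + transformers integration from the
# blog post above; model name and output directory are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # the blog's example model, not Flan-T5
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize to 4 bits, calibrating on the "c4" dataset during quantization.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Save the quantized weights so they can be reloaded directly for inference.
model.save_pretrained("opt-125m-gptq")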

More practically, you should look for an already-quantized version of the model you want to try out; e.g., for FLAN-T5 here is one: https://huggingface.co/limcheekin/flan-t5-xl-ct2
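That particular checkpoint is a CTranslate2 conversion, so it is loaded with the ctranslate2 package rather than transformers. A hedged sketch, assuming the repo contains a standard CT2 conversion and that the original google/flan-t5-xl tokenizer applies (requires `pip install ctranslate2 transformers huggingface_hub`):

# Sketch of running the pre-quantized CTranslate2 model linked above.
import ctranslate2
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the converted model files and load the original tokenizer.
model_path = snapshot_download("limcheekin/flan-t5-xl-ct2")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

translator = ctranslate2.Translator(model_path, device="cuda")  # or "cpu"

prompt = "Translate English to German: How are you?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# translate_batch returns hypotheses as token sequences; decode back to text.
results = translator.translate_batch([tokens])
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))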