My code
from ctransformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("TheBloke/Mistral-7B-v0.1-GGUF", model_file='mistral-7b-v0.1.Q4_K_M.gguf', model_type='mistral', hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
device = 'cuda:0'
prompt = 'text'
model_inputs = tokenizer(prompt, return_tensors="pt")
model_inputs.to(device)
model.to(device)
But the model is still on the CPU.
I've also tried:
model = model.to(device)
I've also tried creating the device object with torch:
device = torch.device('cuda')
It looks like you're using the ctransformers library, which makes GPU-based inference a little tricky. As noted here, you must specify the gpu_layers parameter when loading the model; a snippet for that, and one for the regular transformers library, follows below.
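A minimal sketch of the ctransformers route, keeping your call and only adding gpu_layers (the value 50 below is an assumption; pick whatever number of offloaded layers fits your VRAM). With hf=True, generation can then go through a standard transformers pipeline:

from ctransformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline

# gpu_layers tells ctransformers how many transformer layers to offload to the GPU;
# without it, inference runs on the CPU regardless of any .to(device) calls.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GGUF",
    model_file="mistral-7b-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,  # assumed value -- tune to your VRAM
    hf=True,
)
tokenizer = AutoTokenizer.from_pretrained(model)

# Generate through the usual transformers pipeline interface.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("text", max_new_tokens=50)[0]["generated_text"])

If you would rather use the regular transformers library, load the model there and move it to the GPU explicitly. This is a sketch under the assumption that you download the full, non-GGUF checkpoint mistralai/Mistral-7B-v0.1 and that your GPU has enough memory for the half-precision weights:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda:0"
model_id = "mistralai/Mistral-7B-v0.1"

# Load in float16 to roughly halve the memory footprint, then move to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The inputs must live on the same device as the model.
model_inputs = tokenizer("text", return_tensors="pt").to(device)
output = model.generate(**model_inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))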
Info on GPU sizing for this model is available here.