I am comparing the performance of two builds of the wizardlm-13b model that I downloaded from Hugging Face: one in the GGUF format and one in the older GGMLv3 format. I found that the GGUF version runs about 4x slower than the GGMLv3 version. As best I can tell, both are 4-bit quantized models derived from the same base model.
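As a sanity check on the quantization, loading the GGUF file with verbose logging enabled makes llama.cpp print the file's metadata, including the quantization type, at load time. A minimal sketch of that check (the path below is a placeholder, not my actual file):

from llama_cpp import Llama

# With verbose=True (the default), llama.cpp prints the model metadata
# at load time, including the quantization type (e.g. Q4_0 vs Q4_K_M).
llm = Llama(model_path="./models/wizardlm-13b.Q4_0.gguf", verbose=True)  # placeholder path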
I ran inference on both models using the llama-cpp-python package.
I used the following code to benchmark the performance (swapping the model path for each file):
from llama_cpp import Llama

# Path is swapped between the GGMLv3 and GGUF files; with the default
# verbose logging, llama.cpp prints its timing stats after each call.
llm = Llama(model_path="./models/7B/llama-model.gguf")
output = llm("Q: Name all of the planets in the solar system? A: ", max_tokens=64, stop=["Q:", "\n"], echo=True)
print(output)
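For a more direct comparison than eyeballing the log output, generation speed can also be measured in tokens per second. This is a minimal sketch of such a loop; the file names and the tokens_per_second helper are placeholders of mine, and since recent llama-cpp-python releases load only GGUF, in practice the GGMLv3 measurement has to run under an older release rather than in the same loop:

import time
from llama_cpp import Llama

def tokens_per_second(model_path: str, n_runs: int = 3) -> float:
    # Hypothetical helper: averages generation speed over a few runs.
    llm = Llama(model_path=model_path, verbose=False)
    prompt = "Q: Name all of the planets in the solar system? A: "
    speeds = []
    for _ in range(n_runs):
        start = time.perf_counter()
        out = llm(prompt, max_tokens=64, stop=["Q:", "\n"])
        elapsed = time.perf_counter() - start
        # The completion dict reports how many tokens were generated.
        speeds.append(out["usage"]["completion_tokens"] / elapsed)
    return sum(speeds) / len(speeds)

# Placeholder file names for the two quantized models being compared.
for path in ["./models/wizardlm-13b.ggmlv3.q4_0.bin",
             "./models/wizardlm-13b.Q4_0.gguf"]:
    print(path, f"{tokens_per_second(path):.1f} tok/s")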
And here are the results, running on my CPU:
I saw similar performance differences when running on GPU as well.
I am trying to understand the root cause of this performance difference. Is GGUF expected to run slower than GGMLv3? Could the newer version of llama-cpp-python be responsible? Or is there something else I am overlooking?
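In case the version question matters, this is an easy way to record which llama-cpp-python release each benchmark ran under (assuming the package exposes __version__, which recent releases do):

import llama_cpp

# Record which release produced each timing run, since GGMLv3 and GGUF
# files generally require different llama-cpp-python versions.
print(llama_cpp.__version__)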