GGUF model running slow compared to GGMLv3 based on same base model


I am comparing the performance of two instances of the wizardlm-13b model, both downloaded from HuggingFace. The GGUF version of the model runs about 4x slower than the GGMLv3 version. As best I can tell, both are 4-bit quantized models derived from the same base model.

I ran inference on both models using the llama-cpp-python package.

I used the following code to benchmark the performance:

from llama_cpp import Llama

# model_path is swapped between the GGUF and GGMLv3 files for each run
llm = Llama(model_path="./models/7B/llama-model.gguf")
output = llm("Q: Name all of the planets in the solar system? A: ", max_tokens=64, stop=["Q:", "\n"], echo=True)
print(output)
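For a head-to-head comparison, the timing can be made explicit and the thread count pinned so both runs use the same settings. This is only a sketch; the model paths and thread count below are placeholders, not the actual files from the question:

```python
import time

def tokens_per_second(n_tokens, elapsed_s):
    """Throughput computed from a completion's token count and wall-clock time."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def benchmark(model_path, prompt, n_threads=8):
    # Imported here so the helper above works without llama-cpp-python installed.
    from llama_cpp import Llama
    # Pinning n_threads keeps the two runs comparable.
    llm = Llama(model_path=model_path, n_threads=n_threads)
    start = time.perf_counter()
    output = llm(prompt, max_tokens=64, stop=["Q:", "\n"])
    elapsed = time.perf_counter() - start
    # The completion dict reports how many tokens were generated.
    n_tokens = output["usage"]["completion_tokens"]
    return tokens_per_second(n_tokens, elapsed)

# Example (hypothetical paths):
# gguf_tps  = benchmark("./models/wizardlm-13b.Q4_K_M.gguf", "Q: Name all of the planets in the solar system? A: ")
# ggml_tps  = benchmark("./models/wizardlm-13b.ggmlv3.q4_0.bin", "Q: Name all of the planets in the solar system? A: ")
```

Reporting tokens per second for each file would make the 4x gap concrete and rule out differences in default settings.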

And here are the results, running on my CPU:

(screenshot of the benchmark output omitted)

I saw similar performance differences when running on GPU as well.

I am trying to understand the root cause of this performance difference. Is GGUF expected to run slower than GGMLv3? Could the newer version of llama-cpp-python be causing the difference? Or is there something else I am overlooking?
