I am comparing the performance of two builds of the wizardlm-13b model that I downloaded from Hugging Face: one in the GGUF format and one in the older GGMLv3 format. I found that the GGUF version runs about 4x slower than the GGMLv3 version. As best I can tell, both are 4-bit quantized models derived from the same base model.
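As a sanity check on the quantization, loading the GGUF file with verbose logging enabled makes llama.cpp print the file's metadata, including the quantization type, at load time. A minimal sketch of that check (the path below is a placeholder, not my actual file):

from llama_cpp import Llama

# With verbose=True (the default), llama.cpp prints the model metadata
# at load time, including the quantization type (e.g. Q4_0 vs Q4_K_M).
llm = Llama(model_path="./models/wizardlm-13b.Q4_0.gguf", verbose=True)  # placeholder path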
I ran inference on both models using the llama-cpp-python package.
I used the following code to benchmark the performance (swapping the model path for each file):
from llama_cpp import Llama

# Path is swapped between the GGMLv3 and GGUF files; with the default
# verbose logging, llama.cpp prints its timing stats after each call.
llm = Llama(model_path="./models/7B/llama-model.gguf")
output = llm("Q: Name all of the planets in the solar system? A: ", max_tokens=64, stop=["Q:", "\n"], echo=True)
print(output)
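For a more direct comparison than eyeballing the log output, generation speed can also be measured in tokens per second. This is a minimal sketch of such a loop; the file names and the tokens_per_second helper are placeholders of mine, and since recent llama-cpp-python releases load only GGUF, in practice the GGMLv3 measurement has to run under an older release rather than in the same loop:

import time
from llama_cpp import Llama

def tokens_per_second(model_path: str, n_runs: int = 3) -> float:
    # Hypothetical helper: averages generation speed over a few runs.
    llm = Llama(model_path=model_path, verbose=False)
    prompt = "Q: Name all of the planets in the solar system? A: "
    speeds = []
    for _ in range(n_runs):
        start = time.perf_counter()
        out = llm(prompt, max_tokens=64, stop=["Q:", "\n"])
        elapsed = time.perf_counter() - start
        # The completion dict reports how many tokens were generated.
        speeds.append(out["usage"]["completion_tokens"] / elapsed)
    return sum(speeds) / len(speeds)

# Placeholder file names for the two quantized models being compared.
for path in ["./models/wizardlm-13b.ggmlv3.q4_0.bin",
             "./models/wizardlm-13b.Q4_0.gguf"]:
    print(path, f"{tokens_per_second(path):.1f} tok/s")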
And here are the results, running on my CPU:
I saw similar performance differences when running on GPU as well.
I am trying to understand the root cause of this performance difference. Is GGUF expected to run slower than GGMLv3? Could the newer version of llama-cpp-python be responsible? Or is there something else I am overlooking?
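In case the version question matters, this is an easy way to record which llama-cpp-python release each benchmark ran under (assuming the package exposes __version__, which recent releases do):

import llama_cpp

# Record which release produced each timing run, since GGMLv3 and GGUF
# files generally require different llama-cpp-python versions.
print(llama_cpp.__version__)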