Llama-2 Q4-quantized model's response time on different CPUs


I am running a quantized Llama-2 model from here. I am using two different machines:

  1. 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz, 16.0 GB RAM (15.8 GB usable)

Inference time on this machine is pretty good: I get my desired response in 3-4 minutes.

  2. Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20 GHz (2 processors), 224 GB RAM

Inference time on this machine is very long: it takes around half an hour and the response is unsatisfactory. This machine even has an NVIDIA RTX 2080 Ti GPU, but I am not using it to load the model's weights.

Why does this happen? How does the CPU affect inference performance?

I am using the llama-cpp-python package to load the model.
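For reference, here is roughly how I load and query the model; the model path and prompt are placeholders, and `n_threads` / `n_gpu_layers` are left at values that keep everything on the CPU:

```python
from llama_cpp import Llama

# Load the quantized model; the file path is a placeholder.
llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # illustrative filename
    n_threads=8,      # number of CPU threads used for inference
    n_gpu_layers=0,   # 0 = no layers offloaded, so the GPU stays unused
    n_ctx=2048,       # context window size
)

# Run a single completion on the CPU.
output = llm("Q: What is quantization? A:", max_tokens=128)
print(output["choices"][0]["text"])
```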
