Llama-2 Q4-quantized model's response time on different CPUs


I am running a quantized Llama-2 model from here. I am using two different machines:

  1. 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz, 16.0 GB RAM (15.8 GB usable)

Inference time on this machine is pretty good: I get my desired response in 3-4 minutes.

  2. Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20 GHz (2 processors), 224 GB RAM

Inference time on this machine is very long: it takes around half an hour and the response is unsatisfactory. This machine even has an NVIDIA RTX 2080 Ti GPU, but I am not using it to load the model's weights.

Why does this happen? How does the CPU affect inference performance?

I am using the llama-cpp-python package to load the model.
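For reference, here is roughly how I load and query the model; the model path and prompt are placeholders, and `n_threads` / `n_gpu_layers` are left at values that keep everything on the CPU:

```python
from llama_cpp import Llama

# Load the quantized model; the file path is a placeholder.
llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # illustrative filename
    n_threads=8,      # number of CPU threads used for inference
    n_gpu_layers=0,   # 0 = no layers offloaded, so the GPU stays unused
    n_ctx=2048,       # context window size
)

# Run a single completion on the CPU.
output = llm("Q: What is quantization? A:", max_tokens=128)
print(output["choices"][0]["text"])
```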
