High Latency Issue with 4 GPUs on Mixtral 8x7B Model During Inference


I'm working on a machine with four A100 GPUs, using them for inference on the Mixtral 8x7B model with text-generation-inference. Strangely, using all 4 GPUs gives higher latency than using just two. After digging into it, I found that the extra delay occurs during the initial prefill stage and is mainly due to allreduce operations.

My guess is that any difference between the allreduce in the prefill stage and the decode stage would come from the size of the tensors being transferred, which scales with the seqlen dimension. But consider the decode stage with a batch size of 300 and seqlen of 1, versus the prefill stage with a batch size of 1 and seqlen of 300: the tensors passed to the allreduce should be roughly the same size. So I'm puzzled why the latency is comparable between 4 GPUs and 2 GPUs during decode but significantly different during prefill.
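For reference, here is a minimal standalone sketch of how one could benchmark the raw all_reduce cost for the two shapes in isolation (this assumes PyTorch with NCCL, launched via torchrun; the hidden size of 4096 is Mixtral 8x7B's, and the batch/seqlen values mirror the ones above):

```python
# Sketch: time all_reduce for prefill-shaped vs decode-shaped tensors.
# Launch with: torchrun --nproc_per_node=<2 or 4> allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def bench_all_reduce(shape, iters=50, warmup=10):
    x = torch.randn(*shape, device="cuda", dtype=torch.float16)
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    hidden = 4096  # Mixtral 8x7B hidden size
    prefill = bench_all_reduce((1, 300, hidden))  # batch=1, seqlen=300
    decode = bench_all_reduce((300, 1, hidden))   # batch=300, seqlen=1

    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()} | "
              f"prefill-shaped all_reduce: {prefill * 1e3:.3f} ms | "
              f"decode-shaped all_reduce: {decode * 1e3:.3f} ms")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running this once with `--nproc_per_node=2` and once with `--nproc_per_node=4` should show whether the raw collective itself is slower on 4 GPUs for the prefill-shaped tensor, or whether the slowdown comes from somewhere else in the prefill path.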
