How to interpret multi-gpu tensorflow profile run to figure out bottleneck?


I am trying to figure out why my multi-GPU training with TensorFlow's MirroredStrategy does not scale when training a 20-block x 128-filter ResNet. A single-GPU run shows 100% GPU utilization with no gaps, and the input pipeline appears to be fast enough. With 2 GPUs, however, the epoch time does not decrease at all, even though I doubled the batch size. Because of the large input size (128x256x74), the maximum batch size is 8 on one GPU and 16 on two. I have attached the TensorFlow Profiler result below, but I don't know how to interpret it well enough to find the bottleneck. It looks like GPUs 0 and 1 are working sequentially, and the NCCL communication time seems rather large, right? I want to understand what is causing the scaling issue: the input pipeline or the interconnect between the GPUs? The data is read from RAM, so I am not sure the former is the cause.
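For reference, my training setup looks roughly like the sketch below. `build_resnet` and the in-memory `images`/`labels` arrays are placeholders standing in for my actual model builder and data; the rest is the standard MirroredStrategy pattern:

    import tensorflow as tf

    # Mirror across the two PIX-connected GPUs; NCCL all-reduce is the default.
    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

    PER_REPLICA_BATCH = 8
    GLOBAL_BATCH = PER_REPLICA_BATCH * strategy.num_replicas_in_sync  # 16 on 2 GPUs

    with strategy.scope():
        model = build_resnet(blocks=20, filters=128)  # placeholder for my ResNet builder
        model.compile(optimizer="adam", loss="mse")

    # Data already sits in RAM; prefetch keeps the host a step ahead of the GPUs.
    ds = (tf.data.Dataset.from_tensor_slices((images, labels))  # in-memory arrays
          .batch(GLOBAL_BATCH)
          .prefetch(tf.data.AUTOTUNE))

    model.fit(ds, epochs=10)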

[screenshot: TensorFlow Profiler trace view for the 2-GPU run]
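In case it matters, the trace above was captured roughly like this (the logdir path is arbitrary):

    import tensorflow as tf

    # Profile a handful of training steps, then inspect in TensorBoard's Profile tab.
    tf.profiler.experimental.start("logs/profile")
    model.fit(ds, epochs=1, steps_per_epoch=20)
    tf.profiler.experimental.stop()
    # tensorboard --logdir logs/profile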

GPUs 0 and 1 are connected through a single PCIe bridge (PIX), as shown by `nvidia-smi topo -m`:

> nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity
GPU0     X  PIX PHB PHB SYS SYS SYS SYS SYS 0-9 0
GPU1    PIX  X  PHB PHB SYS SYS SYS SYS SYS 0-9 0
GPU2    PHB PHB  X  PIX SYS SYS SYS SYS SYS 0-9 0
GPU3    PHB PHB PIX  X  SYS SYS SYS SYS SYS 0-9 0
GPU4    SYS SYS SYS SYS  X  PIX PHB PHB PHB 10-19   1
GPU5    SYS SYS SYS SYS PIX  X  PHB PHB PHB 10-19   1
GPU6    SYS SYS SYS SYS PHB PHB  X  PIX PHB 10-19   1
GPU7    SYS SYS SYS SYS PHB PHB PIX  X  PHB 10-19   1
NIC0    SYS SYS SYS SYS PHB PHB PHB PHB  X      

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx4_0
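Since there is no NVLink here (only PIX/PHB), one thing I am considering is swapping out the NCCL all-reduce for a host-mediated one; this is a sketch of something to test, not a known fix:

    import tensorflow as tf

    # Same strategy, but reduce gradients through the host instead of NCCL.
    # On a PCIe-only topology this is sometimes faster; whether it helps
    # in my case is exactly what I am trying to determine.
    strategy = tf.distribute.MirroredStrategy(
        devices=["/gpu:0", "/gpu:1"],
        cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(),
    )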
