How does SEND bandwidth improve when the registered memory is aligned to system page size? (In Mellanox IBD)


Operating System: RHEL/CentOS 7.9 (latest)

Operation: sending 500 MB chunks 21 times from one system to another, connected via Mellanox cables (Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]).

(The registered memory region (500 MB) is reused for all 21 iterations.)
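For reference, the send loop corresponds roughly to the sketch below. This is a minimal illustration assuming libibverbs, not the actual code; `qp` and `mr` are set up elsewhere, and queue-pair creation and completion polling are omitted.

```c
/* Minimal sketch of the loop being described (illustrative only):
 * one registered MR, posted as a SEND 21 times. */
#include <stdint.h>
#include <infiniband/verbs.h>

int send_chunks(struct ibv_qp *qp, struct ibv_mr *mr)
{
    for (int i = 0; i < 21; i++) {          /* 21 iterations, same MR */
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,  /* the one 500 MB region */
            .length = (uint32_t)mr->length,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = i,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad = NULL;
        if (ibv_post_send(qp, &wr, &bad))
            return -1;
        /* ...poll the send CQ for a completion before the next post... */
    }
    return 0;
}
```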

The gain in message send bandwidth when using aligned_alloc() (with a system page size of 4096 B) instead of malloc() for the registered memory is around 35 Gbps:

with malloc(): ~86 Gbps

with aligned_alloc(): ~121 Gbps
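The only difference between the two runs is the allocation call for the buffer that gets registered. A minimal sketch of that difference (assumed setup, error handling omitted; `pd` is an already-created protection domain):

```c
/* Sketch of the two allocation variants being compared. */
#include <stdlib.h>
#include <unistd.h>
#include <infiniband/verbs.h>

#define CHUNK (500UL * 1024 * 1024)   /* 500 MB */

struct ibv_mr *register_chunk(struct ibv_pd *pd, int aligned)
{
    void *buf;
    if (aligned) {
        long page = sysconf(_SC_PAGESIZE);   /* 4096 here */
        /* aligned_alloc() requires the size to be a multiple of the
         * alignment; 500 MB already is for a 4096 B page. */
        buf = aligned_alloc(page, CHUNK);    /* ~121 Gbps run */
    } else {
        buf = malloc(CHUNK);                 /* ~86 Gbps run */
    }
    /* Registered once; the MR is then reused for all 21 sends. */
    return ibv_reg_mr(pd, buf, CHUNK,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}
```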

Since the CPU is not involved in these operations, why is the operation faster with aligned memory? Please provide useful reference links, if available, that explain this. What change does aligned memory bring to the read/write operations? Is it the address translation within the device that improves?

[Very limited information is available on the internet about this, hence asking here.]


1 answer

Answered by Ankush Jain:

RDMA operations use either MMIO or DMA to transfer data from main memory to the NIC over the PCIe bus; DMA is used for larger transfers.

The behavior you're observing can be entirely explained by the DMA component of the transfer. DMA operates on physical addresses, and a contiguous region of the virtual address space is unlikely to be mapped to a contiguous region of physical memory. This fragmentation has a cost: more translation work is needed per unit of data transferred, and DMA bursts are interrupted at physical page boundaries. Page-aligning the buffer at least guarantees that every page-sized piece of the transfer starts and ends exactly on a page boundary, so no descriptor has to cover a partial page.
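A back-of-the-envelope illustration of the boundary effect (a hypothetical helper, not benchmark code): the single extra page is negligible, but a mid-page start means no transfer unit lines up with a physical page, which is the per-unit translation cost described above.

```c
/* How many 4 KiB pages a buffer touches, depending on where it
 * starts inside a page (illustrative arithmetic only). */
#include <stdio.h>

static size_t pages_spanned(size_t start_offset, size_t len, size_t page)
{
    return (start_offset + len + page - 1) / page;
}

int main(void)
{
    const size_t page = 4096;
    const size_t len  = 500UL * 1024 * 1024;   /* one 500 MB chunk */

    /* Page-aligned start: exactly len/page whole pages. */
    printf("aligned:   %zu pages\n", pages_spanned(0, len, page));

    /* malloc() typically guarantees only 16-byte alignment, so the
     * buffer can start mid-page: one extra page is touched, and every
     * page-sized DMA unit is shifted off the page boundaries. */
    printf("unaligned: %zu pages\n", pages_spanned(16, len, page));
    return 0;
}
```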

[1] https://www.kernel.org/doc/html/latest/core-api/dma-api-howto.html

[2] Memory alignment