I have an OpenCL kernel that multiplies two matrices (GEMM) with M=4096, N=4096 and K=16 (i.e. the inputs are 4096 x 16 matrices of floats).
I run it on a Polaris 560, a 16-CU GPU.
Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl
I noticed a very strange performance drop for this size: the multiplication runs at ~8-10 GFLOPS, while if I change N to 4095 or 4097 I get around 130-150 GFLOPS. I noticed similar behaviour with other GEMM libraries such as clBLAS and MIOpenGEMM - a significant performance drop for this particular 4096x16 size, while changing N by 1 boosts performance several times.
The workload is split into work-groups of 256 threads. Each work-group handles 128x16 tiles of A and B (an 8x8 block of the output per thread, i.e. a 128x128 output tile per work-group).
I tried changing the matrix tiling to 96x96 with 6x6 blocks instead of 128x128 with 8x8 - same result.
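For context, the tiling scheme looks roughly like the sketch below. This is not the actual gemm.cl kernel (it skips the local-memory staging of the 128x16 tiles and the bounds checks on the loads); the kernel name and buffer layout assumptions are illustrative only.

```
// Simplified sketch of the tiling scheme (not the actual gemm.cl kernel).
// A 16x16 work-group (256 threads) computes a 128x128 tile of C;
// each thread accumulates its own 8x8 register block.
#define BLOCK 8     // per-thread block (8x8)
#define TILE  128   // per-work-group output tile (128x128)

__kernel void gemm_tile_sketch(int M, int N, int K,
                               __global const float *A,  // M x K, row-major
                               __global const float *B,  // K x N, row-major
                               __global float *C)        // M x N, row-major
{
    // top-left corner of the 8x8 block this thread owns
    int row0 = get_group_id(0) * TILE + get_local_id(0) * BLOCK;
    int col0 = get_group_id(1) * TILE + get_local_id(1) * BLOCK;

    float acc[BLOCK][BLOCK] = {{0.0f}};

    for (int k = 0; k < K; ++k) {           // K = 16 here
        for (int i = 0; i < BLOCK; ++i)
            for (int j = 0; j < BLOCK; ++j)
                acc[i][j] += A[(row0 + i) * K + k] * B[k * N + col0 + j];
    }

    for (int i = 0; i < BLOCK; ++i)
        for (int j = 0; j < BLOCK; ++j)
            if (row0 + i < M && col0 + j < N)
                C[(row0 + i) * N + col0 + j] = acc[i][j];
}
```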
I tested the same code with ROCm 3.7 OpenCL, Clover OpenCL and even with the Windows OpenCL driver - same behaviour.
There is no such issue on an NVIDIA GTX 960, which has the same number of GPU cores (threads) and the same memory type/size.
I suspect that this is somehow cache/collision related, but I don't understand how it happens, so I don't know how to work around it.
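For what it's worth, here is the kind of arithmetic that made me suspect address aliasing. Assume (these are illustrative GCN-like numbers, not verified for the Polaris 560) that global memory is interleaved across 8 channels every 256 bytes, and that the work-groups running concurrently are stacked along M, i.e. each one writes its C tile 128 rows below the previous one:

```
/* Toy channel-conflict calculation, not a measurement.
 * Assumptions (illustrative only): 8 memory channels interleaved every
 * 256 bytes, and concurrently running work-groups stacked along M, so each
 * one writes its 128-row C tile right below the previous one's. */
#include <stdio.h>

int main(void) {
    const long tile_rows = 128;                /* C rows per work-group    */
    const long ns[]      = {4095, 4096, 4097}; /* row pitch of C in floats */

    for (int i = 0; i < 3; ++i) {
        printf("N = %ld -> channels of the first 8 tiles:", ns[i]);
        for (long g = 0; g < 8; ++g) {
            long addr    = g * tile_rows * ns[i] * 4; /* tile start, bytes */
            long channel = (addr / 256) % 8;          /* 256 B x 8 channels */
            printf(" %ld", channel);
        }
        printf("\n");
    }
    return 0;
}
```

Under those assumptions, with N = 4096 every tile start offset is an exact multiple of 2 MiB, so all concurrently running work-groups begin on the same channel, while with N = 4095 or 4097 the offsets drift by 512 bytes per tile and spread across channels - which would explain why changing N by one matters so much.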
Finally I found that the clBLAS library (originally developed by AMD) handles the special case of lda % 1024 == 0 and ldb % 1024 == 0, probably because of the same cache behaviour: https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/specialCases/GemmSpecialCases.cpp#L228 Note that for this size the leading dimensions are exact multiples of 1024 (4096), while for N = 4095 or 4097 they are not.
I found that a better way was to rearrange the blocks in z-curve order instead of queuing several kernels:
https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109
To handle the cases M != N or M != 1<<n, I just increased the number of work-groups along M/N to the nearest 1<<n; groups that have no work exit at the beginning, which does not add much overhead. The z-order dispatch improved performance about 4x.
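For reference, a minimal sketch of such a z-order (Morton) group mapping is below. It is not the exact code from gemm.cl#L109; the helper name and the assumption of a 1-D launch over a grid padded to the next power of two are mine for illustration.

```
/* Sketch of z-order work-group dispatch (not the exact gemm.cl code).
 * Launch with a 1-D range of groups_pow2 * groups_pow2 work-groups, where
 * groups_pow2 is the tile count along M/N rounded up to a power of two. */
uint compact_by_1(uint x)
{
    /* drop the odd-positioned bits of x and pack the even ones together */
    x &= 0x55555555u;
    x = (x ^ (x >> 1)) & 0x33333333u;
    x = (x ^ (x >> 2)) & 0x0F0F0F0Fu;
    x = (x ^ (x >> 4)) & 0x00FF00FFu;
    x = (x ^ (x >> 8)) & 0x0000FFFFu;
    return x;
}

__kernel void gemm_zorder_sketch(int tiles_m, int tiles_n, __global float *C)
{
    uint g      = get_group_id(0);       /* linear group index          */
    uint tile_n = compact_by_1(g);       /* x coordinate from even bits */
    uint tile_m = compact_by_1(g >> 1);  /* y coordinate from odd bits  */

    /* the grid is padded to a power of two, so some groups have no work */
    if (tile_m >= (uint)tiles_m || tile_n >= (uint)tiles_n)
        return;

    /* ... compute the 128x128 C tile at (tile_m, tile_n) as before ... */
}
```

The effect is that work-groups with nearby linear IDs (the ones likely to run at the same time) cover a compact square of C tiles instead of a long strip, so their memory accesses are spread over more channels.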