Estimating the optimal tiling size for GPU matrix computations

I've written a matrix multiplication kernel in SYCL that tiles sub-matrices into local memory. With tiling (tile size 16x16) I get up to a 2x speedup over the naive (untiled) approach.

For smaller tile sizes I get close to naive speeds, which is expected. For any tile size above 16, such as 32 (and I would choose a power of 2, since my matrix size is one), the kernel throws a SYCL exception.

I suspect this is because the GPU cannot accommodate the larger tile size in its local memory.

Questions:

  1. How do I dynamically determine (and set) the maximum supported tile size when deploying on different GPUs?
  2. For Intel GPUs, how can I find out the maximum GPU local cache size?

I tried checking ark.intel.com, but it doesn't list the GPU local cache size. Current setup: i7-8665U with Intel UHD Graphics 620.

P.S.: If you would like to see my kernel code, please leave a comment and I will add it. I currently don't feel the need to include it and bloat the post.

There are 2 answers

Answer by Karan Shah (accepted):

@Artyom has explained the things to take care of when implementing matrix multiplication on a GPU.

On the questions, here are the snippets in SYCL that show what I was looking for:

#include <CL/sycl.hpp>
#include "dpc_common.hpp"  // provides dpc_common::exception_handler (oneAPI samples)
#include <iostream>
using namespace sycl;

// Create a queue on the default device and query its limits
default_selector d_selector;
queue q(d_selector, dpc_common::exception_handler);
std::cout << "Enumerated Device: "
          << q.get_device().get_info<info::device::name>() << "\n";

auto wgroup_size     = q.get_device().get_info<info::device::max_work_group_size>();
auto local_mem_size  = q.get_device().get_info<info::device::local_mem_size>();
auto global_mem_size = q.get_device().get_info<info::device::global_mem_size>();

std::cout << "Maximum workgroup size\t:" << wgroup_size << "\n"
          << "Global Memory Size\t:" << global_mem_size / 1024 / 1024 << " MB\n"
          << "Local Memory Size\t:" << local_mem_size / 1024 << " KB\n";

This shows:

Enumerated Device: Intel(R) Gen9
Maximum workgroup size  :256
Global Memory Size      :3199 MB
Local Memory Size       :64 KB
  1. The maximum work-group size is 256, i.e. for a square 2D work-group, 16 per dimension (16x16) is the largest the device supports; the sketch below derives this dynamically.
  2. The local memory size is 65536 bytes (64 KB), which matches the shared local memory (SLM) size documented for Intel Gen9 graphics.
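
Putting the two limits together, here is a minimal sketch (a hypothetical helper, not part of the original post) that derives the largest safe power-of-two tile size, assuming one work-item per tile element and two float tiles (for A and B) resident in local memory:

#include <sycl/sycl.hpp>
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper: largest square tile that satisfies both device limits.
size_t max_square_tile(const sycl::device& dev) {
    size_t wg  = dev.get_info<sycl::info::device::max_work_group_size>();
    size_t slm = dev.get_info<sycl::info::device::local_mem_size>();
    // Work-group limit: tile*tile work-items must fit in one work-group.
    size_t by_wg = static_cast<size_t>(std::sqrt(static_cast<double>(wg)));
    // Local-memory limit: two float tiles (A and B) must fit in SLM.
    size_t by_slm = static_cast<size_t>(std::sqrt(slm / (2.0 * sizeof(float))));
    size_t tile = std::min(by_wg, by_slm);
    // Round down to a power of two, matching the power-of-two matrix sizes.
    size_t p = 1;
    while (p * 2 <= tile) p *= 2;
    return p;
}

On the device above this returns 16: the work-group limit dominates, since 64 KB of SLM alone could hold two float tiles of up to about 90x90.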

Answer by Artyom:

In general, when tiling matrix multiplication there are several things you need to take care of:

  1. Per-thread tile size - the per-thread tile has to live in registers, which are scarce (on NVIDIA, for example, around 256 per thread), so a per-thread tile cannot exceed roughly 16x16; in practice 6x6 to 8x8 per thread is the sweet spot on NVIDIA/AMD/Intel GPUs.
  2. It is better to load a large tile (like 128x128, or 72x72 for AMD) into local memory and split the work into smaller per-thread tiles across the work-group - but you have to be very careful to avoid bank conflicts (see the kernel sketch after this list).
  3. Optimal parameter selection depends on the GPU vendor (AMD/NVIDIA/Intel/ARM Mali etc.), the GPU version/generation, and of course the matrix size. CLBlast, for example, ships complex tuning routines for selecting matrix multiplication parameters.
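
For illustration, here is a minimal sketch of the local-memory tiling idea from point 2 (this is not the OP's kernel; it uses SYCL 2020 style, a fixed 16x16 tile, assumes the tile size divides N, and stages a single tile per work-group rather than the larger multi-level tiles discussed above):

#include <sycl/sycl.hpp>

constexpr size_t TILE = 16;

void tiled_matmul(sycl::queue& q, const float* A, const float* B, float* C, size_t N) {
    sycl::buffer<float, 2> a{A, sycl::range<2>{N, N}};
    sycl::buffer<float, 2> b{B, sycl::range<2>{N, N}};
    sycl::buffer<float, 2> c{C, sycl::range<2>{N, N}};

    q.submit([&](sycl::handler& h) {
        sycl::accessor a_acc{a, h, sycl::read_only};
        sycl::accessor b_acc{b, h, sycl::read_only};
        sycl::accessor c_acc{c, h, sycl::write_only, sycl::no_init};
        // Local (shared) memory for one tile of A and one of B; padding the
        // second dimension is a common trick against bank conflicts, omitted here.
        sycl::local_accessor<float, 2> a_tile{sycl::range<2>{TILE, TILE}, h};
        sycl::local_accessor<float, 2> b_tile{sycl::range<2>{TILE, TILE}, h};

        h.parallel_for(
            sycl::nd_range<2>{{N, N}, {TILE, TILE}},
            [=](sycl::nd_item<2> it) {
                size_t gr = it.get_global_id(0), gc = it.get_global_id(1);
                size_t lr = it.get_local_id(0),  lc = it.get_local_id(1);
                float sum = 0.0f;
                for (size_t t = 0; t < N / TILE; ++t) {
                    // Cooperatively stage one tile of A and B in local memory.
                    a_tile[lr][lc] = a_acc[gr][t * TILE + lc];
                    b_tile[lr][lc] = b_acc[t * TILE + lr][gc];
                    sycl::group_barrier(it.get_group());
                    for (size_t k = 0; k < TILE; ++k)
                        sum += a_tile[lr][k] * b_tile[k][lc];
                    sycl::group_barrier(it.get_group());
                }
                c_acc[gr][gc] = sum;
            });
    });
    // Buffers go out of scope here; destruction blocks until the kernel
    // finishes and copies the result back into C.
}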

So, in order to select optimal parameters, you need to look at the wavefront/warp/SIMD width for AMD/NVIDIA/Intel GPUs (64 or 32 / 32 / 8-32), the number of local memory banks, the register count per thread, and so on. In general this can be done with automatic tuning, caching the resulting values; a sketch of that idea follows.
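
As a hypothetical sketch of such a tuner (run_tiled_matmul is a stand-in for a benchmark kernel parameterized by tile size, not a real API):

#include <sycl/sycl.hpp>
#include <chrono>
#include <limits>

// Hypothetical stand-in: launches a tiled matmul of size N with the given
// tile size; in a real tuner this would be your parameterized kernel.
void run_tiled_matmul(sycl::queue& q, size_t N, size_t tile);

size_t autotune_tile(sycl::queue& q, size_t N) {
    size_t wg = q.get_device().get_info<sycl::info::device::max_work_group_size>();
    size_t best_tile = 1;
    double best_time = std::numeric_limits<double>::max();
    for (size_t tile : {4, 8, 16, 32}) {
        if (tile * tile > wg) continue;  // skip sizes the device cannot launch
        auto t0 = std::chrono::steady_clock::now();
        run_tiled_matmul(q, N, tile);
        q.wait();
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        if (dt < best_time) { best_time = dt; best_tile = tile; }
    }
    return best_tile;  // in practice, cache this result per device
}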

I found this tutorial very helpful in understanding various issues to make fast matrix multiplications:

https://cnugteren.github.io/tutorial/pages/page1.html

Even there, he reaches only about 50-60% of peak efficiency; implementing a good matrix multiplication kernel is hard.

And this is an Intel-specific tutorial: https://software.intel.com/content/www/us/en/develop/articles/sgemm-for-intel-processor-graphics.html