I've written a matrix multiplication kernel in SYCL, based on tiling sub-matrices into local memory (cache). With tiling (tile size 16x16) I get up to a 2x speed-up over the naive (untiled) approach.
For smaller tile sizes I get close to naive speeds, which is expected. For any tile size larger than 16, such as 32 (I stick to powers of 2 because my matrix size is a power of 2), the kernel throws a SYCL exception.
I suspect this is because the GPU cannot accommodate the larger tile size in its local memory (cache).
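To make the setup concrete without pasting the whole kernel, the allocation looks roughly like the skeleton below (a sketch only, not my actual kernel; `N`, `TS`, the function and accessor names are placeholders, and the kernel body is elided):

```cpp
#include <sycl/sycl.hpp>

constexpr size_t N  = 1024;  // matrix dimension (placeholder)
constexpr size_t TS = 16;    // tile size; 32 is the value that fails for me

void tiled_matmul_skeleton(sycl::queue &q) {
  q.submit([&](sycl::handler &cgh) {
    // Two TS x TS tiles are staged in local memory, so each work-group needs
    // 2 * TS * TS * sizeof(float) bytes of it: 2 KB at TS = 16, 8 KB at TS = 32.
    sycl::local_accessor<float, 2> tileA({TS, TS}, cgh);
    sycl::local_accessor<float, 2> tileB({TS, TS}, cgh);

    cgh.parallel_for(
        sycl::nd_range<2>({N, N}, {TS, TS}),  // work-group size is TS x TS
        [=](sycl::nd_item<2> item) {
          // ... load a tile of A and B, barrier, multiply-accumulate, barrier ...
          (void)tileA; (void)tileB; (void)item;
        });
  });
}
```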
Questions:
- How do I determine (and set) the maximum supported tile size dynamically, when deploying on different GPUs?
- For Intel GPUs, how can I find out the maximum GPU local cache size?
I tried checking ark.intel.com, but it doesn't list the GPU local cache size. Current setup: i7-8665U with Intel UHD 620.
P.S.: If you would like to see my kernel code, please add a comment and I will include it. I currently don't feel the need to post the full kernel and bloat the post.
@Artyom has given an explanation of the things to take care of when implementing matrix multiplication on a GPU.
As for the questions, here are the snippets in SYCL that show what I was looking for.
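The key device queries are `info::device::local_mem_size`, `info::device::local_mem_type` and `info::device::max_work_group_size`: at run time you can pick the largest tile size TS such that TS*TS does not exceed the maximum work-group size and 2*TS*TS*sizeof(float) fits in the local memory size. A minimal sketch, assuming a SYCL 2020 implementation such as DPC++ (the GPU selector and the printing are illustrative):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  sycl::device dev = q.get_device();

  std::cout << "Device: "
            << dev.get_info<sycl::info::device::name>() << "\n";

  // Local (shared local) memory available to one work-group, in bytes.
  std::cout << "Local memory size: "
            << dev.get_info<sycl::info::device::local_mem_size>() << " B\n";

  // Whether local memory is dedicated hardware or emulated in global memory.
  bool dedicated = dev.get_info<sycl::info::device::local_mem_type>() ==
                   sycl::info::local_mem_type::local;
  std::cout << "Local memory type: "
            << (dedicated ? "local (dedicated)" : "global/none") << "\n";

  // Upper bound on work-items per work-group; a TS x TS tile needs TS*TS <= this.
  std::cout << "Max work-group size: "
            << dev.get_info<sycl::info::device::max_work_group_size>() << "\n";
  return 0;
}
```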
This shows: