What stride should I use for matrices in CUDA for the fastest possible speed?


I'm working with matrices ranging in size from 2,000x2,000 up to 5,000x5,000, doing operations such as multiplication and QR decomposition. I'm curious whether, for example, I should align the stride to 64 for all matrices for best performance. Also, should I avoid strides that are a multiple of some page size due to cache associativity, or does that not apply to GPU memory?

Answer by Homer512 (accepted):

I imagine most people trust cudaMallocPitch or cudaMalloc3D to provide the proper alignment, as this is their stated purpose. While not explicitly clarified in the runtime documentation, they align to cudaDeviceProp::textureAlignment (512 bytes on current hardware). There are also NPP's allocator functions, which seem to use different alignment strategies (or at least did in the past). See How does CUDA's nppiMalloc... function guarantee alignment? for some discussion on that.

The lack of a pitched allocator function for the stream ordered memory allocator suggests that alignment may not be as relevant today. Or it might be an oversight in the API, who knows?

What we do know from different parts of the documentation is that the best practices guide simply recommends 32-byte-aligned memory transactions.

I'm not aware of a published list of cache parameters for each generation. Turing's L2 is 4 MiB, 16-way set associative with 64-byte lines, and the memory pages are 2 MiB. If I did the math right, that gives 4 MiB / (16 ways x 64 B) = 4096 sets, so addresses spaced 4096 x 64 B = 256 KiB apart map to the same set; an alignment of 256 KiB would therefore be pathological. With these numbers I'd imagine you could start seeing effects at 16 KiB alignment or more, but I'm not aware of any official guidance on the subject.

Personally, I stick with the pitched allocators. When I don't use them, I use the texture alignment, except for smaller line sizes, where I just use the next power of two so as not to waste as much memory, unless I plan to use texture binding.