I have code that accesses ~4GB of memory one request at a time; each request reads 1024 bits at a random location across those 4GB...
I have a Radeon VII with 16GB of HBM2 on a 4096-bit bus.
Possible optimizations:
Keep the 4GB but fetch 4x data per memory request! (Doesn't work: the result of the first request determines the address of the second, so the data needed for the second request may be far away in memory.)
Split into 4+4+4+4GB groups with 1x data per memory request! (Doesn't improve performance: each request to one 4GB group delays the others, so I get 4 threads at 0.25x performance each.)
Questions:
For optimization 1 - Is it possible to split the 4096-bit bus, so I can fetch four different 1024-bit areas of memory in parallel, in a non-blocking way?
For optimization 2 - Is it possible to address 4GB 'blocks' in parallel, in a way that each block is independent and non-blocking for the others?
PS - I know this depends on the memory controller, so if you know of different hardware that can do this, please let me know too.
Yes, HBM2 is always accessed in parallel, but it's not up to you.
Neither of your proposed optimizations works. OpenCL does not give you control over how the memory bus is used or where memory is allocated; that is up to the drivers. If you allocate 4GB, these 4GB are not placed on only one of the 4 HBM2 stacks, but automatically striped across all 4 stacks to maximize bandwidth.
The best you can do is make sure your memory access is coalesced (structure-of-arrays data layout) and saturate the GPU with enough work items / work groups. The Radeon VII (I use a bunch of them for my stuff as well) has a theoretical bandwidth of 1024GB/s, but don't expect more than 800GB/s in practice.