I have GPU/CUDA code that processes a cube (a 3D image; a spectral cube, to be precise). Think of the cube as a series of images/slices, or alternatively as a bunch of spectra at different spatial locations (on a square grid). Each pixel of an image has different x, y values and the same z; each pixel of a spectrum has the same x, y but a varying z. The cube is laid out in memory so that two consecutive addresses correspond to x and x+1.
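For concreteness, the indexing I have in mind looks roughly like this (`nx`, `ny` are placeholder names for the spatial dimensions, not my actual code):

```cpp
// x is the fastest-varying dimension: consecutive addresses hold x and x+1.
inline long cube_index(long x, long y, long z, long nx, long ny)
{
    return (z * ny + y) * nx + x;
}
// Consequence: an image/slice (fixed z) occupies nx*ny contiguous elements,
// while a spectrum (fixed x,y) is strided by nx*ny elements between z samples.
```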
In my CUDA code I configured each CUDA thread to process a different spectrum; this way I achieve global memory coalescing. Then I ported this code to the Intel Xeon Phi (#pragma offload + OpenMP). As in the GPU case, I have each Phi core process a different spectrum, so memory coalescing is achieved here as well. However, the performance is poor.
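The Phi version is structured roughly like this. This is only a minimal sketch: the names (`cube`, `m1`..`m3`, `nx`, `ny`, `nz`), the `float` type, and the plain power-sum moment formulas are placeholders, not my actual code.

```cpp
#include <omp.h>

// One spectrum per loop iteration; OpenMP distributes the spectra over the Phi cores.
void moments_per_spectrum(const float *cube, float *m1, float *m2, float *m3,
                          long nx, long ny, long nz)
{
    const long nspec = nx * ny;              // one spectrum per spatial pixel
    #pragma offload target(mic:0) in(cube : length(nspec * nz)) \
                                  out(m1, m2, m3 : length(nspec))
    #pragma omp parallel for
    for (long s = 0; s < nspec; ++s) {       // s = y*nx + x
        float s1 = 0.f, s2 = 0.f, s3 = 0.f;
        for (long z = 0; z < nz; ++z) {
            float v = cube[z * nspec + s];   // stride of nx*ny between z samples
            s1 += v;
            s2 += v * v;
            s3 += v * v * v;
        }
        m1[s] = s1;  m2[s] = s2;  m3[s] = s3;
    }
}
```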
- I assume the problem is that, although I have coalesced access to global memory, the pixels along each spectrum are not at consecutive memory addresses, and as a result the Phi's vectorization provides no performance improvement. (Remember, each core performs a kind of reduction over its associated spectrum; to be precise, it calculates the 1st, 2nd, and 3rd moments. The two access patterns are sketched after this list.) Does this reasoning make sense?
- If I am not mistaken, to gain performance from SIMD the memory accesses have to be contiguous, right?
- So it seems that on the Xeon Phi it is impossible to achieve perfect coalescing of global memory and at the same time take full advantage of SIMD. Does this make sense, or am I totally wrong?
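To make the concern in the first two bullets concrete, here is the contrast I have in mind (placeholder names again; these are just sketches of the two access patterns, not my actual code):

```cpp
// Unit-stride: consecutive iterations read consecutive addresses,
// so the compiler can vectorize with packed loads.
float sum_along_x(const float *cube, long nx, long ny, long y, long z)
{
    float sum = 0.f;
    for (long x = 0; x < nx; ++x)
        sum += cube[(z * ny + y) * nx + x];      // stride 1
    return sum;
}

// Strided: consecutive iterations are nx*ny elements apart, so
// vectorizing along z would need gather loads and gains little.
float sum_along_z(const float *cube, long nx, long ny, long nz, long x, long y)
{
    float sum = 0.f;
    for (long z = 0; z < nz; ++z)
        sum += cube[(z * ny + y) * nx + x];      // stride nx*ny
    return sum;
}
```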
PS: I am using an Intel Xeon Phi 7120.