Solving collisions - trying to coalesce gmem access using smem, but getting bank conflicts


I have this code:

    struct __declspec(align(32)) Circle
    {
        float x, y;
        float prevX, prevY;
        float speedX, speedY;
        float mass;
        float radius;

        void init(const int _x, const int _y, const float _speedX = 0.0f, const float _speedY = 0.0f,
                  const float _radius = CIRCLE_RADIUS_DEFAULT,
                  const float _mass = CIRCLE_MASS_DEFAULT);
    };

And the second one:

    /*smem[threadIdx.x] = *(((float*)cOut) + threadIdx.x);
    smem[threadIdx.x + blockDim.x] = *(((float*)cOut) + threadIdx.x + blockDim.x);
    smem[threadIdx.x + blockDim.x * 2] = *(((float*)cOut) + threadIdx.x + blockDim.x * 2);
    smem[threadIdx.x + blockDim.x * 3] = *(((float*)cOut) + threadIdx.x + blockDim.x * 3);
    smem[threadIdx.x + blockDim.x * 4] = *(((float*)cOut) + threadIdx.x + blockDim.x * 4);
    smem[threadIdx.x + blockDim.x * 5] = *(((float*)cOut) + threadIdx.x + blockDim.x * 5);
    smem[threadIdx.x + blockDim.x * 6] = *(((float*)cOut) + threadIdx.x + blockDim.x * 6);
    smem[threadIdx.x + blockDim.x * 7] = *(((float*)cOut) + threadIdx.x + blockDim.x * 7);*/
    __syncthreads();
    /*float x, y;
    float prevX, prevY;
    float speedX, speedY;
    float mass;
    float radius;*/
    /*c.x = smem[threadIdx.x];
    c.y = smem[threadIdx.x + blockDim.x]; // should really be [threadIdx.x * 8 + 0]
    c.prevX = smem[threadIdx.x + blockDim.x * 2]; // [threadIdx.x * 8 + 1] etc.
    c.prevY = smem[threadIdx.x + blockDim.x * 3];
    c.speedX = smem[threadIdx.x + blockDim.x * 4];
    c.speedY = smem[threadIdx.x + blockDim.x * 5];
    c.mass = smem[threadIdx.x + blockDim.x * 6];
    c.radius = smem[threadIdx.x + blockDim.x * 7];*/
    c = cOut[j];
    //c = *((Circle*)(smem + threadIdx.x * SMEM));

There are 2 gmem (global memory) accesses: 1) read a Circle and detect collisions with it, 2) write the Circle back after changing its speed and position. I also have circlesConst, an array of Circle in constant memory (filled with cudaMemcpyToSymbol()). It is used to check for intersections between its circles and the main circle c (which lives in registers and was read from gmem).

I think I used constant memory properly and I'm getting its full benefit :') (Am I wrong?)
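
For reference, the constant-memory part is set up roughly like this (the array size and the helper function are just for illustration, not my exact code):

    // Rough sketch: a __constant__ array of Circle, filled once from the host
    // with cudaMemcpyToSymbol before the kernels are launched.
    #define CIRCLES_CONST_COUNT 1024   // illustrative size

    __constant__ Circle circlesConst[CIRCLES_CONST_COUNT];

    void uploadConstCircles(const Circle* hostCircles, int count)
    {
        cudaMemcpyToSymbol(circlesConst, hostCircles, count * sizeof(Circle));
    }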

When I read about coalesced access to gmem (is there such a thing as coalesced access to other memory types? I couldn't find any info about it), I wanted to try it myself. As you can see, the Circle structure has 8 float fields = 32 bytes. I tried it (the commented-out code above), but, firstly, I get a wrong answer (because I read from smem the wrong way, see the comment below), and secondly, I get 33% less performance. Why? I don't think it's caused by the wrong field mapping.

And the second question: as I wrote in the code comment near the read from smem into c, I would have to read it a different way, but then there would be a lot of bank conflicts, so I would get much less performance... So how can I load the Circles coalesced, without bank conflicts, and then write them back the same way?

P.S. Does a structure bigger than 4 floats still end up in registers?


Update: the newest version is:

    #define CF (9) // 9 because the struct has 8 floats, so 1 extra float per circle is padding

    i = blockIdx.x * blockDim.x;
    smem[threadIdx.x + blockDim.x * 0 + blockDim.x * 0 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 0);
    smem[threadIdx.x + blockDim.x * 1 + blockDim.x * 1 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 1);
    smem[threadIdx.x + blockDim.x * 2 + blockDim.x * 2 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 2);
    smem[threadIdx.x + blockDim.x * 3 + blockDim.x * 3 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 3);
    smem[threadIdx.x + blockDim.x * 4 + blockDim.x * 4 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 4);
    smem[threadIdx.x + blockDim.x * 5 + blockDim.x * 5 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 5);
    smem[threadIdx.x + blockDim.x * 6 + blockDim.x * 6 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 6);
    smem[threadIdx.x + blockDim.x * 7 + blockDim.x * 7 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 7);

    c.x      = smem[threadIdx.x * CF + 0];
    c.y      = smem[threadIdx.x * CF + 1];
    c.prevX  = smem[threadIdx.x * CF + 2];
    c.prevY  = smem[threadIdx.x * CF + 3];
    c.speedX = smem[threadIdx.x * CF + 4];
    c.speedY = smem[threadIdx.x * CF + 5];
    c.mass   = smem[threadIdx.x * CF + 6];
    c.radius = smem[threadIdx.x * CF + 7];

Is this the right way to coalesce gmem access using smem? I mean, I'm worried about blockDim.x * 1 / (CF - 1) + threadIdx.x / (CF - 1). I guess I didn't get any speedup because this still doesn't let gmem coalesce reads across more than one Circle, but I can't figure out how to make it coalesce two Circles at once.

1 Answer

Answer by Jonas Bötel (accepted)

Disclaimer

Note that this answer contains more questions than answers. Also note that I'm guessing a lot, because I don't fully understand large parts of your question and source code.

Reconstruction

So I'm guessing that your global memory holds an array of Circle structs. You seem to have optimized loading these circles by loading each of their floats separately into shared memory. This way you get contiguous access patterns instead of strided ones. Am I still correct here?

So now that you have cooperatively loaded blockDim.x circles into shared memory, you want each thread to read one circle c from it. You seem to have tried 3 different ways:

  1. loading c from strided shared memory
    (c.prevX = smem[threadIdx.x + blockDim.x * 2];, etc.)
  2. loading c directly from shared memory
    (c = *((Circle*)(smem + threadIdx.x * SMEM));)
  3. loading c directly from global memory
    (c = cOut[j];)

Still correct?

Evaluation

  1. doesn't make any sense when you load circles into shared memory the way I described above. So you probably tried a different loading pattern there, something along the lines of [threadIdx.x * 8 + 0] as noted in your comment. That variant has the benefit of contiguous global access, but the stores into smem suffer bank conflicts.
  2. is no better because it has bank conflicts when reading into registers.
  3. is worse because of strided global memory access.

Answer

Bank conflicts are easily resolved by inserting dummy values. Instead of using [threadIdx.x * 8 + 0] you would use [threadIdx.x * 9 + 0]. Note that you are wasting a bit of shared memory (i.e. every ninth float) to spread the data out across the banks, and that you have to apply the same padding when loading the data into shared memory in the first place. But notice that you are still doing a lot of work just to get these Circle structs in there, which leads me to an even better answer below.
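
Roughly sketched (illustrative only, not your exact kernel; FLOATS_PER_CIRCLE, PADDED_STRIDE and collideKernel are made-up names, and I'm assuming one circle per thread and blockDim.x circles per block):

    #define FLOATS_PER_CIRCLE 8
    #define PADDED_STRIDE     9   // 8 real floats + 1 dummy per circle

    __global__ void collideKernel(Circle* cOut)
    {
        // Launch with blockDim.x * PADDED_STRIDE * sizeof(float) bytes of dynamic smem.
        extern __shared__ float smem[];

        const float* gmem = (const float*)(cOut + blockIdx.x * blockDim.x);

        // Cooperative, contiguous load: in each iteration the warp reads 32
        // consecutive floats from gmem and scatters them into padded smem slots.
        #pragma unroll
        for (int k = 0; k < FLOATS_PER_CIRCLE; ++k)
        {
            int flat   = threadIdx.x + blockDim.x * k;   // contiguous gmem index
            int circle = flat / FLOATS_PER_CIRCLE;       // which circle this float belongs to
            int field  = flat % FLOATS_PER_CIRCLE;       // which field of that circle
            smem[circle * PADDED_STRIDE + field] = gmem[flat];
        }
        __syncthreads();

        // Per-thread read of "its" circle: the stride of 9 floats is coprime with
        // the 32 banks, so the threads of a warp hit 32 different banks.
        Circle c;
        c.x      = smem[threadIdx.x * PADDED_STRIDE + 0];
        c.y      = smem[threadIdx.x * PADDED_STRIDE + 1];
        c.prevX  = smem[threadIdx.x * PADDED_STRIDE + 2];
        c.prevY  = smem[threadIdx.x * PADDED_STRIDE + 3];
        c.speedX = smem[threadIdx.x * PADDED_STRIDE + 4];
        c.speedY = smem[threadIdx.x * PADDED_STRIDE + 5];
        c.mass   = smem[threadIdx.x * PADDED_STRIDE + 6];
        c.radius = smem[threadIdx.x * PADDED_STRIDE + 7];

        // ... collision handling with c, then write the results back ...
    }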

Even better answer

Just don't use an array of Circle structs in global memory. Invert your memory layout by using multiple arrays of float instead, one for each component of a Circle (a structure-of-arrays layout). You can then simply load into registers directly.

c.x = gmem_x[j];
c.y = gmem_y[j];
...

No more shared memory at all, fewer registers due to less pointer arithmetic, contiguous global access patterns, no bank conflicts. All of it for free!
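
A minimal sketch of what I mean (gmem_x, gmem_y and the other array names are illustrative; each would be a separate float array allocated with cudaMalloc):

    // Structure-of-arrays layout: one float array per Circle field. Consecutive
    // threads read consecutive floats, so every load is coalesced and no shared
    // memory is needed at all.
    __global__ void collideKernelSoA(float* gmem_x,      float* gmem_y,
                                     float* gmem_prevX,  float* gmem_prevY,
                                     float* gmem_speedX, float* gmem_speedY,
                                     float* gmem_mass,   float* gmem_radius,
                                     int count)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= count)
            return;

        Circle c;
        c.x      = gmem_x[j];       // each of these is a coalesced read across the warp
        c.y      = gmem_y[j];
        c.prevX  = gmem_prevX[j];
        c.prevY  = gmem_prevY[j];
        c.speedX = gmem_speedX[j];
        c.speedY = gmem_speedY[j];
        c.mass   = gmem_mass[j];
        c.radius = gmem_radius[j];

        // ... collision handling with c, then write the changed fields back the
        // same way, e.g. gmem_x[j] = c.x; ...
    }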

Now you might think there's a downside to it when preparing the data on the host and getting the results back. My best (and final) guess is that it will still be much faster overall, because you'll probably either launch the kernel every frame and visualize with a shader without ever transferring the data back to the host, or launch the kernel multiple times in a row before downloading the results. Correct?