Streaming Stores Segmentation Fault on Intel MIC


I want to use streaming stores in my code on Intel MIC. I have an array, force_array, and three variables, tempx, tempy and tempz. I need to do some computation on them and store the results in another array that won't be read again in the near future, so streaming stores seemed like a good way to improve performance. However, I am getting a segmentation fault, and I am not sure whether the load or the store causes it. The snippet is preceded and followed by a few more lines of code, and the whole thing sits inside two for loops under OpenMP directives. Because the program is parallel, I am having trouble debugging it. Can anyone help me find the mistake(s)?

Thanks in advance! The code is given below:

    for(k=0;k<np;k++)    //np is the number of particles.
    {
      for(j=k+1;j<np;j++)
      {
        __m512d y1, y2, y3, y4, y5, y6;

        y1 = _mm512_load_pd(force_array + k*nd + 0);
        y4 = _mm512_load_pd(&tempx);
        y1 = _mm512_sub_pd(y1,y4);

        y2 = _mm512_load_pd(force_array + k*nd + 1);
        y5 = _mm512_load_pd(&tempy);
        y2 = _mm512_sub_pd(y2,y5);

        y3 = _mm512_load_pd(force_array + k*nd + 2);
        y6 = _mm512_load_pd(&tempz);
        y3 = _mm512_sub_pd(y3,y6);

        _mm512_storenr_pd((f+k*nd+0), y1);
        _mm512_storenr_pd((f+k*nd+1), y2);
        _mm512_storenr_pd((f+k*nd+2), y3);
      }
    }

There is 1 answer

Best answer, by amckinley

_mm512_load_pd() requires the address that you are loading from to be 64-byte aligned. _mm512_storenr_pd() has the same requirement for the address you store to.

The arrays f and force_array need 64-byte-aligned starting addresses: allocate them with _mm_malloc(size, 64), or declare them with __attribute__((aligned(64))) if they are stack objects, as you have done.
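
A quick way to confirm the starting addresses is to test the pointer value itself. The sketch below is only an illustration: it uses C11 `_Alignas` and `aligned_alloc` as portable stand-ins for the Intel-specific attribute and allocator, and `is_aligned64` and `alignment_demo` are hypothetical helper names, not part of any Intel header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on a 64-byte boundary, 0 otherwise. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 64u) == 0;
}

/* Hypothetical demo: one aligned stack object, one aligned heap block. */
static void alignment_demo(void)
{
    _Alignas(64) double on_stack[8] = {0};
    /* aligned_alloc wants a size that is a multiple of the alignment;
     * 8 doubles is 64 bytes, assuming sizeof(double) == 8. */
    double *on_heap = aligned_alloc(64, 8 * sizeof(double));

    assert(on_heap != NULL);
    assert(is_aligned64(on_stack));
    assert(is_aligned64(on_heap));
    /* Three doubles past an aligned base is 24 bytes in: not aligned. */
    assert(!is_aligned64(on_heap + 3));

    free(on_heap);
}
```

Dropping a check like `is_aligned64(force_array + k*nd)` in front of each aligned load or streaming store should show quickly which address is the offending one.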

I think the problem here is not the starting addresses but the addresses computed in your inner loop. If nd=3, then when k=1 the offset from the beginning of force_array will be 3 doubles, i.e. 24 bytes, which is not a multiple of 64.
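
The arithmetic can be spelled out in a few lines. This is just a sketch under the same assumption as above (nd=3, 8-byte doubles, base address already 64-byte aligned); `row_offset_bytes`, `row_is_aligned64` and `offset_demo` are made-up names for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of particle k's first component from the start of the
 * array, with nd doubles stored per particle. */
static size_t row_offset_bytes(size_t k, size_t nd)
{
    return k * nd * sizeof(double);
}

/* 1 if that offset is a multiple of 64, i.e. an aligned 512-bit access
 * would be legal there, assuming the array base itself is aligned. */
static int row_is_aligned64(size_t k, size_t nd)
{
    return row_offset_bytes(k, nd) % 64u == 0;
}

/* Hypothetical walk-through for nd = 3, assuming sizeof(double) == 8. */
static void offset_demo(void)
{
    assert(row_offset_bytes(1, 3) == 24);  /* k = 1: 24 bytes in      */
    assert(!row_is_aligned64(1, 3));       /* 24 % 64 != 0            */
    assert(row_is_aligned64(0, 3));        /* k = 0: offset 0         */
    assert(row_is_aligned64(8, 3));        /* k = 8: 192 = 3 * 64     */
}
```

So with nd=3 only every eighth particle starts on a 64-byte boundary, and the `+ 1` and `+ 2` addresses in the question are 8 and 16 bytes past the row start, so they are misaligned even for those rows.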

You will need to pad each of these force objects out to 8 doubles (64 bytes) to use aligned loads; otherwise you will need to use unaligned loads.
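
One possible way to do the padding is to round the per-particle stride up to a multiple of 8 doubles and index with that stride instead of nd. The sketch below assumes 8-byte doubles and uses C11 `aligned_alloc` in place of the Intel allocator so it builds anywhere; `padded_stride`, `alloc_padded` and `padded_demo` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static const size_t DOUBLES_PER_VECTOR = 8;  /* 64 bytes / 8-byte double */

/* Round nd up to the next multiple of 8 doubles. */
static size_t padded_stride(size_t nd)
{
    return (nd + DOUBLES_PER_VECTOR - 1) / DOUBLES_PER_VECTOR
           * DOUBLES_PER_VECTOR;
}

/* Allocate np zeroed rows of padded_stride(nd) doubles each, starting on
 * a 64-byte boundary. Returns NULL on failure or for an empty request. */
static double *alloc_padded(size_t np, size_t nd)
{
    /* A multiple of 64 bytes, assuming sizeof(double) == 8. */
    size_t bytes = np * padded_stride(nd) * sizeof(double);
    if (bytes == 0)
        return NULL;
    double *p = aligned_alloc(64, bytes);
    if (p != NULL)
        memset(p, 0, bytes);
    return p;
}

/* Every row of the padded layout starts on a 64-byte boundary. */
static void padded_demo(void)
{
    const size_t np = 4, nd = 3;
    const size_t stride = padded_stride(nd);
    double *force = alloc_padded(np, nd);

    assert(force != NULL);
    for (size_t k = 0; k < np; k++)
        assert((uintptr_t)(force + k * stride) % 64u == 0);
    free(force);
}
```

With this layout the loop would address a particle as `force + k * stride` rather than `force + k * nd`. The cost is memory: for nd=3, five of every eight doubles are padding.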

Best regards,

Alastair

P.S. y1 and y2 each load 8 doubles from addresses that are only 8 bytes apart. Are you sure this is what you meant to do?
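
To make the P.S. concrete: a 512-bit load reads 8 consecutive doubles, so two loads whose start indices differ by 1 read mostly the same data. A small sketch (the helper name `shared_elements` is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

static const size_t LANES = 8;  /* doubles per 512-bit register */

/* Number of array elements read by both of two 8-double loads that
 * start at element indices a and b. */
static size_t shared_elements(size_t a, size_t b)
{
    size_t gap = a > b ? a - b : b - a;
    return gap >= LANES ? 0 : LANES - gap;
}

/* The y1 and y2 loads in the question start one element apart. */
static void overlap_demo(void)
{
    const size_t k = 1, nd = 3;
    assert(shared_elements(k * nd + 0, k * nd + 1) == 7);
}
```

In other words, y1 and y2 share 7 of their 8 elements, and each one spans more than two particles' worth of data when nd=3.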