I want to use streaming stores in my code on Intel MIC. I have a force_array and three variables, tempx, tempy and tempz. I need to do some computation and then store the results in another array that won't be used in the near future, so I felt streaming stores would be a better choice for performance. However, I am getting a segmentation fault, and I am not sure whether it comes from the load or the store. This code is preceded and followed by a few more lines, and the whole thing sits inside two for loops that are preceded by OpenMP directives. As it is a parallel program, I am not able to debug it well. Can anyone help me find the mistake(s)?
Thanks in advance! The code is given below:
for(k=0;k<np;k++) //np is the number of particles.
{
    for(j=k+1;j<np;j++)
    {
        __m512d y1, y2, y3, y4, y5, y6;
        y1 = _mm512_load_pd(force_array + k*nd + 0);
        y4 = _mm512_load_pd(&tempx);
        y1 = _mm512_sub_pd(y1,y4);
        y2 = _mm512_load_pd(force_array + k*nd + 1);
        y5 = _mm512_load_pd(&tempy);
        y2 = _mm512_sub_pd(y2,y5);
        y3 = _mm512_load_pd(force_array + k*nd + 2);
        y6 = _mm512_load_pd(&tempz);
        y3 = _mm512_sub_pd(y3,y6);
        _mm512_storenr_pd((f+k*nd+0), y1);
        _mm512_storenr_pd((f+k*nd+1), y2);
        _mm512_storenr_pd((f+k*nd+2), y3);
    }
}
_mm512_load_pd() requires the address that you are loading from to be 64-byte aligned. The arrays f and force_array will need to have their starting addresses 64-byte aligned, so they should be allocated with _mm_malloc(size, 64) or be declared with __attribute__((aligned(64))) for stack objects, as you have done.
I think the problem here is not the starting addresses but the addresses computed during your inner loop. If nd=3, that means that when k=1 the offset from the beginning of force_array will be 3 doubles, i.e. 24 bytes, which is not a multiple of 64. You will need to pad each of these force objects out to 8 doubles (64 bytes) to use aligned loads; otherwise you will need to use unaligned loads.
Best regards,
Alastair
P.S. y1 and y2 each load 8 doubles from addresses that are only 8 bytes apart. Are you sure this is what you meant to achieve?