Really basic SSE

I have a very simple program whose performance I am trying to improve. One way that I know will help is to use SSE3 (the machine I am working on supports it), but I have absolutely no idea how to do this. Here is a code snippet (C++):

int sum1, sum2, sum3, sum4;
for (int i=0; i<length; i+=4) {
  for (int j=0; j<length; j+=4) {
    sum1 = sum1 + input->value[i][j];
    sum2 = sum2 + input->value[i+1][j+1];
    sum3 = sum3 + input->value[i+2][j+3];
    sum4 = sum4 + input->value[i+3][j+4];    
  {
}

I've read a little about this, and understand the idea, but I have absolutely no idea how to implement this. Can somebody help me please? I think that this is fairly simple, particularly for my simple program, but sometimes getting started is the hardest part.

Thanks!

Answer by Mysticial (best answer):

Actually, in your case, it is not that simple. As it stands right now, your code is NOT vectorizable (at least not without significant loop transformations).

The reason for this is that you are also changing the row index (i, i+1, i+2, i+3) inside the inner loop. This breaks any chance of vectorizing the j iteration, because the four memory locations are no longer adjacent: they are in different rows of the matrix (you seem to be running down the matrix diagonally).

However, I get the feeling that you are trying to sum up all the elements in your matrix, and you actually intended your loop to be like this (and you had a number of typos too):

int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
for (int i=0; i<length; i++) {
  for (int j=0; j<length; j+=4) {
    sum1 = sum1 + input->value[i][j];
    sum2 = sum2 + input->value[i][j+1];
    sum3 = sum3 + input->value[i][j+2];
    sum4 = sum4 + input->value[i][j+3];    
  }
}

int total = sum1 + sum2 + sum3 + sum4;

If this is what you wanted, then it is very vectorizable. In C/C++ using intrinsics, this can be done as follows using just SSE2:

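__m128i sum = _mm_setzero_si128();  // four 32-bit partial sums, all zero
for (int i=0; i<length; i++) {
  for (int j=0; j<length; j+=4) {
    // Load 4 consecutive ints from row i (the address must be 16-byte aligned)
    __m128i val = _mm_load_si128((const __m128i*)&input->value[i][j]);
    sum = _mm_add_epi32(sum,val);  // add lane by lane
  }
}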
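
The loop above leaves four partial sums in the lanes of sum, which still have to be added together to get a single total (the equivalent of sum1 + sum2 + sum3 + sum4 in the scalar version). A minimal sketch of one way to do that with SSE2 only; the helper name horizontal_sum_epi32 is hypothetical, and the elements are assumed to be 32-bit ints:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: adds the four 32-bit lanes (a, b, c, d) of v into one int.
inline int horizontal_sum_epi32(__m128i v) {
    // Swap the upper and lower 64-bit halves and add:
    // lanes become (a+c, b+d, a+c, b+d).
    __m128i swapped64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, swapped64);
    // Swap adjacent 32-bit lanes and add: every lane now holds a+b+c+d.
    __m128i swapped32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, swapped32);
    // Extract the lowest lane as a scalar.
    return _mm_cvtsi128_si32(sum32);
}
```

After the loops finish, something like int total = horizontal_sum_epi32(sum); would give the overall sum. This is done once, outside the loops, so its cost is negligible.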

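Note that alignment restrictions apply: _mm_load_si128 requires each address to be 16-byte aligned (use _mm_loadu_si128 if that cannot be guaranteed), and length must be a multiple of 4. A lot more speedup can be gained by further unrolling the loop.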
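
To illustrate the alignment and unrolling points, here is a rough sketch rather than a tuned implementation. It assumes the elements are 32-bit ints and that each row is contiguous in memory; the function name sum_row_sse2 and its parameters are hypothetical. It uses unaligned loads so no alignment is required, unrolls by two with independent accumulators, and finishes leftover elements with a scalar loop so the length need not be a multiple of 4:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical helper: sums n ints starting at row.
// Works for any alignment and any n (assumption: elements are 32-bit ints).
inline int sum_row_sse2(const int* row, std::size_t n) {
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    std::size_t j = 0;
    // Unrolled by two: 8 ints per iteration, two independent accumulators.
    for (; j + 8 <= n; j += 8) {
        __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row + j));
        __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row + j + 4));
        acc0 = _mm_add_epi32(acc0, v0);
        acc1 = _mm_add_epi32(acc1, v1);
    }
    // Combine the two accumulators, then add up the four lanes.
    __m128i acc = _mm_add_epi32(acc0, acc1);
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    // Scalar tail for the elements that did not fill a block of 8.
    for (; j < n; ++j) {
        total += row[j];
    }
    return total;
}
```

For the matrix in the question, this would be called once per row, for example total += sum_row_sse2(input->value[i], length), assuming input->value[i] points to a contiguous row of int. Whether the unrolling actually helps depends on the compiler and the machine, so it is worth measuring.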