SSE optimization of Gaussian blur


I'm working on a school project and have to optimize part of the code with SSE, but I've been stuck on one part for a few days now.

I don't see any smart way of using SSE vector instructions (inline assembler or intrinsic functions) in this code (it's part of a Gaussian blur algorithm). I would be glad if somebody could give me just a small hint.

for (int x = x_start; x < x_end; ++x)     // vertical blur...
    {
        float sum = image[x + (y_start - radius - 1)*image_w];
        float dif = -sum;

        for (int y = y_start - 2*radius - 1; y < y_end; ++y)
        {                                                   // inner vertical Radius loop           
            float p = (float)image[x + (y + radius)*image_w];   // next pixel
            buffer[y + radius] = p;                         // buffer pixel
            sum += dif + fRadius*p;
            dif += p;                                       // accumulate pixel blur

            if (y >= y_start)
            {
                float s = 0, w = 0;                         // border blur correction
                sum -= buffer[y - radius - 1]*fRadius;      // addition for fraction blur
                dif += buffer[y - radius] - 2*buffer[y];    // sum up differences: +1, -2, +1

                // cut off accumulated blur area of pixel beyond the border
                // assume: added pixel values beyond border = value at border
                p = (float)(radius - y);                   // top part to cut off
                if (p > 0)
                {
                    p = p*(p-1)/2 + fRadius*p;
                    s += buffer[0]*p;
                    w += p;
                }
                p = (float)(y + radius - image_h + 1);               // bottom part to cut off
                if (p > 0)
                {
                    p = p*(p-1)/2 + fRadius*p;
                    s += buffer[image_h - 1]*p;
                    w += p;
                }
                new_image[x + y*image_w] = (unsigned char)((sum - s)/(weight - w)); // set blurred pixel
            }
            else if (y + radius >= y_start)
            {
                dif -= 2*buffer[y];
            }
        } // for y
    } // for x

There is 1 answer

klm123 On BEST ANSWER
  1. One more feature you can use is logical operations and masks. For example, instead of:

  // processes only 1 float
if (p > 0)
    p = p*(p-1)/2 + fRadius*p;

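you can write:

  // processes 4 floats at once; here p is a __m128 holding four p values
const __m128 vRadius = _mm_set1_ps(fRadius);                   // fRadius broadcast to all 4 lanes
const __m128 mask = _mm_cmpgt_ps(p, _mm_setzero_ps());         // all ones in lanes where p > 0
const __m128 p_tmp = _mm_add_ps(                               // p*(p-1)/2 + fRadius*p
    _mm_mul_ps(_mm_mul_ps(p, _mm_sub_ps(p, _mm_set1_ps(1.0f))), _mm_set1_ps(0.5f)),
    _mm_mul_ps(vRadius, p));
p = _mm_or_ps(_mm_and_ps(mask, p_tmp), _mm_andnot_ps(mask, p)); // = (p_tmp & mask) | (p & ~mask)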
  2. I can also recommend using a library that overloads operators for SIMD types. For example: http://code.compeng.uni-frankfurt.de/projects/vc

  3. The dif variable makes the iterations of the inner loop dependent on each other, so you should try to parallelize the outer loop instead. Without operator overloading, though, the code will become unmanageable.

  4. Also consider rethinking the whole algorithm. The current one doesn't look parallel. Maybe you can sacrifice some precision, or accept a bit more scalar work?
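
To illustrate the suggestion about parallelizing the outer loop, here is a minimal sketch of how the sum/dif recurrence from the question could be run for four adjacent columns at once, one column per SSE lane. It is not the full blur: border correction and the buffer are left out, sum and dif simply start at zero, and the names load4_u8_as_float, accumulate_column_scalar, accumulate_columns_sse and sum_out are made up for this example. It assumes SSE2 and an image of unsigned char pixels, as in the question.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE)
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: load 4 adjacent 8-bit pixels and widen them to 4 floats.
static inline __m128 load4_u8_as_float(const unsigned char* src)
{
    std::int32_t packed;
    std::memcpy(&packed, src, sizeof packed);   // 4-byte load, no alignment assumption
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(packed);      // 4 bytes in the low 32 bits
    v = _mm_unpacklo_epi8(v, zero);             // 8-bit  -> 16-bit, zero-extended
    v = _mm_unpacklo_epi16(v, zero);            // 16-bit -> 32-bit, zero-extended
    return _mm_cvtepi32_ps(v);                  // int32  -> float
}

// Scalar reference: the sum/dif recurrence for a single column x.
// Simplification: sum and dif start at zero and no border handling is done.
void accumulate_column_scalar(const unsigned char* image, int image_w, int rows,
                              float fRadius, int x, float* sum_out)
{
    float sum = 0.0f, dif = 0.0f;
    for (int y = 0; y < rows; ++y) {
        const float p = (float)image[x + y * image_w];
        sum += dif + fRadius * p;
        dif += p;
        sum_out[x + y * image_w] = sum;
    }
}

// SSE version: the same recurrence for columns x, x+1, x+2, x+3 in parallel.
// The dependency between iterations runs along y, so each lane stays independent.
void accumulate_columns_sse(const unsigned char* image, int image_w, int rows,
                            float fRadius, int x, float* sum_out)
{
    assert(x + 4 <= image_w);                   // all four columns must exist
    const __m128 vRadius = _mm_set1_ps(fRadius);
    __m128 sum = _mm_setzero_ps();
    __m128 dif = _mm_setzero_ps();
    for (int y = 0; y < rows; ++y) {
        const __m128 p = load4_u8_as_float(image + x + y * image_w);
        sum = _mm_add_ps(sum, _mm_add_ps(dif, _mm_mul_ps(vRadius, p)));
        dif = _mm_add_ps(dif, p);
        _mm_storeu_ps(sum_out + x + y * image_w, sum);   // unaligned store of 4 sums
    }
}
```

The outer loop would then step x by 4, with a scalar loop for any leftover columns. One thing that seems to help with this layout: in the question's code the border terms (radius - y and y + radius - image_h + 1) depend only on y, so those if branches are the same for all four lanes and could probably stay scalar. The buffer would need to hold four floats per row instead of one.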