Optimizing loop with few instructions(SSE2, SSE4) with TBB

Question

Asked by prgbenz on 10 February 2011 at 02:58

I have a simple image-processing algorithm. Briefly, a float image (mean) is subtracted from an 8-bit image (src), and the result is saved to a float image (dest).

The function is written mainly with intrinsics.

I tried to optimize it with TBB's parallel_for, but instead of a speedup I got a slowdown.

What should I do? Should I use a lower-level scheme, such as TBB tasks, to optimize the code?

float           *m, **m_data,
                *o, **o_data;
unsigned char   *p, **src_data;
register unsigned long len, i;
unsigned long   nr,
                nc;

src_data    =   src->UByteData;    // 2d array
m_data      =   mean->FloatData;   // 2d array
o_data      =   dest->FloatData;   // 2d array
nr          =   src->Rows;
nc          =   src->Cols;

__m128i xmm0;

for(i=0; i<nr; i++)
{
    m = m_data[i];
    o = o_data[i];
    p = src_data[i];
    len = nc;   // the do/while below assumes nc is a non-zero multiple of 16
    do
    {
        _mm_prefetch((const char *)(p + 16),  _MM_HINT_NTA);
        _mm_prefetch((const char *)(m + 16),  _MM_HINT_NTA);

        // load 16 unsigned 8-bit pixels (p must be 16-byte aligned)
        xmm0 = _mm_load_si128((__m128i *) (p));

        // pixels 0..3: widen to 32-bit ints, convert to float, subtract mean
        _mm_stream_ps(
                        o,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 0))),
                                    _mm_load_ps(m)
                                )
                    );
        // pixels 4..7
        _mm_stream_ps(
                        o + 4,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 4))),
                                    _mm_load_ps(m + 4)
                                )
                    );
        // pixels 8..11
        _mm_stream_ps(
                        o + 8,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 8))),
                                    _mm_load_ps(m + 8)
                                )
                    );
        // pixels 12..15
        _mm_stream_ps(
                        o + 12,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 12))),
                                    _mm_load_ps(m + 12)
                                )
                    );

        p += 16;
        m += 16;
        o += 16;
        len -= 16;
    }
    while(len);
}

There is 1 answer

**Paul R** · Accepted Answer · 2011-02-10T08:46:30+00:00

You are doing almost no computation here, relative to the number of loads and stores, so it's likely that you are being limited by memory bandwidth rather than computation. This would explain why you don't see any improvement in throughput when you optimise the computation.
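A rough way to see this is to count the bytes moved per pixel: the loop reads 1 byte from src and 4 bytes from mean, and writes 4 bytes to dest, so about 9 bytes of traffic for one conversion and one subtraction. The sketch below turns that into a lower bound on run time. The function name and the bandwidth figure are assumptions for illustration; the real sustained bandwidth has to be measured on the target machine.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per pixel by the loop in the question:
// 1 byte read from src, 4 bytes read from mean, 4 bytes written to dest.
constexpr std::size_t kBytesPerPixel = sizeof(unsigned char) + 2 * sizeof(float);

// Lower bound on run time in seconds if memory bandwidth is the only limit.
// bandwidth_bytes_per_s is an assumed figure, not a measured one.
double min_seconds(std::size_t rows, std::size_t cols, double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    const double total_bytes =
        static_cast<double>(rows) * static_cast<double>(cols) * kBytesPerPixel;
    return total_bytes / bandwidth_bytes_per_s;
}
```

If a single thread already runs close to this bound, adding more threads cannot help much, because all cores share the same memory bus.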

I would get rid of the _mm_prefetch instructions though - they are almost certainly not helping here and may even be hurting performance.
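For illustration, here is a minimal sketch of one row of the loop without the prefetches. It is not the code from the question: the function name is made up, and it widens the bytes with SSE2 unpack instructions instead of _mm_cvtepu8_epi32 so that it builds without SSE4.1 compiler flags. It assumes the row length is a multiple of 16 and that all three pointers are 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// dest[i] = float(src[i]) - mean[i] for one row, with no software prefetch.
// Assumptions: n is a multiple of 16; src, mean and dest are 16-byte aligned.
void sub_mean_row(const unsigned char* src, const float* mean, float* dest, std::size_t n)
{
    assert(n % 16 == 0);
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; i += 16) {
        // load 16 unsigned 8-bit pixels
        const __m128i px = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        // zero-extend u8 -> u16 -> i32 by interleaving with zero
        const __m128i lo16 = _mm_unpacklo_epi8(px, zero);   // pixels 0..7
        const __m128i hi16 = _mm_unpackhi_epi8(px, zero);   // pixels 8..15
        const __m128i w0 = _mm_unpacklo_epi16(lo16, zero);  // pixels 0..3
        const __m128i w1 = _mm_unpackhi_epi16(lo16, zero);  // pixels 4..7
        const __m128i w2 = _mm_unpacklo_epi16(hi16, zero);  // pixels 8..11
        const __m128i w3 = _mm_unpackhi_epi16(hi16, zero);  // pixels 12..15
        // convert to float, subtract the mean, write with non-temporal stores
        _mm_stream_ps(dest + i,      _mm_sub_ps(_mm_cvtepi32_ps(w0), _mm_load_ps(mean + i)));
        _mm_stream_ps(dest + i + 4,  _mm_sub_ps(_mm_cvtepi32_ps(w1), _mm_load_ps(mean + i + 4)));
        _mm_stream_ps(dest + i + 8,  _mm_sub_ps(_mm_cvtepi32_ps(w2), _mm_load_ps(mean + i + 8)));
        _mm_stream_ps(dest + i + 12, _mm_sub_ps(_mm_cvtepi32_ps(w3), _mm_load_ps(mean + i + 12)));
    }
    _mm_sfence();  // order the non-temporal stores before later stores
}
```

The access pattern is purely sequential, which is the case hardware prefetchers usually handle well on their own, so it is worth timing this against the prefetching version rather than assuming either is faster.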

If possible you should combine this loop with any other operations that you are doing before/after this - that way you amortise the cost of memory I/O over more computation.
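As a sketch of what combining means, assume the next step multiplies every pixel by a gain (a hypothetical follow-on operation, not something from the question). The two-pass version writes dest to memory and reads it back again; the fused version touches each pixel once. Plain scalar code is used here to keep the idea visible; the same restructuring applies to the intrinsics loop.

```cpp
#include <cassert>
#include <cstddef>

// Two passes: dest is written out, then read back and rewritten by the next step.
void subtract_then_scale(const unsigned char* src, const float* mean,
                         float* dest, std::size_t n, float gain)
{
    assert(src && mean && dest);
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = static_cast<float>(src[i]) - mean[i];
    for (std::size_t i = 0; i < n; ++i)
        dest[i] *= gain;  // hypothetical follow-on step
}

// One pass: same result, but dest is written once and never re-read.
void subtract_and_scale_fused(const unsigned char* src, const float* mean,
                              float* dest, std::size_t n, float gain)
{
    assert(src && mean && dest);
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = (static_cast<float>(src[i]) - mean[i]) * gain;
}
```

For images larger than the cache, the fused version moves roughly 9 bytes per pixel instead of about 17 (9 for the first pass plus 4 read and 4 written in the second), while doing the same arithmetic.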
