I am trying to optimize a loop as much as possible. Without optimization, the loop looks like this:

typedef typename std::make_unsigned< TEMPLATE >::type unsignedType;
unsignedType *pDest = ...;
auto pSrc = function();

for (int i = 0; i < size; ++i)
{
    pDest[i] = static_cast< unsignedType >(pSrc[i] + 1200);
}

I parallelized the loop with OpenMP:

#pragma omp parallel for shared(pDest, pSrc)
for (int i = 0; i < size; ++i)
{
    pDest[i] = static_cast< unsignedType >(pSrc[i] + 1200);
}

It is 10% faster!

I also tried to optimize it with memcpy, avoiding the conversion (so I changed the type of pSrc), but it is slower than the OpenMP version: I only gain 5%.

memcpy(pDest, pSrc, size * sizeof(unsignedType) );

Are there ways to optimize this even further, with OpenMP or other methods?
