I'd like to implement a fast correlation coefficient computation using SSE/AVX2. The operands are two unsigned char
vectors. The function should be mostly equivalent to this:
float correlate_simple(const unsigned char* vec1, const unsigned char* vec2, size_t length)
{
    int sum1 = 0;
    int sum2 = 0;
    int sum11 = 0;
    int sum22 = 0;
    int sum12 = 0;
    for (size_t i = length; i > 0; --i, ++vec1, ++vec2) {
        sum1 += *vec1;
        sum2 += *vec2;
        sum11 += *vec1 * *vec1;
        sum22 += *vec2 * *vec2;
        sum12 += *vec1 * *vec2;
    }
    double mean1 = double(sum1) / double(length);
    double mean2 = double(sum2) / double(length);
    double mean11 = double(sum11) / double(length);
    double mean22 = double(sum22) / double(length);
    double mean12 = double(sum12) / double(length);
    double b = (mean11 - mean1 * mean1) * (mean22 - mean2 * mean2);
    if (b <= 0.0)
        return 0.0f;
    double a = (mean12 - mean1 * mean2);
    return float(a / sqrt(b));
}
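For reference, this is the usual one-pass form of the Pearson coefficient, r = (E[xy] - E[x]E[y]) / sqrt((E[x^2] - E[x]^2)(E[y^2] - E[y]^2)), with the expectations taken over the two byte arrays.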
The parameter length ranges from 1 to less than 1000.
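That bound also means plain int accumulators are safe: the largest of the five sums is at most 999 * 255 * 255 = 64,959,975, far below INT_MAX.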
While researching how to implement an inner product of two unsigned byte arrays, I could not come up with a solution that avoids converting all the unsigned 8-bit values to signed 16-bit values.
The intrinsic _mm256_maddubs_epi16(a, b) expects b to hold signed bytes. That would not be a problem in this case, since subtracting some constant (here: 127) from b does not change the correlation coefficient. Unfortunately, I could not find an intrinsic that subtracts 127 from unsigned bytes and produces signed bytes (without relying on some two's complement magic).
// vec: const unsigned char*
auto x = _mm256_load_si256((const __m256i*) vec);
auto v = _mm256_set1_epi8(127);
// wrong if vec[i] is less than 127:
auto x_centered = _mm256_sub_epi8(x, v);
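(As an aside: a bias of 128 would cover the signed range [-128, 127] exactly, whereas with 127 the value 255 maps to 128, which does not fit in a signed byte. And in two's complement the wrap-around of _mm256_sub_epi8 already produces the correct bit pattern, e.g. 0 - 128 wraps to 0x80 = -128, so the following would work, though it is precisely the kind of magic I would prefer to avoid:)
auto v128 = _mm256_set1_epi8((char)0x80);  // bias of 128
// same bit pattern as _mm256_xor_si256(x, v128):
auto x_centered = _mm256_sub_epi8(x, v128);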
What would be the best approach here to compute inner products (and finally a correlation coefficient)?
Addendum:
Below is my current implementation of a pure inner product. I decided to convert to 16-bit integers to avoid overflow errors. Update: changed from reading 128 bits to reading 256 bits at a time.
int accumulate_i32(__m256i x)
{
    // add the upper half of each 128-bit lane onto its lower half
    auto tmp1 = _mm256_srli_si256(x, 8);
    x = _mm256_add_epi32(x, tmp1);
    // add the upper 128-bit lane onto the lower one
    auto tmp2 = _mm256_extractf128_si256(x, 1);
    tmp2 = _mm_add_epi32(tmp2, _mm256_castsi256_si128(x));
    // the total is now spread over the two lowest 32-bit elements
    return _mm_cvtsi128_si32(tmp2) + _mm_extract_epi32(tmp2, 1);
}
int inner_product_avx(const unsigned char* vec1, const unsigned char* vec2, unsigned int length)
{
    constexpr unsigned int memoryAlignmentBytes = 32;
    constexpr unsigned int bytesPerPack = 256 / 8;

    assert((reinterpret_cast<std::uintptr_t>(vec1) % memoryAlignmentBytes) == 0);
    assert((reinterpret_cast<std::uintptr_t>(vec2) % memoryAlignmentBytes) == 0);

    // compute middle part via AVX2
    unsigned int packCount = length / bytesPerPack;
    const __m256i zeros = _mm256_setzero_si256();
    auto sumlo = _mm256_setzero_si256();
    auto sumhi = _mm256_setzero_si256();
    for (unsigned int packIdx = 0; packIdx < packCount; ++packIdx) {
        auto x1 = _mm256_load_si256((const __m256i*)vec1);
        auto x2 = _mm256_load_si256((const __m256i*)vec2);
        // zero-extend bytes to 16 bit; unpack works per 128-bit lane,
        // but both vectors are shuffled identically, so the products
        // still pair up correctly
        auto x1lo = _mm256_unpacklo_epi8(x1, zeros);
        auto x1hi = _mm256_unpackhi_epi8(x1, zeros);
        auto x2lo = _mm256_unpacklo_epi8(x2, zeros);
        auto x2hi = _mm256_unpackhi_epi8(x2, zeros);
        auto tmplo = _mm256_madd_epi16(x1lo, x2lo);
        auto tmphi = _mm256_madd_epi16(x1hi, x2hi);
        sumlo = _mm256_add_epi32(sumlo, tmplo);
        sumhi = _mm256_add_epi32(sumhi, tmphi);
        vec1 += bytesPerPack;
        vec2 += bytesPerPack;
    }
    int sum = accumulate_i32(sumlo) + accumulate_i32(sumhi);

    // compute remaining part that cannot be represented as a
    // whole packed integer
    unsigned int packRestCount = length % bytesPerPack;
    for (size_t i = packRestCount; i > 0; --i, ++vec1, ++vec2)
        sum += int(*vec1) * int(*vec2);

    return sum;
}
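For completeness, a call site has to honor the alignment asserts; something along these lines (the alignas buffers here are just for illustration):
alignas(32) unsigned char a[1000];
alignas(32) unsigned char b[1000];
// ... fill a and b ...
int dot = inner_product_avx(a, b, 1000);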
This takes roughly 20% of the time of the simple C++ implementation (see below). Given that the AVX code works on 16 16-bit integers simultaneously, I would have expected a larger gain. Is this reasonable, or did I miss something?
Unrolling the last loop in the AVX code did not decrease computation time.
int inner_product_simple(const unsigned char* vec1, const unsigned char* vec2, size_t length)
{
    int sum = 0;
    for (size_t i = length; i > 0; --i, ++vec1, ++vec2)
        sum += int(*vec1) * int(*vec2);
    return sum;
}
I would start from something like the sketch below. It uses 32-bit accumulators, just like your current code. Untested.
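Roughly this (it reuses your accumulate_i32 for the horizontal sums, computes all five sums in one pass, and uses unaligned loads so the alignment asserts are not needed):
float correlate_avx2(const unsigned char* vec1, const unsigned char* vec2, size_t length)
{
    const __m256i zeros = _mm256_setzero_si256();
    const __m256i ones16 = _mm256_set1_epi16(1);
    __m256i acc1 = zeros, acc2 = zeros;
    __m256i acc11 = zeros, acc22 = zeros, acc12 = zeros;

    for (size_t packIdx = 0, packCount = length / 32; packIdx < packCount;
         ++packIdx, vec1 += 32, vec2 += 32) {
        __m256i x1 = _mm256_loadu_si256((const __m256i*)vec1);
        __m256i x2 = _mm256_loadu_si256((const __m256i*)vec2);
        // zero-extend bytes to 16 bit, identically for both vectors
        __m256i x1lo = _mm256_unpacklo_epi8(x1, zeros);
        __m256i x1hi = _mm256_unpackhi_epi8(x1, zeros);
        __m256i x2lo = _mm256_unpacklo_epi8(x2, zeros);
        __m256i x2hi = _mm256_unpackhi_epi8(x2, zeros);
        // products of 16-bit pairs, accumulated in 32-bit lanes
        acc12 = _mm256_add_epi32(acc12, _mm256_madd_epi16(x1lo, x2lo));
        acc12 = _mm256_add_epi32(acc12, _mm256_madd_epi16(x1hi, x2hi));
        acc11 = _mm256_add_epi32(acc11, _mm256_madd_epi16(x1lo, x1lo));
        acc11 = _mm256_add_epi32(acc11, _mm256_madd_epi16(x1hi, x1hi));
        acc22 = _mm256_add_epi32(acc22, _mm256_madd_epi16(x2lo, x2lo));
        acc22 = _mm256_add_epi32(acc22, _mm256_madd_epi16(x2hi, x2hi));
        // plain sums: multiply-add against constant 1
        acc1 = _mm256_add_epi32(acc1, _mm256_madd_epi16(x1lo, ones16));
        acc1 = _mm256_add_epi32(acc1, _mm256_madd_epi16(x1hi, ones16));
        acc2 = _mm256_add_epi32(acc2, _mm256_madd_epi16(x2lo, ones16));
        acc2 = _mm256_add_epi32(acc2, _mm256_madd_epi16(x2hi, ones16));
    }

    int sum1 = accumulate_i32(acc1);
    int sum2 = accumulate_i32(acc2);
    int sum11 = accumulate_i32(acc11);
    int sum22 = accumulate_i32(acc22);
    int sum12 = accumulate_i32(acc12);

    // scalar tail for the last length % 32 elements
    for (size_t i = length % 32; i > 0; --i, ++vec1, ++vec2) {
        sum1 += *vec1;
        sum2 += *vec2;
        sum11 += int(*vec1) * *vec1;
        sum22 += int(*vec2) * *vec2;
        sum12 += int(*vec1) * *vec2;
    }

    double mean1 = double(sum1) / double(length);
    double mean2 = double(sum2) / double(length);
    double b = (double(sum11) / double(length) - mean1 * mean1)
             * (double(sum22) / double(length) - mean2 * mean2);
    if (b <= 0.0)
        return 0.0f;
    double a = double(sum12) / double(length) - mean1 * mean2;
    return float(a / sqrt(b));
}
Overflow-wise this is fine for your length bound: each _mm256_madd_epi16 contributes at most 2 * 255 * 255 per 32-bit lane, and with fewer than 32 iterations the accumulators stay well below INT_MAX.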