How to add each byte of an 8-byte long integer?

1.3k views Asked by At

I'm learning how to use the Intel MMX and SSE instructions in a video application. I have an 8-byte word and I would like to add all 8 bytes and produce a single integer as result. The straightforward method is a series of 7 shifts and adds, but that is slow. What is the fastest way of doing this? Is there an MMX or SSE instruction for this?

This is the slow way of doing it

unsigned long PackedWord = whatever....
int byte1 = 0xff & (PackedWord);
int byte2 = 0xff & (PackedWord >> 8);
int byte3 = 0xff & (PackedWord >> 16);
int byte4 = 0xff & (PackedWord >> 24);
int byte5 = 0xff & (PackedWord >> 32);
int byte6 = 0xff & (PackedWord >> 40);
int byte7 = 0xff & (PackedWord >> 48);
int byte8 = 0xff & (PackedWord >> 56);
int sum = byte1 + byte2 + byte3 + byte4 + byte5 + byte6 + byte7 + byte8;
3

There are 3 answers

2
nickie On

Based on the suggestion of @harold, you'd want something like:

#include <emmintrin.h>

inline int bytesum(uint64_t pw)
{
  __m64 result = _mm_sad_pu8(*((__m64*) &pw), (__m64) 0LLU); // aka psadbw
  return _mm_cvtsi64_si32(result);
}
0
fuz On

I'm not an assembly guru but this code should be a little bit faster on platforms that don't have fancy SIMD instructions:

#include <stdint.h>

int bytesum(uint64_t pw) {
    uint64_t a, b, mask;

    mask = 0x00ff00ff00ff00ffLLU;
    a = (pw >> 8) & mask;
    b = pw & mask;
    pw = a + b;

    mask = 0x0000ffff0000ffffLLU;
    a = (pw >> 16) & mask;
    b = pw & mask;
    pw = a + b;

    return (pw >> 32) + (pw & 0xffffffffLLU);
}

The idea is that you first add every other byte, then every other word, and finally every other doubleworld.

0
Veedrac On

You can do this with a horizontal sum-by-multiply after one pairwise reduction:

uint16_t bytesum(uint64_t x) {
    uint64_t pair_bits = 0x0001000100010001LLU;
    uint64_t mask = pair_bits * 0xFF;

    uint64_t pair_sum = (x & mask) + ((x >> 8) & mask);
    return (pair_sum * pair_bits) >> (64 - 16);
}

This produces much leaner code than doing three pairwise reductions.