I'm learning how to use the Intel MMX and SSE instructions in a video application. I have an 8-byte word and I would like to add up all 8 bytes and produce a single integer as the result. The straightforward method is a series of 8 shift-and-mask steps followed by 7 adds, but that is slow. What is the fastest way of doing this? Is there an MMX or SSE instruction for this?
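This is the slow way of doing it:
#include <stdint.h>
/* uint64_t rather than unsigned long: unsigned long is only 32 bits on some
   platforms (e.g. 64-bit Windows), where the shifts by 32 or more below would be undefined */
uint64_t PackedWord = whatever....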
int byte1 = 0xff & (PackedWord);
int byte2 = 0xff & (PackedWord >> 8);
int byte3 = 0xff & (PackedWord >> 16);
int byte4 = 0xff & (PackedWord >> 24);
int byte5 = 0xff & (PackedWord >> 32);
int byte6 = 0xff & (PackedWord >> 40);
int byte7 = 0xff & (PackedWord >> 48);
int byte8 = 0xff & (PackedWord >> 56);
int sum = byte1 + byte2 + byte3 + byte4 + byte5 + byte6 + byte7 + byte8;
Based on the suggestion of @harold, you'd want something like:
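A minimal sketch, assuming the suggestion refers to the PSADBW instruction (sum of absolute differences of packed unsigned bytes), exposed in SSE2 as `_mm_sad_epu8`. Taking the absolute difference against an all-zero register leaves each byte unchanged, so the instruction's built-in horizontal add produces the byte sum directly. The function name `sum_bytes_sad` is hypothetical and chosen only for illustration.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: sums the 8 bytes of a 64-bit word with PSADBW. */
static inline int sum_bytes_sad(uint64_t packed)
{
    /* Load the 64-bit word into the low half of an XMM register;
       the upper 64 bits are zeroed by this load. */
    __m128i v = _mm_loadl_epi64((const __m128i *)&packed);

    /* PSADBW against zero: |byte - 0| summed across each 8-byte half.
       The sum for the low half lands in bits 0..15 of the result. */
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());

    /* Extract the low 32 bits; the sum is at most 8 * 255 = 2040. */
    return _mm_cvtsi128_si32(sad);
}
```

For example, `sum_bytes_sad(0x0807060504030201ULL)` returns 36. If the data is already in an XMM register (16 bytes at a time), the same `_mm_sad_epu8` call yields two partial sums, one per 8-byte half, so the load step can be skipped. The instruction also has a 64-bit MMX-register form, `_mm_sad_pu8`, which was introduced with SSE.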