This is related to Power4 and its lack of the vector long long type. On Power7 and Power8 we can write:
typedef __vector unsigned long long uint64x2_p;
...
uint64x2_p val = {...};
uint64x2_p res = vec_rl(val, bits);
I need to find a workaround for the missing 64-bit vector type and rotate on Power4. I think there are two strategies: (1) rotate in C/C++, or (2) use 32-bit vector types. I'm guessing (2) is the faster strategy given the data is already in a vector register.
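For comparison, strategy (1) might look something like the sketch below. It assumes the vector has already been stored to a two-element uint64_t array, and the names RotateLeft64 and RotateLeft64x2 are made up for illustration rather than taken from any library.

```cpp
#include <cstdint>

// Hypothetical helper: rotate a 64-bit value left by R bits in plain C++.
// Masking both shift counts to 0..63 keeps R == 0 (and R >= 64) well defined.
template <unsigned int R>
inline uint64_t RotateLeft64(uint64_t x)
{
    return (x << (R & 63)) | (x >> ((64 - R) & 63));
}

// Apply the scalar rotate to both 64-bit halves of a 16-byte block.
// Assumes the caller has already copied the vector out to memory.
template <unsigned int R>
inline void RotateLeft64x2(uint64_t v[2])
{
    v[0] = RotateLeft64<R>(v[0]);
    v[1] = RotateLeft64<R>(v[1]);
}
```

The cost I'm worried about with this route is the round trip: the data has to leave the vector register, be rotated in general-purpose registers, and then be loaded back.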
I feel like this problem was solved long ago since there's nothing special about a double-word rotate. Unfortunately search is not returning useful hits: "power4" "doubleword" rotate.
I think I have the basic algorithm, which consists of three LOADs, two SHIFTs, two PERMs and an OR. But I'm not sure if there's a better approach.
How do I perform a 64-bit rotate when working on Power4, which lacks the double-word rotate?
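#include <altivec.h>

typedef __vector unsigned char uint8x16_p;
typedef __vector unsigned int  uint32x4_p;

// Rotate each 64-bit element of val left by R bits using 32-bit operations
template <unsigned int R>
inline uint32x4_p VecRotateLeft64(const uint32x4_p val)
{
    enum {R64 = R%64};           // effective rotate amount
    enum {LSHIFT = R64%32};
    enum {RSHIFT = 32 - LSHIFT};
    enum {PERMUTE = R64 >= 32};  // rotating by 32 or more swaps the words

    // Swaps the two 32-bit words inside each 64-bit element
    const uint8x16_p mask = {4,5,6,7, 0,1,2,3, 12,13,14,15, 8,9,10,11};

    uint32x4_p result = val;
    // vec_sl and vec_sr use the count modulo 32, so a zero shift must be skipped
    if (LSHIFT != 0)
    {
        const uint32x4_p lbits = {LSHIFT,LSHIFT,LSHIFT,LSHIFT};
        const uint32x4_p rbits = {RSHIFT,RSHIFT,RSHIFT,RSHIFT};
        const uint32x4_p left = vec_sl(val, lbits);
        uint32x4_p right = vec_sr(val, rbits);
        // Move the bits shifted out of each word into the neighboring word
        right = vec_perm(right, right, mask);
        result = vec_or(left, right);
    }

    // Permute left and right parts of 64-bit word as needed
    if (PERMUTE)
        result = vec_perm(result, result, mask);
    return result;
}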
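
Since I can't always test on Power hardware, a scalar model of the shift/permute/OR idea can be checked against a plain 64-bit rotate on any machine. This is only a sketch under my own assumptions: Words64, ModelRotateLeft64, RefRotateLeft64 and Join are illustrative names, and the model treats a rotate by a multiple of 32 as a pure word swap because vec_sl and vec_sr only use the shift count modulo 32.

```cpp
#include <cstdint>

// Scalar model of one 64-bit element held as two 32-bit words
// (hi is the most significant word). Names here are illustrative only.
struct Words64 { uint32_t hi, lo; };

template <unsigned int R>
inline Words64 ModelRotateLeft64(Words64 v)
{
    const unsigned int r = R % 64;   // effective rotate amount
    const unsigned int s = r % 32;   // per-word shift amount
    Words64 res = v;
    if (s != 0)
    {
        // Shift each word left, then OR in the bits that spilled out of
        // the neighboring word (the role of the first permute).
        res.hi = (v.hi << s) | (v.lo >> (32 - s));
        res.lo = (v.lo << s) | (v.hi >> (32 - s));
    }
    if (r >= 32)
    {
        // Rotating by 32 or more also swaps the words (the second permute).
        const uint32_t t = res.hi;
        res.hi = res.lo;
        res.lo = t;
    }
    return res;
}

// Reference: plain 64-bit rotate used for comparison.
inline uint64_t RefRotateLeft64(uint64_t x, unsigned int r)
{
    r &= 63;
    return r ? (x << r) | (x >> (64 - r)) : x;
}

// Reassemble the two words into a single 64-bit value.
inline uint64_t Join(Words64 v)
{
    return (uint64_t(v.hi) << 32) | v.lo;
}
```

Comparing Join(ModelRotateLeft64<R>(w)) against RefRotateLeft64(x, R) for R = 0, 1, 31, 32, 33 and 63 covers the boundary cases where the word swap and the zero shift matter.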