I have this v6.16b register: 0a,0b,0c,0d,0e,0f,07,08,0a,0b,0c,0d,0e,0f,07,08
and the goal is: ab,cd,ef,78,ab,cd,ef,78
I did it like this:
movi v7.8h, 0x04 // 04,00,04,00,04,00,04,00,04,00,04,00,04,00,04,00
ushl v6.16b, v6.16b, v7.16b // a0,0b,c0,0d,e0,0f,70,08,a0,0b,c0,0d,e0,0f,70,08
movi v8.8h, 0xf8 // f8,00,f8,00,f8,00,f8,00,f8,00,f8,00,f8,00,f8,00
ushl v10.8h, v6.8h, v8.8h // 0b,00,0d,00,0f,00,08,00,0b,00,0d,00,0f,00,08,00
orr v10.16b, v10.16b, v6.16b // ab,0b,cd,0d,ef,0f,78,08,ab,0b,cd,0d,ef,0f,78,08
mov v10.b[1], v10.b[2]
mov v10.b[2], v10.b[4]
mov v10.b[3], v10.b[6]
mov v10.b[4], v10.b[8]
mov v10.b[5], v10.b[10]
mov v10.b[6], v10.b[12]
mov v10.b[7], v10.b[14] // ab,cd,ef,78,ab,cd,ef,78,ab,0b,cd,0d,ef,0f,78,08
It works, but is there a way to do it with fewer instructions? (in particular the mov instructions)
So you have zero-extended nibbles, unpacked in big-endian order, that you want to pack into bytes?

Like for `strtol`-style hex -> integer conversion, after some initial processing to map ASCII hex digits to the integer values they represent.
For your original setup, where you want to pack the bytes from the even positions, a single `uzp1` can replace all the `mov` instructions (sketch below). But you can optimize the shift/orr step as well.
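A minimal sketch of that (the exact operand choice is mine), picking up your v10 right after the orr; the comment uses the same least-significant-byte-first notation as your listing:

uzp1 v10.16b, v10.16b, v10.16b // ab,cd,ef,78,ab,cd,ef,78,ab,cd,ef,78,ab,cd,ef,78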
Instead of the first block of 2x `ushl` + `orr`, maybe `shl v10.8h, v6.8h, #12` / `orr` to get the bytes you want in the odd elements, garbage (unmodified) in the even elements. (Counting from 0, the `0a` element, since I think you're writing your vectors in least-significant-first order, where wider left shifts move data to the right across byte boundaries.) Or better, `sli v6.8h, v6.8h, #12` (Shift Left and Insert, where bits keep their original values in the positions where the left shift created zeros).

For the packing step, `uzp2` should work to take the odd-numbered vector elements (starting with 1) and pack them down into the low 8 bytes. (Repeated in the high 8 bytes if you use the same vector as both source operands.)
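Putting those two together, a sketch with your register numbers (comments again in least-significant-byte-first order):

sli  v6.8h, v6.8h, #12        // 0a,ab,0c,cd,0e,ef,07,78, ... : packed byte now in each odd position
uzp2 v10.16b, v6.16b, v6.16b  // ab,cd,ef,78,ab,cd,ef,78, repeated in the high 8 bytes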
(I notice you have an `e0` byte. `(0xe0 as u16) << 12` shifts out all the set bits, leaving 0, if that wasn't a typo for `0x0e`.)

This leaves your data in big-endian byte order, if that was the order across pairs of nibbles. You might need a byte-shuffle `tbl` instead of `uzp2` to reverse the order into a `uint64_t` while packing. Or, if you're only doing this for one number at a time (so loading a shuffle-control constant would take another instruction that can't be hoisted out of a loop), perhaps `rev64 v10.8b, v10.8b` after `uzp2`. Or `rev64` with `v10.16b` to do two u64 integers in the two halves of the vector.
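For example, a sketch of the `rev64` option (assuming the `uzp2` result from above, and that you want both u64 halves byte-reversed):

uzp2  v10.16b, v6.16b, v6.16b  // ab,cd,ef,78,ab,cd,ef,78 in each 64-bit half
rev64 v10.16b, v10.16b         // 78,ef,cd,ab,78,ef,cd,ab in each 64-bit half
// or rev64 v10.8b, v10.8b if you only care about the low u64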
For packing pairs of bytes, a shift-right-and-accumulate (`usra` by `#4`) can also do the combining step in one instruction, since ORR, ADD, and insert are equivalent when the set bits don't overlap. But it would give you `0xba` not `0xab`, shifting the second byte down to become the high half of a u8. `rev16` + `usra` would work, but `shl` + `orr` is also 2 instructions and probably cheaper, probably running on more execution units on at least some CPUs. And `sli` is even better, thanks @fuz.

There is no `usla`. A multiply-accumulate could be used with a power-of-2 multiplier, but might be slower on some CPUs than `shl` + `orr`, and would require a vector constant. And certainly worse than `sli`.
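A sketch of that `rev16` + `usra` variant, mostly to show why `sli` wins (the trailing `uzp1` is my addition, since the packed byte lands in the even position here):

rev16 v6.16b, v6.16b          // swap the bytes of each halfword: 0b,0a,0d,0c,0f,0e,08,07, ...
usra  v6.8h, v6.8h, #4        // each halfword += itself >> 4: ab,0a,cd,0c,ef,0e,78,07, ...
uzp1  v10.16b, v6.16b, v6.16b // ab,cd,ef,78,ab,cd,ef,78, repeated in the high 8 bytes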