How to perform an arithmetic right shift on a 64-bit number stored across two 32-bit registers in ARM?


I am trying to implement Booth's multiplication algorithm in ARM assembly language.

Algorithm 2: Booth's Algorithm to multiply two 32-bit numbers to produce a 64-bit result
Data: Multiplier in V, U = 0, Multiplicand in N
Result: The lower 64 bits of UV contain the result
 1   i ← 0
 2   prevBit ← 0
 3   for i < 32 do
 4       i ← i + 1
 5       currBit ← LSB of V
 6       if (currBit, prevBit) = (1, 0) then
 7           U ← U − N
 8       end
 9       else if (currBit, prevBit) = (0, 1) then
10           U ← U + N
11       end
12       prevBit ← currBit
13       UV ← UV ≫ 1 (arithmetic right shift)
14   end
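To make sure I'm reading the pseudocode right, here is roughly the C I am trying to translate (just a sketch of my understanding; the function name booth_mul is only for illustration, and it assumes >> on signed values is an arithmetic shift, as GCC and clang implement it):

#include <stdint.h>

// U:V is treated as one 64-bit value: U is the high 32 bits, V the low 32 bits.
int64_t booth_mul(int32_t multiplicand, int32_t multiplier)
{
    int32_t  N = multiplicand;
    int32_t  U = 0;
    uint32_t V = (uint32_t)multiplier;
    uint32_t prevBit = 0;

    for (int i = 0; i < 32; i++) {
        uint32_t currBit = V & 1;               // LSB of V
        if (currBit == 1 && prevBit == 0)
            U -= N;
        else if (currBit == 0 && prevBit == 1)
            U += N;
        prevBit = currBit;

        // step 13: the 64-bit arithmetic right shift I'm asking about
        V = (V >> 1) | ((uint32_t)U << 31);     // old LSB of U becomes the MSB of V
        U = U >> 1;                             // arithmetic shift keeps the sign
    }
    // reassemble U:V into one 64-bit result
    return (int64_t)(((uint64_t)(uint32_t)U << 32) | V);
}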

How do I perform the 13th step of the algorithm? That is, how do I perform an ASR on a 64-bit number stored as two 32-bit halves in two registers?

I have tried performing ASR on both registers, and then replacing the MSB of the lower 32 bits with the LSB of the upper 32 bits (saved before the shift).

1 answer

Answered by Peter Cordes:

Ask a compiler: int64_t asr_1(int64_t a){ return a>>1; }.
Godbolt compiler explorer with GCC and clang for ARM

In the standard calling convention, the first arg and the return value are both in R1:R0 so compilers will make asm that operates in-place. For a shift-count of 1, Clang uses the Carry flag to get the bit from the bottom of the high half to the top of the low half.

// clang -O2 -Wall -mcpu=cortex-a77
        asrs    r1, r1, #1         @ set flags, including C from the bit shifted out
        rrx     r0, r0             @ rotate-through-carry, shifting C into the top

GCC doesn't use the carry flag; instead it uses the same strategy that both it and clang use for shift counts from 2 to 31. The low half of an int64_t is unsigned; a logical shift leaves zeros where you can OR in some bits from the high half. This strategy is less efficient than the rrx trick for a count of 1, unless rrx is slow on some CPUs.

// return a>>5   with clang; GCC is similar but uses a MOV to copy R1 and OR last
        lsr     r0, r0, #5                  @ lo >>= 5
        orr     r0, r0, r1, lsl #27         @ lo|=high<<(32-5)  to shift bits between them
        asr     r1, r1, #5                  @ hi >>= 5
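In terms of the two 32-bit halves, that sequence is doing the following (a minimal C sketch, not the compiler's output; it assumes >> on signed values is an arithmetic shift, as GCC and clang implement it, and the function name is made up):

#include <stdint.h>

// count = 5 case: shift each half, then move the 5 bits that fall off the
// bottom of the high half into the (now zero) top of the low half.
static void asr64_by_5(uint32_t *lo, int32_t *hi)
{
    *lo = (*lo >> 5) | ((uint32_t)*hi << 27);   // lsr #5, then orr with hi lsl #27
    *hi = *hi >> 5;                             // asr #5: arithmetic, keeps the sign
}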

Clang with -mthumb strangely uses asrs.w before rrx, but still uses plain (16-bit) asrs for return a>>(32+5), which is easy: asrs r0, r1, #5 ; asrs r1, r1, #31 (the high half of the return value is either all-0 or all-1, according to the sign bit).
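Seen on the two halves, that large-count case is simply (again only a sketch with a made-up function name, assuming arithmetic >> on signed values):

#include <stdint.h>

// count = 32+5 case: every bit of the result comes from the high half.
static void asr64_by_37(uint32_t *lo, int32_t *hi)
{
    *lo = (uint32_t)(*hi >> 5);   // asrs r0, r1, #5
    *hi = *hi >> 31;              // asrs r1, r1, #31: 0 or -1, according to the sign bit
}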

AFAICT there's no correctness reason for using asrs.w: asrs is encodable as a 16-bit Thumb instruction, although rrx is only available with Thumb 2. None of this needs ARMv8 or anything; in fact clang -march=armv4t still uses this. I just tend to use -mcpu=cortex-a77 or a53 because they're recent and easy to think of, because I want to know if there are any new tricks, and for tuning choices appropriate for modern ARM cores. cortex-m3 or m0 are also relevant for some projects; M0 notably lacks most Thumb 2 encodings:

// GCC or Clang  -mcpu=cortex-m0
        lsls    r2, r1, #31        // low bit of the high half, moved to the top of r2
        lsrs    r0, r0, #1         // lo >>= 1 (logical), leaving the top bit zero
        adds    r0, r0, r2         // add it in; no bits overlap, so this acts like OR
        asrs    r1, r1, #1         // hi >>= 1 (arithmetic)

Variable-count shifts are harder, but compilers can still show you how, using predicated execution. GCC with -marm produces the shortest version, shorter than any of clang's, for int64_t asr(int64_t a, int c){ return a>>c; }

// GCC -O3 -mcpu=cortex-a77   for a variable-count shift.  Count in R2
        lsr     r0, r0, r2
        rsb     r3, r2, #32           // 32-count
        subs    ip, r2, #32           // count-32 and set flags
        orr     r0, r0, r1, lsl r3
        orrpl   r0, r0, r1, asr ip    // ORR if PLus (if Negative flag == 0)
        asr     r1, r1, r2
        bx      lr

ARM register-controlled shifts with counts >= 32 shift out all the bits, producing zero for logical shifts. So for counts over 32, where the bits we need come from the other path, 32-count wraps to a shift amount >= 32 and the r1, lsl r3 operand is just zero, I think; the orrpl with r1, asr ip supplies those bits instead. And asr r1, r1, r2 to set the high half works like asr r1, #31 for r2 == 32 or higher; it still uses the original input shift count.
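To spell out why the count >= 32 cases fall out correctly, here is that sequence modelled in C on two 32-bit halves (a sketch only: the helper names are made up, ARM's register-shift behaviour is emulated explicitly because C leaves shifts by 32 or more undefined, and signed >> is assumed to be arithmetic):

#include <stdint.h>

// ARM register-controlled shifts use the bottom byte of the count register:
// LSL/LSR by 32..255 give 0, ASR by 32..255 gives a word full of the sign bit.
static uint32_t lsr32(uint32_t x, uint32_t n) { return n >= 32 ? 0 : x >> n; }
static uint32_t lsl32(uint32_t x, uint32_t n) { return n >= 32 ? 0 : x << n; }
static int32_t  asr32(int32_t  x, uint32_t n) { return n >= 32 ? x >> 31 : x >> n; }

// Variable-count 64-bit ASR on a hi:lo pair, step for step like the GCC -marm code.
// Valid for counts 1..63.
static void asr64_var(uint32_t *lo, int32_t *hi, uint32_t count)
{
    uint32_t low = lsr32(*lo, count);                 // lsr    r0, r0, r2
    uint32_t r3  = (32u - count) & 0xff;              // rsb    r3, r2, #32  (only the low byte matters)
    int32_t  ip  = (int32_t)count - 32;               // subs   ip, r2, #32  (sets the N flag)
    low |= lsl32((uint32_t)*hi, r3);                  // orr    r0, r0, r1, lsl r3
    if (ip >= 0)                                      // orrpl: runs only when count >= 32
        low |= (uint32_t)asr32(*hi, (uint32_t)ip);    //        r0 |= hi >> (count - 32)
    *hi = asr32(*hi, count);                          // asr    r1, r1, r2
    *lo = low;
}

For count = 40, for example, 32-count wraps to a low byte of 248, so the lsl contributes nothing and the orrpl path provides hi >> 8; for count = 5 the orrpl is skipped and the lsl provides the carried-down bits.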