If zero/sign extension is a no-op on RISC-V, why are there separate instructions for each operand size?


On x86 and x64, compilers emit explicit zero- and sign-extension instructions (MOVZX and MOVSX). The extension itself is not free, but it helps the processor's out-of-order machinery run faster.

But on RISC-V:

Consequently, conversion between unsigned and signed 32-bit integers is a no-op, as is conversion from a signed 32-bit integer to a signed 64-bit integer.

A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts to ensure reasonable performance for 32-bit values.

(C) RISC-V Spec

But at the same time, modern 64-bit RISC-V processors contain instructions for 32-bit signed integers. Why? To increase performance? And where, then, are the 8-bit and 16-bit versions? I no longer understand anything.


There are 3 answers

Palmer Dabbelt On

This is one of those cases where the ABI starts to bleed into the ISA. You'll find a handful of these floating around in RISC-V. Because we had a pretty significant software stack ported by the time we standardized the ISA, we got to fine-tune the ISA to match real code. That mattered because an explicit goal of the base RISC-V ISAs was to keep a lot of encoding space available for future expansion.

In this case, the ABI design decision is to answer the question "Is there a canonical representation of types that, when stored in registers, do not need every bit pattern provided by those registers in order to represent every value representable by the type?" In the case of RISC-V we chose to mandate a canonical representation for all types. There's a feedback loop here with some ISA design decisions and I think the best way to go about this is to work through an example of what ISA would have co-evolved with an ABI where we didn't mandate a canonical representation.

As a thought exercise, let's assume that the RISC-V ABI did not mandate a canonical representation for the high bits of int when stored in an X register on RV64I. The result here is that the existing W family of instructions wouldn't be particularly useful: you can use addiw t0, t0, 0 as a sign extension so the compiler can then rely on what's in the high-order bits, but that adds an additional instruction to many common patterns like compare+branch. The correct ISA design decision to make here would be to have a different set of W instructions, something like "compare on the low 32 bits and branch". If you run the numbers, you end up with about the same number of additional instructions (branch and set as opposed to add, sub, and shift). The issue is that the branch instructions are much more expensive in terms of encoding space because they have much longer offsets. Since encoding space is considered an important resource in RISC-V, when there is no clear performance advantage we tend to choose the design decision that conserves more encoding space. In this case there's no meaningful performance distinction as long as the ABI matches the ISA.
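
To make the addiw t0, t0, 0 trick concrete, here is a minimal C model of what that instruction computes. The names sext_w and check_sext_w, and the use of uint64_t to stand in for an X register, are assumptions made for this sketch rather than anything taken from the spec or the ABI documents.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `addiw rd, rs, 0`: keep the low 32 bits of a
   64-bit register image and copy bit 31 into bits 63..32. */
uint64_t sext_w(uint64_t reg)
{
    uint64_t low = reg & 0xFFFFFFFFu;
    /* Flip bit 31, then subtract 2^31: a portable way to sign-extend
       without relying on implementation-defined casts or shifts. */
    return (low ^ 0x80000000u) - 0x80000000u;
}

/* Whatever the upper bits held before, they are canonical afterwards. */
void check_sext_w(void)
{
    assert(sext_w(UINT64_C(0xDEADBEEF00000005)) == UINT64_C(5));
    assert(sext_w(UINT64_C(0x00000000FFFFFFFF)) == UINT64_C(0xFFFFFFFFFFFFFFFF));
    assert(sext_w(UINT64_C(0x000000007FFFFFFF)) == UINT64_C(0x000000007FFFFFFF));
}
```

Without a canonical representation, a compiler would have to emit this extra step before any 64-bit compare whose result depends on the upper bits.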

There's a second-order design decision to be made here: is the canonical representation to sign extend or to zero extend? There's a trade-off here: sign extension results in faster software (for the same amount of encoding space used), but more complicated hardware. Specifically, the common C fragment

 long func_pos();
 long func_neg();

 long neg_or_pos(int a) {
     if (a > 0) return func_pos();
     return func_neg();
 }

compiles very efficiently when sign extension is used

neg_or_pos:
    bgtz    a0,.L4
    tail    func_neg
.L4:
    tail    func_pos

but is slower when zero-extension is used (again, assuming we're unwilling to blow a lot of encoding space on word-sized compare+branch instructions)

neg_or_pos:
    addiw   a0, a0, 0
    bgtz    a0,.L4
    tail    func_neg
.L4:
    tail    func_pos
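
The difference between the two listings can be sketched in C. This is only a model under stated assumptions: int64_t stands in for a 64-bit register, and the helper names (hold_sign_extended, hold_zero_extended, bgtz_taken, addiw_zero, check_branch) are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two hypothetical ways an ABI could hold `int a` in a 64-bit register. */
int64_t hold_sign_extended(int32_t a) { return (int64_t)a; }
int64_t hold_zero_extended(int32_t a) { return (int64_t)(uint32_t)a; }

/* `bgtz` compares the full 64-bit register against zero as a signed value. */
bool bgtz_taken(int64_t reg) { return reg > 0; }

/* Model of `addiw a0, a0, 0`: re-derive bits 63..32 from bit 31. */
int64_t addiw_zero(int64_t reg)
{
    int64_t low = (int64_t)((uint64_t)reg & 0xFFFFFFFFu);
    return (low ^ INT64_C(0x80000000)) - INT64_C(0x80000000);
}

void check_branch(void)
{
    /* Sign-extended: the branch can test the register directly. */
    assert(!bgtz_taken(hold_sign_extended(-1)));
    /* Zero-extended: -1 looks like 4294967295, so the branch goes wrong... */
    assert(bgtz_taken(hold_zero_extended(-1)));
    /* ...unless the extra addiw repairs the upper bits first. */
    assert(!bgtz_taken(addiw_zero(hold_zero_extended(-1))));
    assert(bgtz_taken(addiw_zero(hold_zero_extended(7))));
}
```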

When we balanced things out, it appeared that the software cost of zero extension was higher than the hardware cost of sign extension: for the smallest possible design (i.e., a microcoded implementation) you still need an arithmetic right shift so you don't lose any datapath, and for the biggest possible design (i.e., a wide out-of-order core) the code would just end up shuffling bits around before branching. Oddly enough, the one place you pay a meaningful cost for sign extension is in in-order machines with short pipelines: you could shave a MUX delay off the ALU path, which is critical in some designs. In practice there are a lot of other places where sign extension is the right decision to make, so just changing this one wouldn't result in the removal of that datapath.

Margaret Bloom On

The full quote seems clear to me:

The compiler and calling convention maintain an invariant that all 32-bit values are held in a sign-extended format in 64-bit registers. Even 32-bit unsigned integers extend bit 31 into bits 63 through 32.

Consequently, conversion between unsigned and signed 32-bit integers is a no-op, as is conversion from a signed 32-bit integer to a signed 64-bit integer.
Existing 64-bit wide SLTU and unsigned branch compares still operate correctly on unsigned 32-bit integers under this invariant.
Similarly, existing 64-bit wide logical operations on 32-bit sign-extended integers preserve the sign-extension property.

A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts to ensure reasonable performance for 32-bit values.

It says that 32-bit values are stored in 64-bit registers with their MSb (Most Significant bit) repeated through bits 32-63.
This is done for both signed and unsigned integers.

This allows a few optimisations as outlined in the quote:

  • Unsigned <-> signed conversion is free.
    Compare this to the usual algorithm, where you have to zero- or sign-extend the low 32-bit value to promote it to a 64-bit value of different "sign-ness" (ignoring overflow).
  • Signed 32-bit <-> Signed 64-bit is free.
    This spares a sign extension.
  • Branches and set instructions still work.
    This is because repeating the MSb doesn't change the result of the comparison.
  • Logical 64-bit operations preserve this property.
    It's easy to see this after a couple of examples.
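
The last two points can be checked with a few lines of C. This is a rough sketch: canon models how a 32-bit value would sit in a 64-bit register under the invariant, and the names canon, invariant_holds and check_invariant are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Register image of a 32-bit value under the invariant:
   bit 31 is repeated through bits 63..32. */
uint64_t canon(uint32_t v)
{
    return ((uint64_t)v ^ 0x80000000u) - 0x80000000u;
}

/* Returns nonzero when full-width operations on the register images
   agree with the 32-bit operations they stand in for. */
int invariant_holds(uint32_t a, uint32_t b)
{
    uint64_t ra = canon(a), rb = canon(b);
    return ((ra < rb) == (a < b))            /* 64-bit unsigned compare */
        && ((ra & rb) == canon(a & b))       /* AND stays canonical     */
        && ((ra | rb) == canon(a | b))       /* OR stays canonical      */
        && ((ra ^ rb) == canon(a ^ b));      /* XOR stays canonical     */
}

void check_invariant(void)
{
    assert(canon(0x80000000u) == UINT64_C(0xFFFFFFFF80000000));
    assert(invariant_holds(0x7FFFFFFFu, 0x80000000u));
    assert(invariant_holds(0xFFFFFFFFu, 1u));
    assert(invariant_holds(0x80000000u, 0xFFFFFFFFu));
}
```

The unsigned compare works because the mapping is monotonic: values with bit 31 set all land above values with bit 31 clear, and keep their relative order.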

However, addition (to name one) doesn't preserve this invariant: 0x000000007fffffff + 0x0000000000000001 = 0x0000000080000000, where bit 31 is set but bits 63 through 32 are clear, which violates the invariant.

Since a) working with 32-bit values happens very often and b) fixing the result would require additional work (I can think of using a slli/srai pair) a new format of instructions has been introduced.
These instructions operate on 64-bit registers but only use their lower 32-bit value and will sign-extend the 32-bit result.
This is easily done in hardware so it's worth having this new class of instruction.
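
As a sketch of the difference, the following C models a plain 64-bit ADD next to an ADDW-style add that re-extends its result. The function names are assumptions for illustration, and sext32 uses an XOR/subtract idiom in place of the slli/srai pair so that the C stays free of implementation-defined shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Same effect as `slli rd, rs, 32` followed by `srai rd, rd, 32`. */
uint64_t sext32(uint64_t reg)
{
    uint64_t low = reg & 0xFFFFFFFFu;
    return (low ^ 0x80000000u) - 0x80000000u;
}

/* True when bits 63..32 all equal bit 31. */
int is_canonical(uint64_t reg) { return reg == sext32(reg); }

/* Plain 64-bit ADD: correct as a 64-bit sum, may break the invariant. */
uint64_t add64(uint64_t a, uint64_t b) { return a + b; }

/* ADDW-style add: 32-bit sum, sign-extended into the full register. */
uint64_t addw(uint64_t a, uint64_t b) { return sext32(a + b); }

void check_add(void)
{
    uint64_t a = UINT64_C(0x7FFFFFFF), b = UINT64_C(1);
    assert(add64(a, b) == UINT64_C(0x0000000080000000));
    assert(!is_canonical(add64(a, b)));
    assert(addw(a, b) == UINT64_C(0xFFFFFFFF80000000));
    assert(is_canonical(addw(a, b)));
}
```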

As noted in the comments, 8 and 16-bit arithmetic is rare so no engineering effort has been spent on finding new room for it (both in terms of the gates required and of the opcode space used).

Davislor On

To expand on the accepted answer’s comment that “8 and 16-bit arithmetic is rare”: some of the most common computer languages are designed not to need it, because popular ISAs of the past did not have it.

C specifies that any operand narrower than an int gets “promoted” to int when doing any arithmetic on it. On RISC-V, an int is 32 bits wide. The LB/LBU and LH/LHU instructions choose between sign-extending and zero-extending a byte or halfword as it is loaded from memory, which covers signed char, unsigned char, short and unsigned short.
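
A small C illustration of the promotion rule, assuming a 32-bit int as on RISC-V. The function names are hypothetical, and the comments about which load instruction a compiler would pick describe typical behavior rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Widening a narrow value to int: the sign-extending cases correspond to
   LB/LH, the zero-extending cases to LBU/LHU. */
int widen_i8(int8_t v)   { return v; }   /* like LB  */
int widen_u8(uint8_t v)  { return v; }   /* like LBU */
int widen_i16(int16_t v) { return v; }   /* like LH  */
int widen_u16(uint16_t v){ return v; }   /* like LHU */

/* Both operands are promoted to int first, so the sum is computed at
   int width and does not wrap at 8 bits. */
int sum_bytes(uint8_t a, uint8_t b) { return a + b; }

void check_promotion(void)
{
    assert(widen_i8(-1) == -1);
    assert(widen_u8(0xFF) == 255);
    assert(widen_i16(-2) == -2);
    assert(widen_u16(0xFFFE) == 65534);
    assert(sum_bytes(200, 100) == 300);
}
```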

C-family languages don’t need any support for 8-bit or 16-bit math beyond that. For common cases like some_unsigned_short += 1, it might be somewhat useful to have some kind of hypothetical ADDIH that automatically truncates the result. However, that’s just one extra step: a bitmask by 0xFFFF (on base RV64I, a shift pair, since 0xFFFF does not fit ANDI's 12-bit immediate). Expressions like some_signed_short -= 1 don’t need anything more exotic to be “correct,” or at least for their compilers to technically comply with the language Standard: the subtraction itself happens in int after promotion, so it cannot overflow, and converting an out-of-range result back to short is implementation-defined, so the implementation is free to define the narrowing in whatever way is cheapest for it.
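
Here is roughly what the unsigned short case looks like in C, assuming a 32-bit int. The function names are invented for this sketch; ADDIH itself is the hypothetical instruction mentioned above, not a real one.

```c
#include <assert.h>
#include <stdint.h>

/* x is promoted to int, incremented at int width, then converted back to
   uint16_t, which reduces the value modulo 65536. */
uint16_t inc_u16(uint16_t x)
{
    return (uint16_t)(x + 1);
}

/* The same thing with the truncation spelled out as a mask, which is the
   step a hypothetical ADDIH would fold into the add. */
uint16_t inc_u16_masked(uint16_t x)
{
    return (uint16_t)((x + 1) & 0xFFFF);
}

void check_inc(void)
{
    assert(inc_u16(41) == 42);
    assert(inc_u16(0xFFFF) == 0);          /* wraps around */
    assert(inc_u16_masked(0xFFFF) == 0);
    assert(inc_u16_masked(0x1234) == inc_u16(0x1234));
}
```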