FPU operations generated by GCC during casting integer to float


I want to perform division on an FPU in C (using integer values):

float foo;
uint32_t *ptr1, *ptr2;
foo = (float)*(ptr1) / (float)*(ptr2);

And in NASM syntax (disassembled from an object file compiled with GCC) it has the following representation:

    mov     rax, QWORD [ptr1]
    mov     eax, DWORD [rax]
    mov     eax, eax
    test    rax, rax
    js      ?_001
    pxor    xmm0, xmm0
    cvtsi2ss xmm0, rax
    jmp     ?_002

?_001:
    mov     rdx, rax
    shr     rdx, 1
    and     eax, 01H
    or      rdx, rax
    pxor    xmm0, xmm0
    cvtsi2ss xmm0, rdx
    addss   xmm0, xmm0
?_002:
    mov     rax, QWORD [ptr2]

; ... for ptr2 pattern repeats

What does this "black magic" under ?_001 mean? Isn't cvtsi2ss alone enough to convert an integer to a float?

There are 2 answers
Anty (accepted answer)

In general, cvtsi2ss does the trick: it converts a scalar integer (other sources call it a "doubleword integer to scalar single" conversion, but this naming is consistent with the other vector instructions) to a scalar single-precision float. But it expects a signed integer.

So this code

mov     rdx, rax                                
shr     rdx, 1                                  
and     eax, 01H                                
or      rdx, rax                                
pxor    xmm0, xmm0                              
cvtsi2ss xmm0, rdx                              
addss   xmm0, xmm0  

helps convert a value that is too large for a signed conversion (note the js jump: if the sign bit is set, this code is executed; otherwise it is skipped). For the 64-bit register tested here, the sign bit is set when the value is greater than 0x7FFFFFFFFFFFFFFF.

So the "magic" code does:

mov     rdx, rax       ; copy the value loaded from ptr1
shr     rdx, 1         ; divide by 2 - a logical shift, not arithmetic, because the value is unsigned
and     eax, 01H       ; save the least significant bit
or      rdx, rax       ; fold that bit back into the halved value so the final result rounds correctly
pxor    xmm0, xmm0
cvtsi2ss xmm0, rdx     ; the halved value now fits in the signed range
addss   xmm0, xmm0     ; add to itself = multiply by 2
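In C, the halve-and-double pattern can be sketched like this (a hypothetical helper illustrating what the generated code does; this sequence is really the generic unsigned-64-bit conversion pattern):

```c
#include <stdint.h>

/* Sketch of the halve-and-double trick for values with the top bit set:
 * shift right by one (folding the lost low bit back in for correct
 * rounding), convert the now-in-range value, then double the result. */
static float u64_to_float(uint64_t x)
{
    if ((int64_t)x >= 0)                 /* top bit clear: fits in signed range */
        return (float)(int64_t)x;
    uint64_t half = (x >> 1) | (x & 1);  /* halve, keeping the low bit */
    return (float)(int64_t)half * 2.0f;  /* convert, then multiply by 2 */
}
```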

I'm not sure which compiler and compile options you used; with optimization enabled, GCC simply does

cvtsi2ssq       xmm0, rbx
cvtsi2ssq       xmm1, rax
divss   xmm0, xmm1

I hope it helps.

Cody Gray - on strike

You must be looking at unoptimized code. That is a waste of time. When the optimizer is disabled, compilers generate a bunch of nonsense code for various reasons—to achieve a faster compilation speed, to make it easier to set breakpoints on lines of source, to make it easier to catch bugs, etc.

When you generate optimized code on a compiler targeting x86-64, all of this noise goes away, the code becomes much more efficient, and by consequence, much easier to interpret/understand.

Here's a function that performs the operation you want. I've written it as a function so that I can pass the inputs as opaque parameters and the compiler can't optimize it away.

float DivideAsFloat(uint32_t *ptr1, uint32_t *ptr2)
{
    return (float)(*ptr1) / (float)(*ptr2);
}

Here's the object code that all versions of GCC (back to 4.9.0) generate for this function:

DivideAsFloat(unsigned int*, unsigned int*):
    mov        eax, DWORD PTR [rdi]   ; retrieve value of 'ptr1' parameter
    pxor       xmm0, xmm0             ; zero-out xmm0 register
    pxor       xmm1, xmm1             ; zero-out xmm1 register
    cvtsi2ssq  xmm0, rax              ; convert *ptr1 into a floating-point value in XMM0
    mov        eax, DWORD PTR [rsi]   ; retrieve value of 'ptr2' parameter
    cvtsi2ssq  xmm1, rax              ; convert *ptr2 into a floating-point value in XMM1
    divss      xmm0, xmm1             ; divide the two floating-point values
    ret

This is almost exactly what you would expect to see. The only "black magic" here is the PXOR instructions. Why is the compiler bothering to pre-zero the XMM registers before executing the CVTSI2SS instruction that's just going to clobber them anyway? Well, because CVTSI2SS only partially clobbers its destination register. Specifically, it clobbers only the lower bits, leaving the upper bits untouched. This results in a false dependency on the upper bits, which results in execution stalls. This dependency can be broken by pre-zeroing the register, thus preventing the possibility of stalls and speeding up execution. The PXOR instruction is a fast, efficient way to clear a register. (I talked about this exact same phenomenon recently over here—see last paragraph.)
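The same dependency-breaking idiom is visible at the intrinsics level. A minimal sketch (the helper name is mine; `_mm_cvtsi64_ss` requires an x86-64 target):

```c
#include <immintrin.h>
#include <stdint.h>

/* The first argument of _mm_cvtsi64_ss supplies the upper lanes of the
 * destination register; passing a freshly zeroed vector mirrors the
 * compiler's PXOR idiom for breaking the false dependency. */
static float u64_to_float_intrin(int64_t v)
{
    __m128 zero = _mm_setzero_ps();      /* pxor  xmm0, xmm0 */
    __m128 r = _mm_cvtsi64_ss(zero, v);  /* cvtsi2ssq xmm0, rax */
    return _mm_cvtss_f32(r);             /* extract the low float */
}
```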

In fact, older versions of GCC (prior to 4.9.0) did not perform this optimization and thus generated code that did not include the PXOR instructions. It looks more efficient, but it actually runs slower.

DivideAsFloat(unsigned int*, unsigned int*):
    mov        eax, DWORD PTR [rdi]   ; retrieve value of 'ptr1' parameter
    cvtsi2ssq  xmm0, rax              ; convert *ptr1 into a floating-point value in XMM0
    mov        eax, DWORD PTR [rsi]   ; retrieve value of 'ptr2' parameter
    cvtsi2ssq  xmm1, rax              ; convert *ptr2 into a floating-point value in XMM1
    divss      xmm0, xmm1             ; divide the two floating-point values
    ret

Clang 3.9 emits the same code as these older versions of GCC. It doesn't know about the optimization, either. MSVC does know about it (since VS 2010), and so do modern versions of ICC (verified on ICC 16 and later; missing in ICC 13).

However, that's not to say that Anty's answer (and Mystical's comment) is entirely incorrect. CVTSI2SS is indeed designed for converting a signed integer to a scalar single-precision float, not an unsigned integer like you have here. So what gives? Well, a 64-bit processor has 64-bit wide registers available, so the unsigned 32-bit input values can be stored as signed 64-bit intermediate values, which allows CVTSI2SS to be used after all.
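In C, the widening trick amounts to the following (the helper name is mine):

```c
#include <stdint.h>

/* Widen the unsigned 32-bit value to a signed 64-bit value first; the
 * intermediate is always non-negative, so the signed conversion
 * (cvtsi2ssq on x86-64) handles it directly. */
static float u32_to_float_widen(uint32_t v)
{
    return (float)(int64_t)v;
}
```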

Compilers do that when optimization is enabled because it results in more efficient code. If, on the other hand, you were targeting 32-bit x86 and didn't have 64-bit registers available, the compiler would have to deal with the signed vs. unsigned problem. Here's how GCC 6.3 deals with it:

DivideAsFloat(unsigned int*, unsigned int*):
    sub       esp,  4                 
    pxor      xmm0, xmm0              
    mov       eax,  DWORD PTR [esp+8] 
    pxor      xmm1, xmm1              
    movss     xmm3, 1199570944        
    pxor      xmm2, xmm2              
    mov       eax,  DWORD PTR [eax]   
    movzx     edx,  ax                
    shr       eax,  16                
    cvtsi2ss  xmm0, eax               
    mov       eax,  DWORD PTR [esp+12]
    cvtsi2ss  xmm1, edx               
    mov       eax,  DWORD PTR [eax]   
    movzx     edx,  ax                
    shr       eax,  16                
    cvtsi2ss  xmm2, edx               
    mulss     xmm0, xmm3              
    addss     xmm0, xmm1              
    pxor      xmm1, xmm1              
    cvtsi2ss  xmm1, eax               
    mulss     xmm1, xmm3
    addss     xmm1, xmm2
    divss     xmm0, xmm1
    movss     DWORD PTR [esp], xmm0
    fld       DWORD PTR [esp]
    add       esp,  4
    ret

This is a bit difficult to follow because of how the optimizer rearranged and interleaved instructions. Here, I've "unoptimized" it, reordering the instructions and breaking them up into more logical groups, in hopes of making it easier to follow the flow of execution. (The only instruction I removed was a dependency-breaking PXOR—the rest of the code is the same, just rearranged.)

DivideAsFloat(unsigned int*, unsigned int*):
  ;;; Initialization ;;;
  sub       esp,  4           ; reserve 4 bytes on the stack

  pxor      xmm0, xmm0        ; zero-out XMM0
  pxor      xmm1, xmm1        ; zero-out XMM1
  pxor      xmm2, xmm2        ; zero-out XMM2
  movss     xmm3, 1199570944  ; load the constant 65536.0f (bit pattern 0x47800000) into XMM3


  ;;; Deal with the first value ('ptr1') ;;;
  mov       eax,  DWORD PTR [esp+8]  ; get the pointer specified in 'ptr1'
  mov       eax,  DWORD PTR [eax]    ; dereference the pointer specified by 'ptr1'
  movzx     edx,  ax                 ; put the lower 16 bits of *ptr1 in EDX
  shr       eax,  16                 ; move the upper 16 bits of *ptr1 down to the lower 16 bits in EAX
  cvtsi2ss  xmm0, eax                ; convert the upper 16 bits of *ptr1 to a float
  cvtsi2ss  xmm1, edx                ; convert the lower 16 bits of *ptr1 (now in EDX) to a float

  mulss     xmm0, xmm3               ; multiply FP-representation of upper 16 bits of *ptr1 by 65536.0
  addss     xmm0, xmm1               ; add the result to the FP-representation of *ptr1's lower 16 bits


  ;;; Deal with the second value ('ptr2') ;;;
  mov       eax,  DWORD PTR [esp+12] ; get the pointer specified in 'ptr2'
  mov       eax,  DWORD PTR [eax]    ; dereference the pointer specified by 'ptr2'
  movzx     edx,  ax                 ; put the lower 16 bits of *ptr2 in EDX
  shr       eax,  16                 ; move the upper 16 bits of *ptr2 down to the lower 16 bits in EAX
  cvtsi2ss  xmm2, edx                ; convert the lower 16 bits of *ptr2 (now in EDX) to a float
  cvtsi2ss  xmm1, eax                ; convert the upper 16 bits of *ptr2 to a float

  mulss     xmm1, xmm3               ; multiply FP-representation of upper 16 bits of *ptr2 by 65536.0
  addss     xmm1, xmm2               ; add the result to the FP-representation of *ptr2's lower 16 bits


  ;;; Do the division, and return the result ;;;
  divss     xmm0, xmm1               ; FINALLY, divide the FP-representation of *ptr1 by *ptr2
  movss     DWORD PTR [esp], xmm0    ; store this result onto the stack, in the memory we reserved
  fld       DWORD PTR [esp]          ; load this result onto the top of the x87 FPU
                                     ;  (the 32-bit calling convention requires floating-point values be returned this way)

  add       esp,  4                  ; clean up the space we allocated on the stack
  ret

Notice that the strategy here is to break each unsigned 32-bit integer into its two 16-bit halves; each half fits comfortably within the signed range, so CVTSI2SS can convert it safely. The upper half is converted to a floating-point representation and multiplied by 65536.0 (2^16) to shift it back into position. Then the lower half is converted, and the two floating-point values are added together. This is done twice, once for each 32-bit input value (see the two "groups" of instructions). Finally, the two resulting floating-point representations are divided, and the result is returned.

The logic is similar to what the unoptimized code does, but…well, more optimal. In particular, redundant instructions are removed and the algorithm is generalized so that branching on the signed-ness is unnecessary. This speeds things up because mispredicted branches are slow.
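GCC's 16-bit-halves strategy can be sketched in C (a hypothetical helper conveying the idea, not the compiler's exact code):

```c
#include <stdint.h>

/* Sketch of GCC's 32-bit strategy: split the unsigned value into 16-bit
 * halves (each fits in the signed range), convert each half, then
 * recombine as hi * 2^16 + lo. */
static float u32_to_float_split(uint32_t x)
{
    float hi = (float)(int32_t)(x >> 16);      /* upper 16 bits */
    float lo = (float)(int32_t)(x & 0xFFFFu);  /* lower 16 bits */
    return hi * 65536.0f + lo;                 /* shift back and combine */
}
```

The multiply and add are exact except for the final rounding, so the result matches a direct conversion.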

Note that Clang uses a slightly different strategy and is able to generate even more optimal code here than GCC does:

DivideAsFloat(unsigned int*, unsigned int*):
   push     eax                       ; reserve 4 bytes on the stack

   mov      eax,  DWORD PTR [esp+12]  ; get the pointer specified in 'ptr2'
   mov      ecx,  DWORD PTR [esp+8]   ; get the pointer specified in 'ptr1'
   movsd    xmm1, QWORD PTR 4841369599423283200 ; load a constant (0x4330000000000000, the double 2^52) into XMM1

   movd     xmm0, DWORD PTR [ecx]     ; dereference the pointer specified by 'ptr1',
                                      ;  and load the bits directly into XMM0
   movd     xmm2, DWORD PTR [eax]     ; dereference the pointer specified by 'ptr2'
                                      ;  and load the bits directly into XMM2

   orpd     xmm0, xmm1                ; bitwise-OR *ptr1's raw bits with the magic number
   orpd     xmm2, xmm1                ; bitwise-OR *ptr2's raw bits with the magic number

   subsd    xmm0, xmm1                ; subtract the magic number from the result of the OR
   subsd    xmm2, xmm1                ; subtract the magic number from the result of the OR

   cvtsd2ss xmm0, xmm0                ; convert *ptr1's value from double-precision to single-precision in place
   xorps    xmm1, xmm1                ; zero register to break dependencies
   cvtsd2ss xmm1, xmm2                ; convert *ptr2's value from double-precision to single-precision, putting result in XMM1

   divss    xmm0, xmm1                ; FINALLY, do the division on the single-precision FP values
   movss    DWORD PTR [esp], xmm0     ; store this result onto the stack, in the memory we reserved
   fld      DWORD PTR [esp]           ; load this result onto the top of the x87 FPU
                                      ;  (the 32-bit calling convention requires floating-point values be returned this way)

   pop      eax                       ; clean up the space we allocated on the stack
   ret

It doesn't even use the CVTSI2SS instruction! Instead, it loads the raw integer bits and ORs them with the bit pattern of the double 2^52, producing the double value 2^52 + x; subtracting 2^52 then yields x exactly as a double. It then uses CVTSD2SS to convert each of these double-precision values into a single-precision value. Finally, it divides the two single-precision floating-point values and arranges to return the result.
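Clang's bit-pattern trick can be sketched in C (a hypothetical helper; `memcpy` stands in for the raw `movd` load):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of Clang's strategy: OR the 32-bit value into the low bits of
 * the double 2^52 (bit pattern 0x4330000000000000), giving the double
 * 2^52 + x exactly; subtracting 2^52 leaves x as a double. */
static float u32_to_float_bits(uint32_t x)
{
    uint64_t bits = 0x4330000000000000ULL | (uint64_t)x;
    double d;
    memcpy(&d, &bits, sizeof d);             /* reinterpret bits as a double */
    return (float)(d - 4503599627370496.0);  /* subtract 2^52, narrow to float */
}
```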

So when targeting 32-bit, compilers do have to deal with the difference between signed and unsigned integers, but they do so in different ways, using different strategies—some probably more optimal than others. And this is why looking at optimized code is far more enlightening, aside from the fact that it is what will actually be executing on your client's machine.