CMPXCHG16B correct?

8.4k views Asked by At

This doesn't exactly seem to be right although I am unsure why. Advice would be great as the documentation for CMPXCHG16B is pretty minimal (I don't own any intel manuals...)

template<>
inline bool cas(volatile types::uint128_t *src, types::uint128_t cmp, types::uint128_t with)
{
    /*
    Description:
     The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers 
     with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, 
     and the RCX:RBX value is copied to the memory location. 
     Otherwise, the ZF flag is cleared, and the memory value is copied to RDX:RAX.
     */
    uint64_t * cmpP = (uint64_t*)&cmp;
    uint64_t * withP = (uint64_t*)&with;
    unsigned char result = 0;
    __asm__ __volatile__ (
    "LOCK; CMPXCHG16B %1\n\t"
    "SETZ %b0\n\t"
    : "=q"(result)  /* output */ 
    : "m"(*src), /* input */
      //what to compare against
      "rax"( ((uint64_t) (cmpP[1])) ), //lower bits
      "rdx"( ((uint64_t) (cmpP[0])) ),//upper bits
      //what to replace it with if it was equal
      "rbx"( ((uint64_t) (withP[1])) ), //lower bits
      "rcx"( ((uint64_t) (withP[0]) ) )//upper bits
    : "memory", "cc", "rax", "rdx", "rbx","rcx" /* clobbered items */
    );
    return result;
}

When running with an example I am getting 0 when it should be 1. Any ideas?

5

There are 5 answers

4
luke h On BEST ANSWER

Noticed a few issues,

(1) The main problem is the constraints, "rax" doesn't do what it looks like, rather the first character "r" lets gcc use any register.

(2) Not sure how your storing types::uint128_t, but assuming the standard little endian for x86 platforms, then the high and low dwords are also swapped around.

(3) Taking the address of something and casting it to something else can break aliasing rules. Depends on how your types::uint128_t is defined as to wether or not this is an issue (fine if it is a struct of two uint64_t's). GCC with -O2 will optimize assuming aliasing rules are not violated.

(4) *src should really be marked as an output, rather than specifying memory clobber. but this is really more of a performance rather than correctness issue. similarly rbx and rcx do not need to specified as clobbered.

Here is a a version that works,

#include <stdint.h>

namespace types
{
    // alternative: union with  unsigned __int128
    struct uint128_t
    {
        uint64_t lo;
        uint64_t hi;
    }
    __attribute__ (( __aligned__( 16 ) ));
}

template< class T > inline bool cas( volatile T * src, T cmp, T with );

template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp, types::uint128_t with )
{
    // cmp can be by reference so the caller's value is updated on failure.

    // suggestion: use __sync_bool_compare_and_swap and compile with -mcx16 instead of inline asm
    bool result;
    __asm__ __volatile__
    (
        "lock cmpxchg16b %1\n\t"
        "setz %0"       // on gcc6 and later, use a flag output constraint instead
        : "=q" ( result )
        , "+m" ( *src )
        , "+d" ( cmp.hi )
        , "+a" ( cmp.lo )
        : "c" ( with.hi )
        , "b" ( with.lo )
        : "cc", "memory" // compile-time memory barrier.  Omit if you want memory_order_relaxed compile-time ordering.
    );
    return result;
}

int main()
{
    using namespace types;
    uint128_t test = { 0xdecafbad, 0xfeedbeef };
    uint128_t cmp = test;
    uint128_t with = { 0x55555555, 0xaaaaaaaa };
    return ! cas( & test, cmp, with );
}
1
Tomek On

All Intel documentation is available for free: IntelĀ® 64 and IA-32 Architectures Software Developer's Manuals.

0
chamibuddhika On

I got it compiling for g++ with a slight change (removing oword ptr in cmpxchg16b instruction). But it doesn't seem to overwrite the memory as required though I may be wrong. [See update] Code is given below followed by output.

#include <stdint.h>
#include <stdio.h>

namespace types
{
  struct uint128_t
  {
    uint64_t lo;
    uint64_t hi;
  }
  __attribute__ (( __aligned__( 16 ) ));
 }

 template< class T > inline bool cas( volatile T * src, T cmp, T with );

 template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp,  types::uint128_t with )
 {
   bool result;
   __asm__ __volatile__
   (
    "lock cmpxchg16b %1\n\t"
    "setz %0"
    : "=q" ( result )
    , "+m" ( *src )
    , "+d" ( cmp.hi )
    , "+a" ( cmp.lo )
    : "c" ( with.hi )
    , "b" ( with.lo )
    : "cc"
   );
   return result;
}

void print_dlong(char* address) {

  char* byte_array = address;
  int i = 0;
  while (i < 4) {
     printf("%02X",(int)byte_array[i]);
     i++;
  }

  printf("\n");
  printf("\n");

}

int main()
{
  using namespace types;
  uint128_t test = { 0xdecafbad, 0xfeedbeef };
  uint128_t cmp = test;
  uint128_t with = { 0x55555555, 0xaaaaaaaa };

  print_dlong((char*)&test);
  bool result = cas( & test, cmp, with );
  print_dlong((char*)&test);

  return result;
}

Output

FFFFFFADFFFFFFFBFFFFFFCAFFFFFFDE


55555555

Not sure the output makes sense to me. I was expecting the before value to be something like 00000000decafbad00000feedbeef according to struct definition. But bytes seem to be spread out within words. Is that due to aligned directive? Btw the CAS operation seem to return the correct return value though. Any help in deciphering this?

Update : I just did some debugging with memory inspection with gdb. There the correct values are shown there. So I guess this must be a problem with my print_dlong procedure. Feel free to correct it. I am leaving this reply as it is to be corrected, since a corrected version of this would be instructive of the cas operation with printed results.

1
Cory Nelson On

It's good to note that if you're using GCC, you don't need to use inline asm to get at this instruction. You can use one of the __sync functions, like:

template<>
inline bool cas(volatile types::uint128_t *src,
                types::uint128_t cmp,
                types::uint128_t with)
{
    return __sync_bool_compare_and_swap(src, cmp, with);
}

Microsoft has a similar function for VC++:

__int64 exchhi = __int64(with >> 64);
__int64 exchlo = (__int64)(with);

return _InterlockedCompareExchange128(a, exchhi, exchlo, &cmp) != 0;
13
Swiss Frank On

Here are some alternatives compared:

  1. Inline assembly, such as @luke h's answer.

  2. __sync_bool_compare_and_swap(): a GNU extension, gcc/clang/ICC-only, deprecated, pseudo-function that the compiler will issue a CMPXCHG16B instruction for at least with -mcx16

  3. atomic_compare_exchange_weak() / strong: the C11 pseudo-function that does what atomic<> does in C++11. For GNU, this will NOT issue a CMPXCHG16B in gcc 7 and later, but instead calls libatomic (which must therefore be linked to). Dynamically linked libatomic will decide what version of functions to use based on what the CPU is capable of, and on machines where the CPU is capable of CMPXCHG16B it will use that.

  4. apparently clang will still inline CMPXCHG16B for atomic_compare_exchange_weak() or strong

I haven't tried the machine language, but looking at the disassembly of (2), it looks perfect and I don't see how (1) could beat it. (I have little knowledge of x86 but programmed a lot of 6502.) Further, there's ample advice never to use assembly if you can avoid it, and it can be avoided at least with gcc/clang. So I can cross (1) off the list.

Here's the code for (2) in gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC):

Thread 2 "mybinary" hit Breakpoint 1, MyFunc() at myfile.c:586
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407262 <MyFunc+904>:       49 89 c2        mov    %rax,%r10
   0x0000000000407265 <MyFunc+907>:       49 89 d3        mov    %rdx,%r11
(gdb) n
587                                                  was, new ) ) {
=> 0x0000000000407268 <MyFunc+910>:       48 8b 45 a0     mov    -0x60(%rbp),%rax
   0x000000000040726c <MyFunc+914>:       48 8b 55 a8     mov    -0x58(%rbp),%rdx
(gdb) n
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407270 <MyFunc+918>:       48 c7 c6 00 d3 42 00    mov    $0x42d300,%rsi
   0x0000000000407277 <MyFunc+925>:       4c 89 d3        mov    %r10,%rbx
   0x000000000040727a <MyFunc+928>:       4c 89 d9        mov    %r11,%rcx
   0x000000000040727d <MyFunc+931>:       f0 48 0f c7 8e 70 04 00 00      lock cmpxchg16b 0x470(%rsi)
   0x0000000000407286 <MyFunc+940>:       0f 94 c0        sete   %al

Then doing hammer tests of (2) and (3) on real-world algos, I am seeing no real performance difference. Even in theory, (3) only has the overhead of one additional function call and some work in the libatomic wrapper function, including a branch on whether the CAS succeeded or not.

(With lazy dynamic linking, the first call into the libatomic function will actually run an init function that uses CPUID to check if your CPU has cmpxchg16b. Then it will update the GOT function-pointer that the PLT stub jumps through, so future calls will go straight to libat_compare_exchange_16_i1 that uses lock cmpxchg16b. The i1 suffix in the name is from GCC's ifunc mechanism for function multi-versioning; if you ran it on a CPU without cmpxchg16b support, it would resolve the shared library function to a version that used locking.)

In my real-world-ish hammer tests, that function call overhead is lost in the amount of CPU taken by the functionality the lock-free mechanism is protecting. Therefore, I don't see a reason to use the __sync, functions, that are compiler-specific and deprecated to boot.

Here's the assembly for the libatomic wrapper that gets called for each .compare_exchange_weak(), from single-stepping through the assembly on my Fedora 31. If compiled with -fno-plt, a callq *__atomic_compare_exchange_16@GOTPCREL(%rip) will get inlined into the caller, avoiding the PLT and running the CPU-detection early, at program startup time instead of on the first call.

Thread 2 "tsquark" hit Breakpoint 2, 0x0000000000403210 in 
__atomic_compare_exchange_16@plt ()
=> 0x0000000000403210 <__atomic_compare_exchange_16@plt+0>:     ff 25 f2 8e 02 00       jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
(gdb) disas
Dump of assembler code for function __atomic_compare_exchange_16@plt:
=> 0x0000000000403210 <+0>:     jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
   0x0000000000403216 <+6>:     pushq  $0x1e
   0x000000000040321b <+11>:    jmpq   0x403020
End of assembler dump.
(gdb) s
Single stepping until exit from function __atomic_compare_exchange_16@plt,
...

0x00007ffff7fab250 in libat_compare_exchange_16_i1 () from /lib64/libatomic.so.1
=> 0x00007ffff7fab250 <libat_compare_exchange_16_i1+0>: f3 0f 1e fa     endbr64
(gdb) disas
Dump of assembler code for function libat_compare_exchange_16_i1:
=> 0x00007ffff7fab250 <+0>:     endbr64
   0x00007ffff7fab254 <+4>:     mov    (%rsi),%r8
   0x00007ffff7fab257 <+7>:     mov    0x8(%rsi),%r9
   0x00007ffff7fab25b <+11>:    push   %rbx
   0x00007ffff7fab25c <+12>:    mov    %rdx,%rbx
   0x00007ffff7fab25f <+15>:    mov    %r8,%rax
   0x00007ffff7fab262 <+18>:    mov    %r9,%rdx
   0x00007ffff7fab265 <+21>:    lock cmpxchg16b (%rdi)
   0x00007ffff7fab26a <+26>:    mov    %r9,%rcx
   0x00007ffff7fab26d <+29>:    xor    %rax,%r8
   0x00007ffff7fab270 <+32>:    mov    $0x1,%r9d
   0x00007ffff7fab276 <+38>:    xor    %rdx,%rcx
   0x00007ffff7fab279 <+41>:    or     %r8,%rcx
   0x00007ffff7fab27c <+44>:    je     0x7ffff7fab288 <libat_compare_exchange_16_i1+56>
   0x00007ffff7fab27e <+46>:    mov    %rax,(%rsi)
   0x00007ffff7fab281 <+49>:    xor    %r9d,%r9d
   0x00007ffff7fab284 <+52>:    mov    %rdx,0x8(%rsi)
   0x00007ffff7fab288 <+56>:    mov    %r9d,%eax
   0x00007ffff7fab28b <+59>:    pop    %rbx
   0x00007ffff7fab28c <+60>:    retq
End of assembler dump.

The only benefit I've found to using (2) is if your machine didn't ship with a libatomic (true of older Red Hat's) and you don't have the ability to request sysadmins provide this or don't want to count on them installing the right one. I personally downloaded one in source code and built it incorrectly, so that 16-byte swaps ended up using a mutex: disaster.

I haven't tried (4). Or rather I started to but so many warnings/errors on code that gcc passed without comment that I wasn't able to get it compiled in the budgeted time.

Note that while options 2, 3, and 4 seem like the same code or nearly the same code should work, in fact all three have substantially different checks and warnings and even if you have one of the three compiling fine and without warning in -Wall, you may get a lot more warnings or errors if you try one of the other options. The __sync* pseudo-functions aren't well-documented. (In fact the documentation only mentions 1/2/4/8 bytes, not that they work for 16 bytes. Meanwhile, they work "kind of" like function templates but you don't see the template and they seem finicky about whether the first and second arg types actually are the same type in a way that atomic_* aren't.) In short it wasn't the 3 minute job to compare 2, 3, and 4 that you might guess.