Largest data type which can be fetch-ANDed atomically?


I wanted to try and atomically reset 256 bits using something like this:

#include <x86intrin.h>
#include <iostream>
#include <array>
#include <atomic>

int main(){

    std::array<std::atomic<__m256i>, 10> updateArray;

    __m256i allZeros = _mm256_setzero_si256();

    updateArray[0].fetch_and(allZeros);
}

but I get compiler errors saying the element has no fetch_and(). Is this impossible because a 256-bit type is too large to guarantee atomicity?

Is there any other way I can implement this? I am using GCC.

If not, what is the largest type I can reset atomically: 64 bits?

EDIT: Could any AVX instructions perform the fetch-AND atomically?


There are 2 answers

Mats Petersson On

So there are a few different things that need to be solved:

  1. What can the processor do?
  2. What do we mean by atomically?
  3. Can you make the compiler generate code for what the processor can do?
  4. Does the C++11/14 standard support that?

For #1 and #2:

On x86, there are instructions that operate on 8, 16, 32, 64, 128, 256 and 512 bits. A single processor will [at least if the data is aligned to its own size] perform such an operation atomically with respect to itself. However, for an operation to be "truly atomic", it must also prevent races on the update of that data [in other words, prevent another processor from reading, modifying and writing back the same location at the same time]. Aside from a small number of "implied lock" instructions (XCHG with a memory operand, for example), this is done by adding a LOCK prefix to the instruction, which performs the right kind of cache-talk [technical term] with the other processors in the system to ensure that ONLY THIS processor can update the data.

We can't use VEX instructions with a LOCK prefix. From Intel's manual:

Any VEX-encoded instruction with a LOCK prefix preceding VEX will #UD

AVX instructions need a VEX prefix, and #UD is the invalid-opcode (undefined instruction) exception. In other words, the processor raises an exception if we try to execute such code.

So it is 100% certain that the processor cannot do an atomic read-modify-write on 256 bits at a time. This answer discusses the atomicity of SSE instructions: SSE instructions: which CPUs can do atomic 16B memory operations?

#3 is pretty meaningless if the instruction isn't valid.

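#4: the standard supports std::atomic<uintmax_t>, and if uintmax_t happened to be 128 or 256 bits wide, you could certainly do that. I'm not aware of any implementation where uintmax_t is 128 bits or wider, but the language doesn't prevent it. Note that fetch_and is only provided by the integral specializations of std::atomic, which is why std::atomic<__m256i> has no such member.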

If the requirement for "atomic" isn't as strong as "need to be 100% certain that no other processor updates this at the same time", then regular SSE, AVX or AVX512 instructions would suffice. There will still be race conditions, though, if two processors (cores) do read/modify/write operations on the same piece of memory simultaneously.
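If atomicity per 64-bit word is enough (an assumption about the use case, not something the question states), one option is to store the 256 bits as four separately atomic words. The sketch below is only an illustration: Bits256, and_words, all_zero and demo_and_words are hypothetical names, not part of any library.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type: 256 bits stored as four independently atomic 64-bit words.
using Bits256 = std::array<std::atomic<std::uint64_t>, 4>;

// ANDs each word with the matching mask word. Each fetch_and is atomic for its
// own word; the four together are NOT one atomic 256-bit operation, so another
// thread can observe a mix of old and new words.
inline void and_words(Bits256& bits, const std::array<std::uint64_t, 4>& mask) {
    for (std::size_t i = 0; i < bits.size(); ++i) {
        bits[i].fetch_and(mask[i]);
    }
}

// Returns true if every word currently reads as zero.
inline bool all_zero(const Bits256& bits) {
    for (const auto& word : bits) {
        if (word.load() != 0) return false;
    }
    return true;
}

// Single-threaded demonstration: set every bit, then clear all 256 bits.
inline bool demo_and_words() {
    Bits256 bits;
    for (auto& word : bits) word.store(~std::uint64_t{0});
    and_words(bits, std::array<std::uint64_t, 4>{});  // all-zero mask
    assert(all_zero(bits));
    return all_zero(bits);
}
```

Readers that need a consistent snapshot of all four words would still need a lock or some other protocol on top of this.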

The largest atomic read-modify-write on x86 is LOCK CMPXCHG16B, which stores the 128-bit value held in two 64-bit integer registers (RCX:RBX) to memory if the value in two other registers (RDX:RAX) MATCHES the value in memory. So you could come up with something that reads one 128-bit value, ANDs out some bits, and then stores the new value back atomically if nothing else got in there first. If something did, you have to repeat the operation, and of course it's not a single atomic AND operation either.
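The read, AND, compare-exchange, retry pattern described above can be sketched with std::atomic. This is a minimal illustration, not a definitive implementation: cas_fetch_and and demo_cas_fetch_and are hypothetical names, and the demonstration uses a 64-bit word so that it builds without special flags. With GCC on x86-64 the same template should also accept unsigned __int128, but that is an assumption about the toolchain: it typically needs -mcx16 and linking with -latomic, and it may not be reported as lock-free.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a standard library function): emulates fetch_and
// with a compare-exchange retry loop. Returns the value that was in `target`
// immediately before the successful update.
template <typename T>
T cas_fetch_and(std::atomic<T>& target, T mask) {
    T expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak overwrites `expected` with the value
    // currently in memory, so the next attempt recomputes `expected & mask`
    // from fresh data.
    while (!target.compare_exchange_weak(expected,
                                         static_cast<T>(expected & mask),
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}

// Single-threaded demonstration on a 64-bit word.
inline std::uint64_t demo_cas_fetch_and() {
    std::atomic<std::uint64_t> word{0xFFFFull};
    std::uint64_t old = cas_fetch_and<std::uint64_t>(word, 0x00F0ull);
    assert(old == 0xFFFFull);
    return word.load();  // 0x00F0
}
```

The loop behaves like one fetch_and as far as other threads can observe, but under contention it may execute the compare-exchange several times.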

Of course, on platforms other than Intel and AMD, the behaviour may be different.

Peter Cordes On

The operation can only be atomic if the memory read/modify/write all happens as a single operation, e.g. lock and [mem], rax is atomic. (Intel's instruction reference manual explicitly says that the lock prefix works with and to make it atomic.)

Since typical AVX instructions like VPAND can have memory source operands (combining a memory read with modifying a register), but not memory destination operands (read/modify/write), this whole idea isn't going to work.

Mats Petersson's answer does a good job of explaining what you can do, but I just wanted to point out why normal AVX instructions can't possibly be used as single-instruction atomic operations. You have to load, modify, and cmpxchg, and then try again if something else modified the memory between the load and the cmpxchg.