WinAPI _Interlocked* intrinsic functions for char, short

I need to use the _Interlocked* functions on char or short, but they take a long pointer as input. There seems to be a function _InterlockedExchange8, but I can't find any documentation for it, so it looks like an undocumented feature. The compiler was also unable to find an _InterlockedAdd8 function. I would appreciate any information on these functions, recommendations on whether to use them, and other solutions as well.

update 1

I'll try to simplify the question. How can I make this work?

struct X
{
    char data;
};

X atomic_exchange(X another)
{
    return _InterlockedExchange( ??? );
}

I see two possible solutions

  1. Use _InterlockedExchange8
  2. Cast another to long, do exchange and cast result back to X

The first one is obviously a bad solution. The second one looks better, but how do I implement it?

update 2

What do you think about something like this?

template <typename T, typename U>
class padded_variable
{
public:
    padded_variable(T v): var(v) {}
    padded_variable(U v): var(*static_cast<T*>(static_cast<void*>(&v))) {}
    U& cast()
    {
        return *static_cast<U*>(static_cast<void*>(&var));
    }
    T& get()
    {
        return var;
    }
private:
    T var;
    char padding[sizeof(U) - sizeof(T)];
};

struct X
{
    char data;
};

template <typename T, int S = sizeof(T)> class var;
template <typename T> class var<T, 1>
{
public:
    var(): data(T()) {}
    T atomic_exchange(T another)
    {
        padded_variable<T, long> xch(another);
        padded_variable<T, long> res(_InterlockedExchange(&data.cast(), xch.cast()));
        return res.get();
    }
private:
    padded_variable<T, long> data;
};

Thanks.

There are 4 answers

11
Ben Voigt On

Why do you want to use smaller data types? So you can fit a bunch of them in a small memory space? That's just going to lead to false sharing and cache line contention.

Whether you use locking or lockless algorithms, it's ideal to have your data in blocks of at least 128 bytes (or whatever the cache line size is on your CPU) that are only used by a single thread at a time.
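As a rough illustration of that layout, here is a minimal sketch that gives each thread its own line-aligned counter. The names `kCacheLineSize`, `PaddedCounter`, `counters` and `bump` are made up for this example, the value 128 is simply the figure mentioned above (the real line size depends on the CPU), and `std::atomic` is used instead of the Interlocked intrinsics so the snippet compiles outside MSVC.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache-line size; 128 is the figure from the text above,
// the real value depends on the CPU.
constexpr std::size_t kCacheLineSize = 128;

// One counter per cache line, so updates made by different threads
// never land on the same line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLineSize, "line-aligned");

// Hypothetical per-thread slots: thread i only ever touches counters[i].
PaddedCounter counters[4];

// Increments the slot owned by thread_index and returns the new value.
long bump(std::size_t thread_index) {
    return counters[thread_index].value.fetch_add(1) + 1;
}
```

The cost is memory: each counter occupies a full line instead of a few bytes, which is the trade-off this answer argues for.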

9
jalf On

Creating a new answer because your edit changed things a bit:

  • Use _InterlockedExchange8
  • Cast another to long, do exchange and cast result back to X

The first simply won't work. Even if the function existed, it would only let you atomically update one byte at a time, which means that an object larger than a byte would be updated in a series of steps that are not atomic as a whole.

The second doesn't work either, unless X is a POD type that has the same size as a long and is aligned on a sizeof(long) boundary.

In order to solve this problem you need to narrow down what types X might be. First, of course, is it guaranteed to be a POD type? If not, you have an entirely different problem, as you can't safely treat non-POD types as raw memory bytes.

Second, what sizes may X have? The Interlocked functions can handle 16, 32 and, depending on circumstances, maybe 64 or even 128 bit widths.

Does that cover all the cases you can encounter?
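If X is a small POD type, one way to satisfy the size and alignment requirements is to store it inside a full 32-bit word from the start and copy it in and out with memcpy. The sketch below assumes exactly that. `small_atomic`, `pack` and `unpack` are hypothetical names, and `std::atomic<std::uint32_t>::exchange` stands in for `_InterlockedExchange` so the code is not tied to MSVC.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct X {
    char data;
};

// Keeps a small trivially copyable T inside a 32-bit atomic word, so the
// exchange always operates on a correctly sized and aligned object.
template <typename T>
class small_atomic {
    static_assert(std::is_trivially_copyable<T>::value, "T must be copyable with memcpy");
    static_assert(sizeof(T) <= sizeof(std::uint32_t), "T must fit in 32 bits");

public:
    explicit small_atomic(T initial = T()) : word_(pack(initial)) {}

    // Atomically stores `desired` and returns the previous value.
    T exchange(T desired) { return unpack(word_.exchange(pack(desired))); }

    T load() const { return unpack(word_.load()); }

private:
    // Copies the bytes of v into a zeroed 32-bit word.
    static std::uint32_t pack(T v) {
        std::uint32_t w = 0;
        std::memcpy(&w, &v, sizeof(T));
        return w;
    }
    // Copies the first sizeof(T) bytes of the word back into a T.
    static T unpack(std::uint32_t w) {
        T v;
        std::memcpy(&v, &w, sizeof(T));
        return v;
    }

    std::atomic<std::uint32_t> word_;
};
```

For example, `small_atomic<X> a(X{'a'}); X old = a.exchange(X{'b'});` leaves `'a'` in `old.data`. Using memcpy instead of pointer casts sidesteps the aliasing questions raised by reinterpreting a `char`-sized object as a `long`.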

If not, you may have to abandon these atomic operations, and settle for plain old locks. Lock a Mutex to ensure that only one thread touches these objects at a time.
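A lock-based fallback could look roughly like the following. This is only a sketch: `locked_value` and `Big` are invented for the example, and `std::mutex` is an assumption standing in for whatever lock the program already uses (a Win32 critical section would serve the same purpose).

```cpp
#include <mutex>
#include <utility>

// Example of a type that is too large for a single interlocked operation.
struct Big {
    int a, b, c;
};

// Guards a value of any copyable type with a mutex instead of relying
// on a fixed-width atomic instruction.
template <typename T>
class locked_value {
public:
    explicit locked_value(T initial) : value_(std::move(initial)) {}

    // Stores `desired` and returns the previous value; the mutex ensures
    // only one thread is inside the swap at a time.
    T exchange(T desired) {
        std::lock_guard<std::mutex> guard(mutex_);
        std::swap(value_, desired);
        return desired;  // now holds the old value
    }

private:
    std::mutex mutex_;
    T value_;
};
```

This works for any size of X, at the price of taking a lock on every access.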

1
jalf On

Well, you have to make do with the functions available. _InterlockedIncrement and _InterlockedCompareExchange are available in 16-bit and 32-bit variants (the latter in a 64-bit variant as well), and a few other interlocked intrinsics may be available in 16-bit versions too, but InterlockedAdd doesn't seem to be, and there seem to be no byte-sized Interlocked intrinsics or functions at all.

So... you need to take a step back and figure out how to solve your problem without an _InterlockedAdd8.

Why are you working with individual bytes in any case? Stick to int-sized objects unless you have a really good reason to use something smaller.
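One possible way to get by without a byte-sized add is a compare-exchange loop on the enclosing 32-bit word. The sketch below is an assumption about how that could be done, not an existing API: `fetch_add_byte` is a hypothetical helper, and `std::atomic::compare_exchange_weak` plays the role that `_InterlockedCompareExchange` would play in MSVC-specific code.

```cpp
#include <atomic>
#include <cstdint>

// Atomically adds `delta` to byte `index` of `word` and returns that
// byte's previous value. `index` is 0..3, counted from the least
// significant byte of the value (not by memory address).
std::uint8_t fetch_add_byte(std::atomic<std::uint32_t>& word,
                            unsigned index, std::uint8_t delta) {
    const unsigned shift = index * 8;
    const std::uint32_t mask = std::uint32_t(0xFF) << shift;

    std::uint32_t old_word = word.load();
    std::uint32_t new_word;
    do {
        const std::uint32_t old_byte = (old_word & mask) >> shift;
        const std::uint32_t new_byte = (old_byte + delta) & 0xFF;  // wraps modulo 256
        // Replace only the selected byte; the other three stay untouched.
        new_word = (old_word & ~mask) | (new_byte << shift);
        // On failure compare_exchange_weak reloads old_word, and we retry.
    } while (!word.compare_exchange_weak(old_word, new_word));

    return static_cast<std::uint8_t>((old_word & mask) >> shift);
}
```

Note that the four bytes still share one word, so threads updating neighbouring bytes will contend with each other and retry more often.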

0
Steve-o On

It's pretty easy to write 8-bit and 16-bit interlocked functions; the reason they're not included in WinAPI is IA64 portability. If you want to support Win64, the assembler cannot be inline, as MSVC no longer supports inline assembly for 64-bit targets. Implemented as external functions with MASM64, they will not be as fast as inline code or intrinsics, so you are wiser to investigate promoting your algorithms to use 32-bit and 64-bit atomic operations instead.

Example interlocked API implementation: intrin.asm