Swap memory pointers atomically on CUDA

1.3k views Asked by At

I have two pointers in memory and I want to swap it atomically but atomic operation in CUDA support only int types. There is a way to do the following swap?

classA* a1 = malloc(...);
classA* a2 = malloc(...);
atomicSwap(a1,a2);
2

There are 2 answers

0
Fabio T. On

I managed to have the needed behaviour, it is not atomic swap but still safe. The context was a monotonic Linked List working both on CPU and GPU:

template<typename T>
union readablePointer
{
    T* ptr;
    unsigned long long int address;
};

template<typename T>
struct LinkedList
{

    struct Node
    {
        T value;
        readablePointer<Node> previous;
    };

    Node start;
    Node end;

    int size;

    __host__ __device__ void initialize()
    {
        size = 0;

        start.previous.ptr = nullptr;
        end.previous.ptr = &start;
    }

    __host__ __device__ void push_back(T value)
    {
        Node* node = nullptr;
        malloc(&node, sizeof(Node));

        readablePointer<Node> nodePtr;
        nodePtr.ptr = node;

        nodePtr.ptr->value = value;

#ifdef __CUDA_ARCH__
        nodePtr.ptr->previous.address = atomicExch(&end.previous.address, nodePtr.address);
        atomicAdd(&size,1);
#else
        nodePtr.ptr->previous.address = end.previous.address;
        end.previous.address = nodePtr.address;
        size += 1;
#endif

    }

    __host__ __device__ T pop_back()
    {
        assert(end.previous.ptr != &start);

        readablePointer<Node> lastNodePtr;
        lastNodePtr.ptr = nullptr;

#ifdef __CUDA_ARCH__
        lastNodePtr.address = atomicExch(&end.previous.address,end.previous.ptr->previous.address);
        atomicSub(&size,1);
#else
        lastNodePtr.address = end.previous.address;
        end.previous.address = end.previous.ptr->previous.address;
        size -= 1;
#endif
        T toReturn = lastNodePtr.ptr->value;

        free(lastNodePtr.ptr);

        return toReturn;
    }

    __host__ __device__ void clear()
    {
        while(size > 0)
        {
            pop_back();
        }
    }
};
0
einpoklum On

When writing device-side code...

While CUDA provides atomics, they can't cover multiple (possibly remote) memory locations at once.

To perform this swap, you will need to "protect" access to both these values with something like mutex, and have whoever wants to write values to them take a hold of the mutex for the duration of the critical section (like in C++'s host-side std::lock_guard). This can be done using CUDA's actual atomic facilities, e.g. compare-and-swap, and is the subject of this question:

Implementing a critical section in CUDA

A caveat to the above is mentioned by @RobertCrovella: If you can make do with, say, a pair of 32-bit offsets rather than a 64-bit pointer, then if you were to store them in a 64-bit aligned struct, you could use compare-and-exchange on the whole struct to implement an atomic swap of the whole struct.

... but is it really device side code?

Your code actually doesn't look like something one would run on the device: Memory allocation is usually (though not always) done from the host side before you launch your kernel and do actual work. If you could make sure these alterations only happen on the host side (think CUDA events and callbacks), and that device-side code will not be interfered with by them - you can just use your plain vanilla C++ facilities for concurrent programming (like lock_guard I mentioned above).