I wanted to replace the pthread_spinlock_t example with my own spinlock implementation. However, my implementation performs far worse: the pthread_spinlock_t version takes around 0.9 s, while my own implementation takes around 2.4 s. Can someone explain what is missing in my implementation, or where there is further room for improvement? I believe I am missing something related to memory ordering. Here is my implementation:
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <atomic>
#include <list>

#define LOOPS 10000000

using namespace std;

list<int> the_list;

//pthread_spinlock_t spinlock;
std::atomic_flag flag = ATOMIC_FLAG_INIT;

// glibc >= 2.30 declares its own gettid(), so use a distinct name
pid_t my_gettid() { return syscall(__NR_gettid); }

void *consumer(void *ptr)
{
    printf("Consumer TID %lu\n", (unsigned long)my_gettid());

    while (1)
    {
        //pthread_spin_lock(&spinlock);
        while (flag.test_and_set(std::memory_order_acquire));

        if (the_list.empty())
        {
            //pthread_spin_unlock(&spinlock);
            flag.clear(std::memory_order_release);
            break;
        }

        the_list.front();
        the_list.pop_front();

        //pthread_spin_unlock(&spinlock);
        flag.clear(std::memory_order_release);
    }

    return NULL;
}

int main()
{
    int i;
    pthread_t thr1, thr2;
    struct timeval tv1, tv2;

    //pthread_spin_init(&spinlock, 0);

    // Creating the list content...
    for (i = 0; i < LOOPS; i++)
        the_list.push_back(i);

    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);

    pthread_create(&thr1, NULL, consumer, NULL);
    pthread_create(&thr2, NULL, consumer, NULL);

    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    // %06ld zero-pads the microseconds so e.g. 2.050000 doesn't print as 2.5
    printf("Result - %ld.%06ld\n", tv2.tv_sec - tv1.tv_sec,
           tv2.tv_usec - tv1.tv_usec);

    //pthread_spin_destroy(&spinlock);
    return 0;
}
I was expecting to achieve the performance of pthread_spin_lock with my own implementation.
The two things pthread_spin_lock does differently on contention are:

- spinning read-only, waiting until the lock looks available, before retrying the atomic RMW, and
- a pause instruction before retrying the atomic RMW.

(I'm assuming you're on an x86-64 CPU? You didn't mention it. But I'm guessing Intel, not AMD, based on how much pause helps in this weird case where fine-grained multithreading actively hurts.)

Differences that don't matter:

- pthread_spin_lock uses lock dec dword ptr [rdi] as its atomic RMW; flag.test_and_set uses xchg to store and get the old value in a register for test al,al.
- pthread_spinlock_t is a 32-bit type, vs. atomic_flag being 8-bit. (lock dec requires a wide-enough type to not wrap around back to 1 = unlocked.)
- Both unlock with a plain mov store, since they only need release semantics. (A seq_cst store like flag.clear(seq_cst) would be done with xchg, since the implicit lock prefix makes it a full memory barrier.)

I found what pthread_spin_lock does by setting a breakpoint before the call and single-stepping the asm in GDB (layout asm). The asm is also visible via objdump -drwC -Mintel /lib/libc.so.6 | less, searching for spin_lock.

Why these matter so much in this case
This case of extreme contention (threads trying to take the lock again right after unlocking, with no useful work in between) magnifies the effect of these differences.
When one thread backs off due to the first attempt failing to get the lock, it gives the other thread time to complete multiple iterations, releasing and re-taking the lock without contention.
In Skylake and later Intel CPUs, pause blocks the front-end for about 100 cycles, up from about 5 in Broadwell. AFAIK, current AMD CPUs still use a pretty short pause, so I'd expect the effect to be a lot less pronounced on a Ryzen.

(xchg mem,reg throughput is one per 18 cycles on current Intel (https://uops.info/) if done back to back with no other memory ops in between to drain from the store buffer. In our case there's a load and store, but those probably hit in cache since they were allocated sequentially, so the load-use latency is pretty short. Normally linked lists suck because they make a long chain of load latencies.)

So one back-off by another thread lets the thread holding the lock probably complete that iteration and then another 2 or 3 without disturbance, keeping exclusive ownership of the cache line (with Intel's pause time).

When the other thread only checks read-only for availability, it doesn't disturb the lock holder as much, since it can keep the cache line in Shared state.
Both flag and the_list are probably in the same cache line. We could try aligning them both by 128 to avoid that, but it makes no measurable difference. (Cache lines are 64 bytes, but the L2 spatial prefetcher likes to complete an aligned pair of cache lines. If you were going to define std::hardware_destructive_interference_size, 128 would be a good choice for current x86-64.)

Those things speed up the atomic_flag version to match pthread

Just adding _mm_pause() from <immintrin.h> into the while (flag.test_and_set(acquire)) { _mm_pause(); } spin-wait loop speeds it up from about 1.13-1.20 sec to about 0.58 sec on my Skylake i7-6700k (Linux 6.5.3, glibc 2.38). That gives an atomic_flag spinlock that still retries the atomic RMW, but with pause on contention — comparable to pthread_spinlock_t, which also spins read-only with pause but pays function-call overhead.

Adding the read-only test before the spin-retry speeds it up all the way, making it faster than the pthread version, since there's no function-call overhead.

You could experiment with doing 2 or 4 pauses per check, like _mm_pause(); _mm_pause();, to further magnify this effect. Or pin both threads to the same core so they can't contend with each other, like taskset -c 2 ./spinlock_custom vs. -c 1,2 to allow cores #1 and #2. (But pinning both to one core will often mean a context switch while holding the lock, leading to a fully wasted timeslice for the other thread, since we don't sched_yield() even after hundreds of spin iterations. That's why it's actually slightly slower to run with both threads pinned to a single core.)

4x _mm_pause(); makes this hand-rolled spinlock microbenchmark complete another 1.5x faster, since we're trading fairness for throughput, and we know there's another thread that will also be hammering on the lock indefinitely. In the normal case, we'd be aiming for a backoff time after which the other thread will probably be done, or after which this burst of contention has ended. But here it's not a burst, it's constant contention: pausing longer just means taking turns on a coarser time scale, bouncing the cache line back and forth less often. And we have no useful work we could be doing instead of pausing, so the only useful work is serialized, and multithreading + locking makes it much slower; the farther we get from actual multithreading, the better our throughput.

So a benchmark like this would be a poor choice for making tuning decisions for a general-purpose spinlock. It's totally fine for comparing two different implementations to see how they differ and what effect that has on this situation, though. (pthread's choices are normally good in general; that's why they made them. They also happen to help a lot in this artificial case.)
flag.test() is a C++20 feature; I had to compile with g++ -O2 -std=gnu++20 -pthread. In earlier C++ revisions, simply use std::atomic<bool> with .exchange(true, std::memory_order_acquire), .load(std::memory_order_relaxed), and .store(false, std::memory_order_release).
xchg/swap instructions, or an actual test-and-set where the value to be swapped is baked in. Either is sufficient for the operations atomic_flag provides, hence atomic_flag being the only type guaranteed to be always_lock_free. But modern versions of all mainstream ISAs have always_lock_free for power-of-2 types up to at least pointer width, and some for types two pointers wide.

Terminology: a spinlock can't be lock-free by definition
A spinlock is by definition not lock-free: its whole purpose is to set a variable such that other threads have to wait until we're done before they can do anything.
Using lock-free atomic building blocks only results in a lock-free algorithm if you avoid things like spinning indefinitely waiting to see a value stored by another thread. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). There are non-blocking queue implementations, often using a fixed-size circular buffer to avoid the deallocation problem, especially in non-garbage-collected languages like C++.
Rolling your own lock just to see what happens is a valid exercise, just don't call it "lock-free"!