Why does my MCS lock perform worse than a simple spinlock?


This LWN article claims that:

Spinlocks still suffer from a fundamental problem beyond the fact that simply spinning for a lock can be painful, though: every attempt to acquire a lock requires moving the cache line containing that lock to the local CPU. For contended locks, this cache-line bouncing can hurt performance significantly.

...

By expanding a spinlock into a per-CPU structure, an MCS lock is able to eliminate much of the cache-line bouncing experienced by simpler locks, especially in the contended case.

To see the effect, I wrote a benchmark. The full code can be found here.

Implementation of simple spinlock

The code is modified from https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/.

This lock is labeled spinlock (amd) in the benchmark results.

pub struct RawSpinlock {
    locked: AtomicBool,
}

unsafe impl RawMutex for RawSpinlock {
    // ...

    fn lock(&self) {
        loop {
            // Test-and-test-and-set: spin on a plain load and only attempt
            // the read-modify-write once the lock looks free, so waiters
            // mostly read the shared cache line instead of writing it.
            let was_locked = self.locked.load(Ordering::Relaxed);
            if !was_locked
                && self
                    .locked
                    .compare_exchange_weak(was_locked, true, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                break;
            }
            spin_loop()
        }
    }

    unsafe fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
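For reference, here is a self-contained version of the same test-and-test-and-set lock that compiles with std alone; the `lock_api`-style trait and the elided trait items are dropped, and the `Spinlock` type name and the counter demo in `main` are mine, not part of the benchmark code:

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Standalone test-and-test-and-set spinlock, same algorithm as above.
struct Spinlock {
    locked: AtomicBool,
}

impl Spinlock {
    const fn new() -> Self {
        Spinlock { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        loop {
            // Spin on a plain load first; only do the read-modify-write
            // when the lock looks free.
            if !self.locked.load(Ordering::Relaxed)
                && self
                    .locked
                    .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                return;
            }
            spin_loop();
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    // UnsafeCell wrapper so a plain u64 can be shared; the spinlock
    // provides the actual mutual exclusion.
    struct Shared(UnsafeCell<u64>);
    unsafe impl Sync for Shared {}

    static LOCK: Spinlock = Spinlock::new();
    static COUNTER: Shared = Shared(UnsafeCell::new(0));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..10_000 {
                    LOCK.lock();
                    unsafe { *COUNTER.0.get() += 1 };
                    LOCK.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(unsafe { *COUNTER.0.get() }, 80_000);
    println!("counter = {}", unsafe { *COUNTER.0.get() });
}
```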

Implementation of MCS lock

The code is modified from https://gitlab.com/numa-spinlock/numa-spinlock, which comes from https://sourceware.org/pipermail/libc-alpha/2019-January/100345.html.

This lock is labeled spinlock (mcs) in the benchmark results.

struct Node {
    next: AtomicPtr<Self>,
    locked: AtomicBool,
}

#[thread_local]
static mut NODE: Node = Node {
    next: AtomicPtr::new(null_mut()),
    locked: AtomicBool::new(false),
};

pub struct RawSpinlock {
    tail: AtomicPtr<Node>,
}

unsafe impl RawMutex for RawSpinlock {
    // ...

    fn lock(&self) {
        unsafe {
            // Reset this thread's queue node before enqueuing it.
            NODE.next = AtomicPtr::new(null_mut());
            NODE.locked = AtomicBool::new(false);
        }

        let node = unsafe { &mut NODE as *mut _ };
        // Atomically append our node to the tail of the waiter queue.
        let prev = self.tail.swap(node, Ordering::Acquire);

        if prev.is_null() {
            // The queue was empty: we own the lock immediately.
            return;
        }

        // Link behind the predecessor, then spin only on our own node's flag.
        unsafe { (*prev).next.store(node, Ordering::Relaxed) };
        while unsafe { !NODE.locked.load(Ordering::Acquire) } {
            spin_loop();
        }
    }

    unsafe fn unlock(&self) {
        let node = &mut NODE as *mut _;
        let mut next = NODE.next.load(Ordering::Relaxed);

        if next.is_null() {
            // No visible successor: try to swing the tail back to null.
            if self
                .tail
                .compare_exchange(node, null_mut(), Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            // A successor swapped itself into the tail but hasn't linked
            // behind us yet; wait for the link to appear.
            loop {
                next = NODE.next.load(Ordering::Relaxed);
                if !next.is_null() {
                    break;
                }
                spin_loop();
            }
        }

        // Hand the lock to the successor.
        unsafe { (*next).locked.store(true, Ordering::Release) };
    }
}
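As a sanity check, the same MCS queue idea can also be written without the nightly `#[thread_local]` attribute by keeping the queue node on the locking thread's stack and exposing a closure-based API. This is a std-only sketch of my own; the `with_lock` name and the stricter `AcqRel`/`Release` orderings are my choices, not taken from the code above:

```rust
use std::cell::UnsafeCell;
use std::ptr::null_mut;
use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering};
use std::thread;

struct Node {
    next: AtomicPtr<Node>,
    locked: AtomicBool,
}

struct McsLock {
    tail: AtomicPtr<Node>,
}

impl McsLock {
    const fn new() -> Self {
        McsLock { tail: AtomicPtr::new(null_mut()) }
    }

    // Run `f` under the lock. The queue node lives on this thread's stack,
    // so it stays valid for the whole acquire/release sequence.
    fn with_lock<R>(&self, f: impl FnOnce() -> R) -> R {
        let node = Node {
            next: AtomicPtr::new(null_mut()),
            locked: AtomicBool::new(false),
        };
        let node_ptr = &node as *const Node as *mut Node;

        // AcqRel: Release publishes our node's initialization,
        // Acquire synchronizes with the previous holder's unlock.
        let prev = self.tail.swap(node_ptr, Ordering::AcqRel);
        if !prev.is_null() {
            // Link behind the predecessor, then spin on our own flag only.
            unsafe { (*prev).next.store(node_ptr, Ordering::Release) };
            while !node.locked.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
        }

        let result = f();

        // Unlock: hand off to the successor, if any.
        let mut next = node.next.load(Ordering::Acquire);
        if next.is_null() {
            // No visible successor: try to swing the tail back to null.
            if self
                .tail
                .compare_exchange(node_ptr, null_mut(), Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return result;
            }
            // A successor won the race on `tail`; wait for it to link in.
            loop {
                next = node.next.load(Ordering::Acquire);
                if !next.is_null() {
                    break;
                }
                std::hint::spin_loop();
            }
        }
        unsafe { (*next).locked.store(true, Ordering::Release) };
        result
    }
}

fn main() {
    struct Shared(UnsafeCell<u64>);
    unsafe impl Sync for Shared {}

    static LOCK: McsLock = McsLock::new();
    static COUNTER: Shared = Shared(UnsafeCell::new(0));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..10_000 {
                    LOCK.with_lock(|| unsafe { *COUNTER.0.get() += 1 });
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(unsafe { *COUNTER.0.get() }, 80_000);
    println!("counter = {}", unsafe { *COUNTER.0.get() });
}
```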

Benchmark results

My CPU is a Ryzen 7 3700X. In the benchmark, I spawned 16 threads, all of them contending for 2 locks. Here's the result:

> cargo run --release -- 16 2 10000 100
Options {
    n_threads: 16,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

spinlock (amd)       avg 6.242695ms   min 4.784792ms   max 10.356368ms
spinlock (mcs)       avg 19.968031ms  min 13.980298ms  max 60.338753ms

We can see that the MCS lock performed significantly worse than the simple spinlock.

Even if I relax the parameters to 16 threads and 32 locks, or even more locks, to simulate lighter contention, I still can't observe any advantage from the MCS lock.

Did I do anything wrong, or are MCS locks just slow?


There are 0 answers