I implemented Morris's algorithm for a no-starve mutex from the Little Book of Semaphores (page 85) almost verbatim. About half the time, it terminates correctly, and the other half it freezes up mid-execution. This indicates to me that there is some race condition, but I can't identify it.
I found other versions of Morris's algorithm pseudocode online, and they appear identical to what's in the Little Book of Semaphores. My current guess is that semaphores in C++ may work a little differently from semaphores in research and in other languages, but I'm not sure. What is causing this race condition?
#include <mutex>
#include <semaphore>
#include <thread>
#include <vector>
#include <iostream>
#include <chrono>
#include <syncstream>
std::binary_semaphore t1(1);
std::binary_semaphore t2(0);
std::mutex mutex;
int room1 = 0;
int room2 = 0;
void Morris(std::stop_token stop_token){
while (!stop_token.stop_requested()){
mutex.lock();
room1 += 1;
mutex.unlock();
t1.acquire();
room2 += 1;
mutex.lock();
room1 -= 1;
if (room1 == 0){
mutex.unlock();
t2.release();
} else {
mutex.unlock();
t1.release();
}
t2.acquire();
room2 -= 1;
// critical region
{
std::osyncstream synced_out(std::cout);
synced_out << "critical region of thread " << std::hex << std::this_thread::get_id() << "\n";
}
// end critical region
if (room2 == 0){
t1.release();
} else {
t2.release();
}
}
}
int main(){
std::vector<std::jthread> threads;
for (int i = 0; i < 50; ++i){
threads.emplace_back(Morris);
}
std::this_thread::sleep_for(std::chrono::seconds(5));
for (std::jthread& thread: threads){
thread.request_stop();
}
}
Commenter have indicated that they aren't experiencing freezing on Windows, so I'm including my computer information here:
compiler: g++12.3.0 -std=c++20 -lpthread -o
pc: Ubuntu 22.04.3 LTS, Intel Core i7-6700K, 4.00GHz x 8 (4 cores, 8 threads). processor is 64-bit x86
I am adding the backtrace of the last thread to print since it was asked for in the comments:
(gdb) thread 11
[Switching to thread 11 (Thread 0x7fffde7fc640 (LWP 115600))]
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) bt
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000555555558fd7 in std::__detail::__platform_wait<int> (__addr=0x5555555611a0 <t1>, __val=0) at /usr/include/c++/12/bits/atomic_wait.h:108
#2 0x000055555555906c in std::__atomic_wait_address_bare<std::__atomic_semaphore::_M_acquire()::{lambda()#1}>(int const*, std::__atomic_semaphore::_M_acquire()::{lambda()#1}) (
__addr=0x5555555611a0 <t1>, __pred=...) at /usr/include/c++/12/bits/atomic_wait.h:444
#3 0x000055555555921c in std::__atomic_semaphore::_M_acquire (this=0x5555555611a0 <t1>) at /usr/include/c++/12/bits/semaphore_base.h:215
#4 std::counting_semaphore<1l>::acquire (this=0x5555555611a0 <t1>) at /usr/include/c++/12/semaphore:72
#5 0x0000555555557806 in Morris (stop_token=...) at /home/me/Desktop/nostarve/nostarve2.cpp:21
#6 0x000055555555be50 in std::__invoke_impl<void, void (*)(std::stop_token), std::stop_token> (__f=@0x555555574df0: 0x55555555779f <Morris(std::stop_token)>) at /usr/include/c++/12/bits/invoke.h:61
#7 0x000055555555bda1 in std::__invoke<void (*)(std::stop_token), std::stop_token> (__fn=@0x555555574df0: 0x55555555779f <Morris(std::stop_token)>) at /usr/include/c++/12/bits/invoke.h:96
#8 0x000055555555bd01 in std::thread::_Invoker<std::tuple<void (*)(std::stop_token), std::stop_token> >::_M_invoke<0ul, 1ul> (this=0x555555574de8) at /usr/include/c++/12/bits/std_thread.h:279
#9 0x000055555555bcb6 in std::thread::_Invoker<std::tuple<void (*)(std::stop_token), std::stop_token> >::operator() (this=0x555555574de8) at /usr/include/c++/12/bits/std_thread.h:286
#10 0x000055555555bb8a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::stop_token), std::stop_token> > >::_M_run (this=0x555555574de0)
at /usr/include/c++/12/bits/std_thread.h:231
#11 0x00007ffff7cdc253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007ffff7894ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#13 0x00007ffff7926660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Some things that I have tried that have not worked:
- switching the
binary_semaphores out forcounting_semaphores - making
room1androom2std::atomic<int>
I believe you have hit GCC bug 104928, a race condition in the libstdc++ implementation of
std::counting_semaphore::acquire(). Congratulations, this is one of those rare occasions when it isn't your fault. ;-)I found that your test case is much easier to reproduce if you reduce the number of threads from 50 down to just 2. I was able to actually step through the code in the deadlock case and observe the problem for myself, at which point I went searching for "semaphore" in the GCC Bugzilla and found it already reported.
To explain what goes wrong: in the function
_S_do_spinin<bits/atomic_wait.h>, we first load the current value of the semaphore's counter into__val. If we are going to sleep, we will sleep (usingfutex) until the counter no longer equals this value.Then we call
pred()(a few times in a spin loop) to see if the desired predicate holds (i.e. if the semaphore value is nonzero) or if we should go to sleep.pred()eventually calls_S_do_try_acquirein<bits/semaphore_base.h>which loads the counter again.And now you see the problem. The value of the counter may have changed in between, if another thread has acquired the semaphore.
So suppose thread A calls
acquire()when the semaphore has a value of 1. Then the first load sets__val = 1. Suppose that, at this point, thread B acquires the semaphore. Then when thread A calls_S_do_try_acquire, it observes the counter having value 0 and returnsfalse. So we are now going to invoke the futex wait system call with the value 1, i.e. saying to wait untilcounter != 1, which is the opposite of what we want.If thread B still holds the semaphore at this point, then the futex wait will return immediately and no harm is done except for a wasted system call. But suppose that thread B releases the semaphore first, setting the counter back to 1. The semaphore should now be available for A to take, but instead the futex system call will wait indefinitely until its value is no longer 1, i.e. until some other thread acquires or releases the semaphore. Of course, this may not ever happen.
Disappointingly, bug 104928 was reported back in March 2022 and appears to have received no attention at all. I'll try to nudge it.