I have some C code that launches processes. For each process it registers some metadata in a linked list of structs. Then, the SIGCHLD handler updates said metadata. The structure is something like this:
// block/unblock utilities
static sigset_t block () {
sigset_t sigchld_mask, old_mask;
sigemptyset(&sigchld_mask);
sigaddset(&sigchld_mask, SIGCHLD);
if(sigprocmask(SIG_BLOCK, &sigchld_mask, &old_mask))
exit_with_error();
return old_mask;
}
static void unblock (sigset_t* old_mask) {
if(sigprocmask(SIG_SETMASK, old_mask, NULL)) exit_with_error();
}
typedef struct ChildProcess {
pid_t pid;
int* status;
} ChildProcess;
// ProcessNode is a linked list node containing process metadata
typedef struct ProcessNode {
ChildProcess* proc;
volatile struct ProcessNode* next;
} ProcessNode;
volatile ProcessNode * head = NULL;
int launch_process ( ... ) {
block(); // block SIGCHLD
... // launch process with posix_spawn()
register_proc(...); // register all metadata
unblock(); // unblock SIGCHLD
}
void register_proc (pid_t pid, int* status) {
ChildProcess* child = (ChildProcess*)malloc(sizeof(ChildProcess));
child->pid = pid;
child->status = status;
volatile ProcessNode* node = (ProcessNode*)malloc(sizeof(ProcessNode));
new_node->proc = child;
new_node->next = head;
head = new_node;
return 0;
}
int update_proc (pid_t pid, int status) {
volatile ProcessNode* cursor = head;
while(cursor != NULL && curr->proc->pid != pid) {
cursor = cursor->next;
}
if(cursor != NULL) {
*(cursor->proc->status) = status;
}
}
void sigchld_handler (int sig) {
int status;
pid_t pid;
while((pid = waitpid(-1, &status, WNOHANG))) {
update_proc(pid, status);
}
}
// API function for checking the status pointer
// if we are to wait for the process to terminate, we pselect until sigchld eventually updates the status pointer
int get_state (int* status, bool wait_for_termination) {
sigset_t old_mask = block(); // block SIGCHLD, store old mask
bool state_unknown = true;
bool done = false;
while(state_unknown) {
if(WIFEXITED(*status) || WIFSIGNALED(*status)) {
done = true;
}
state_unknown = false;
if(wait_for_termination && done) {
if(pselect(0, NULL, NULL, NULL, NULL, &old_mask)) {
if(errno == EINTR) {
state_unknown = true; // check status again on interrupt
}
}
}
else if(!wait_for_termination) {
state_unknown = false;
}
}
unblock(); // unblock SIGCHLD
return 0;
}
int main () {
// ... allocate some memory for other tasks
//Setup SIGCHLD handler
sigset_t sigchld_mask;
sigemptyset(&sigchld_mask);
sigaddset(&sigchld_mask, SIGCHLD);
struct sigaction sa;
sa.sa_handler = sigchld_handler;
sa.sa_mask = sigchld_mask;
sa.sa_flags = SA_RESTART;
sigaction(SIGCHLD, &sa, NULL);
}
Sometimes this works, but it can sometimes fails because update_status fails to find a matching ProcessNode. This happens because head is NULL when sigchld_handler executes -- even if it is non-null before an after. I've elided a lot of detail, but I block SIGCHLD every time head is written or read, and it's volatile. I thought those two things would be sufficient to deal with any race conditions.
Any other ideas for things that could go wrong?
With the correct initialization, you should get one
SIGCHLDfor each terminated process.However ... Consider the following:
You fire off two children: child A and child B
SIGCHLDfor child A)waitpidfor child Awaitpidfor child BSIGCHLDfor child B)waitpidreturns -1 (forECHILD)In the signal handler, the fix is to change:
Into:
Note that
volatileforheadis not necessary because of theblock/unblockcalls.Here is the modified/fixed code. Because I was trying to produce a full diagnostic and get this out quickly, I hacked it up somewhat severely [Sorry]:
In the code above, I've used
cppconditionals to denote old vs. new code:Note: this can be cleaned up by running the file through
unifdef -kHere is the program output with the original bug (compiled with
-DBUG=1):Before I arrived at the above fix, I went through several rounds of experimentation.
Some of the attempts I tried.
There appears to be a race condition in the original above code. The child processes terminate so fast that some signals can be lost.
I added timestamping to show when the handler is invoked.
I added
SA_NODEFER(which did nothing). And ananosleepin the child (which was an "okay" workaround but not the ultimate fix).My best guess is:
SIGCHLDis not a realtime signal.Here is the program output with the [simplified] fix: