Gearman worker in shell hangs as a zombie


I have a Gearman worker in a shell script started with perp in the following way:

runuid -s gds \
  /usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel \
  -- xargs /home/gds/gds-rel-worker.sh < /dev/null 2>/dev/null

The worker only does some input validation and calls another shell script run.sh that invokes bash, curl, Terragrunt, Terraform, Ansible and gcloud to provision and update resources in GCP like this:

./run.sh --release 1.2.3 2>&1 >> /var/log/gds-release

The script is intended to run unattended. The problem I have is that after the job finishes successfully (that is, both shell scripts run.sh and gds-rel-worker.sh complete), the Gearman job remains executing, because the child process becomes a zombie (see the last line below).

root      144748       1  0 Apr29 ?        00:00:00 perpboot -d /etc/perp
root      144749  144748  0 Apr29 ?        00:00:00  \_ tinylog -k 8 -s 100000 -t -z /var/log/perp/perpd-root
root      144750  144748  0 Apr29 ?        00:00:00  \_ perpd /etc/perp
root     2492482  144750  0 May14 ?        00:00:00      \_ tinylog (gearmand) -k 10 -s 100000000 -t -z /var/log/perp/gearmand
gearmand 2492483  144750  0 May14 ?        00:00:08      \_ /usr/sbin/gearmand -L 127.0.0.1 -p 4730 --verbose INFO --log-file stderr --keepalive --keepalive-idle 120 --keepalive-interval 120 --keepalive-count 3 --round-robin --threads 36 --worker-wakeup 3 --job-retries 1
root     2531800  144750  0 May14 ?        00:00:00      \_ tinylog (gds-rel-worker) -k 10 -s 100000000 -t -z /var/log/perp/gds-rel-worker
gds      2531801  144750  0 May14 ?        00:00:00      \_ /usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel -- xargs /home/gds/gds-rel-worker.sh
gds      2531880 2531801  0 May14 ?        00:00:00          \_ [xargs] <defunct>

So far I have traced the problem to run.sh, because if I replace its call with something simpler (e.g. echo "Hello"; sleep 5) the worker does not hang. Unfortunately, I have no clue what is causing the problem. The script run.sh is rather long and complex, but has been working without a problem so far. Tracing the worker process with strace I see this:

getpid()                                = 2531801
write(2, "gearman: ", 9)                = 9
write(2, "gearman_worker_work", 19)     = 19
write(2, " : ", 3)                      = 3
write(2, "gearman_wait(GEARMAN_TIMEOUT) ti"..., 151) = 151
write(2, "\n", 1)                       = 1
sendto(5, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(5, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
sendto(5, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}], 2, 1000) = 1 ([{fd=5, revents=POLLIN}])
sendto(5, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(5, "\0RES\0\0\0\6\0\0\0\0\0RES\0\0\0(\0\0\0QH:terra-"..., 8192, MSG_NOSIGNAL, NULL, NULL) = 105
pipe([6, 7])                            = 0
pipe([8, 9])                            = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fea38480a50) = 2531880
close(6)                                = 0
close(9)                                = 0
write(7, "1.2.3\n", 18)                 = 6
close(7)                                = 0
read(8, "which: no terraform-0.14 in (/us"..., 1024) = 80
read(8, "Identity added: /home/gds/.ssh/i"..., 1024) = 54
read(8, 0x7fff6251f5b0, 1024)           = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2531880, si_uid=1006, si_status=0, si_utime=0, si_stime=0} ---
read(8,

So the worker continues reading standard output even though the child has exited successfully and presumably closed it. Any ideas on how to find what causes this problem?

1 answer

Answer by Georgi D. Sotirov:

I was able to solve it. The script run.sh was starting ssh-agent, which keeps running in the background and holds the script's output descriptors open; since Gearman redirects all outputs, the worker continued reading the open file descriptor even after the script had completed successfully.
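The mechanism can be reproduced with a minimal sketch (a plain `sleep` stands in for ssh-agent here): any background child that inherits the script's stdout keeps the pipe's write end open, so the reader never sees EOF until that child exits.

```shell
#!/bin/sh
# Minimal repro (sleep stands in for ssh-agent): a background child that
# inherits stdout keeps the pipe's write end open, so the reader blocks
# until the child exits -- the same hang the worker showed above.

t0=$(date +%s)
( sleep 2 & echo done ) | cat >/dev/null
d_held=$(( $(date +%s) - t0 ))      # ~2s: cat waited for sleep to exit

# Pointing the child's descriptors away from the pipe releases the reader
# as soon as the script itself finishes:
t0=$(date +%s)
( sleep 2 >/dev/null 2>&1 & echo done ) | cat >/dev/null
d_free=$(( $(date +%s) - t0 ))      # ~0s: EOF arrives immediately

echo "held=${d_held}s free=${d_free}s"
```

The second variant shows why redirecting a daemonized child's descriptors (or killing it) releases the reader: once no process holds the write end, read() returns EOF.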

I found it by examining the open file descriptors of the Gearman worker process after it hung:

# ls -l /proc/2531801/fd/*
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/0 -> /dev/null
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/1 -> 'pipe:[9356665]'
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/2 -> 'pipe:[9356665]'
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/3 -> 'pipe:[9357481]'
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/4 -> 'pipe:[9357481]'
lrwx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/5 -> 'socket:[9357482]'
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/8 -> 'pipe:[9369888]'

Then I identified the processes holding the pipe behind file descriptor 8 (inode 9369888), which the Gearman worker continued reading:

# lsof | grep 9369888
gearman   2531801                              gds    8r     FIFO               0,13      0t0    9369888 pipe
ssh-agent 2531899                              gds    9w     FIFO               0,13      0t0    9369888 pipe

And finally I listed the files opened by ssh-agent and found what stands behind its file descriptor 3:

# ls -l /proc/2531899/fd/*
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/0 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/1 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/2 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/3 -> 'socket:[9346577]'
# lsof | grep 9346577
ssh-agent 2531899                              gds    3u     unix 0xffff89016fd34000      0t0    9346577 /tmp/ssh-0b14coFWhy40/agent.2531898 type=STREAM

As a solution, I added a kill of the ssh-agent before exiting the run.sh script, and now no jobs hang because of zombie processes.
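The fix can be sketched as follows (a plain `sleep` stands in for ssh-agent so the sketch runs anywhere; with the real agent you would use `ssh-agent -k`, which reads the `SSH_AGENT_PID` that `eval "$(ssh-agent -s)"` exports):

```shell
#!/bin/sh
# Sketch of the fix in run.sh: kill the agent before the script exits,
# so it no longer holds the write end of the worker's pipe.

sleep 300 &                              # stand-in for: eval "$(ssh-agent -s)"
AGENT_PID=$!                             # the real agent exports SSH_AGENT_PID

# ... provisioning work (terragrunt, terraform, ansible, gcloud) ...

kill "$AGENT_PID"                        # with the real agent: ssh-agent -k
wait "$AGENT_PID" 2>/dev/null || true    # reap it so it leaves no zombie
echo "agent stopped"
```

A more robust variant is `trap 'kill "$AGENT_PID"' EXIT` right after starting the agent, since the EXIT trap also fires when the script fails midway.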