I started seeing this issue in the last couple of days: the Ganglia gmetad process gets terminated with SIGSEGV (segfault) within 5 minutes of starting.
It had been stable for the last few months, so I'm not sure what changed.
Version - gmetad 3.7.1
I don't see a core dump, or anything specific to gmetad in /var/log/messages or /var/log/secure.
System snapshot (from top) at the time of the event:
load average: 1.97, 0.99, 0.42
Memory also looks fairly OK:
free -m
             total       used       free     shared    buffers     cached
Mem:          7989       3624       4364          0        333       2562
-/+ buffers/cache:        728       7260
Swap:         4095          0       4095
I have a supervisord process that forks and watches gmetad.
Here is the supervisor log:
2016-10-20 14:34:55,707 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:55,707 INFO received SIGCLD indicating a child quit
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:59,929 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:04,897 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:08,903 INFO spawned: 'gmetad' with pid 24618
2016-10-20 14:35:11,257 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:11,257 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:12,257 INFO gave up: gmetad entered FATAL state, too many start retries too quickly
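To see how quickly each restart dies, the spawn and exit timestamps in a log like the one above can be paired up. This is only a rough sketch: the function name `crash_lifetimes` and the regular expression are made up for illustration, and it assumes the log keeps the line format shown here.

```python
import re
from datetime import datetime

# A few lines copied from the supervisor log above.
LOG = """\
2016-10-20 14:34:55,707 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
"""

# Hypothetical pattern: timestamp, then "spawned:" or "exited:", then the program name.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO (spawned|exited): '?(\w+)'?"
)

def crash_lifetimes(log_text, program="gmetad"):
    """Return the seconds between each spawn of `program` and its next exit."""
    lifetimes = []
    spawned_at = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m or m.group(3) != program:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if m.group(2) == "spawned":
            spawned_at = ts
        elif spawned_at is not None:
            # An exit with no preceding spawn in the excerpt is skipped.
            lifetimes.append((ts - spawned_at).total_seconds())
            spawned_at = None
    return lifetimes

for seconds in crash_lifetimes(LOG):
    print(f"gmetad lived {seconds:.3f}s before exiting")
```

On the excerpt above, each restarted gmetad lasts only about two seconds, which is why supervisor gives up and marks it FATAL.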
Has anyone faced this kind of issue with gmetad in particular? Appreciate any pointers.
I was able to identify the issue and resolve it.
Some key steps/findings -
In my case, the root cause of the SIGSEGV turned out to be a single RRD file, 'part_max_used.rrd', under /path/to/ganglia/rrds/node_name.
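If you need to hunt for a problem file in your own rrds tree, one possible starting point is a scan for files that are obviously damaged. This is a sketch under assumptions, not the method used above: it assumes the bad file is empty or is missing the "RRD" cookie that RRDtool writes at the start of a file, and the function name `find_suspect_rrds` is made up for illustration.

```python
import os

# RRDtool files begin with the bytes "RRD" followed by a NUL.
RRD_MAGIC = b"RRD\x00"

def find_suspect_rrds(root):
    """Return (path, reason) pairs for .rrd files under `root` that are
    unreadable, empty, or do not start with the RRD cookie."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    head = fh.read(len(RRD_MAGIC))
            except OSError as exc:
                suspects.append((path, f"unreadable: {exc}"))
                continue
            if not head:
                suspects.append((path, "empty file"))
            elif head != RRD_MAGIC:
                suspects.append((path, "missing RRD header cookie"))
    return suspects

if __name__ == "__main__":
    # Placeholder path from the post; substitute your real rrds directory.
    for path, reason in find_suspect_rrds("/path/to/ganglia/rrds"):
        print(f"{path}: {reason}")
```

A file can pass this check and still be broken further in, so it only narrows the search. Running `rrdtool info` on a candidate file, or moving files aside one node directory at a time and restarting gmetad, are other ways to confirm which one is at fault.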
Hope this helps :-)