Ganglia - gmetad - process is getting terminated by SIGSEGV

I started seeing this issue in the last couple of days. The Ganglia gmetad process gets terminated with SIGSEGV (segfault) within 5 minutes of starting.

It had been stable for the last few months, so I am not sure what changed.

Version - gmetad 3.7.1

I don't see any core dump or anything specific to gmetad in /var/log/messages or /var/log/secure either.

System snapshot (from top) at the time of this event:

load average: 1.97, 0.99, 0.42

Memory also looks fairly OK:

 free -m
             total       used       free     shared    buffers     cached
Mem:          7989       3624       4364          0        333       2562
-/+ buffers/cache:        728       7260
Swap:         4095          0       4095

I have a supervisord process that forks and watches gmetad.

Here is the supervisor log:

2016-10-20 14:34:55,707 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:55,707 INFO received SIGCLD indicating a child quit
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:59,929 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:04,897 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:08,903 INFO spawned: 'gmetad' with pid 24618
2016-10-20 14:35:11,257 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:11,257 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:12,257 INFO gave up: gmetad entered FATAL state, too many start retries too quickly
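
To see how long each gmetad run survived before the segfault, the supervisor log above can be paired up as spawn/exit events. The following is only a minimal sketch: the `run_durations` helper and its regular expression are illustrative names written for the log format shown here, not part of supervisord or Ganglia.

```python
import re
from datetime import datetime

# Log lines copied from the supervisor output above.
LOG = """\
2016-10-20 14:34:55,707 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:55,707 INFO received SIGCLD indicating a child quit
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:59,929 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:04,897 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:08,903 INFO spawned: 'gmetad' with pid 24618
2016-10-20 14:35:11,257 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:11,257 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:12,257 INFO gave up: gmetad entered FATAL state, too many start retries too quickly
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
# Matches only the "spawned" and "exited" lines; other lines are ignored.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO (spawned|exited): '?(\w+)'?"
)


def run_durations(log_text, program="gmetad"):
    """Return the lifetime in seconds of each spawn/exit pair for `program`."""
    durations = []
    started = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(3) != program:
            continue
        timestamp = datetime.strptime(match.group(1), TS_FORMAT)
        if match.group(2) == "spawned":
            started = timestamp
        elif started is not None:
            # An exit with no preceding spawn in the excerpt is skipped.
            durations.append((timestamp - started).total_seconds())
            started = None
    return durations


if __name__ == "__main__":
    for seconds in run_durations(LOG):
        print(f"gmetad ran for {seconds:.3f} s before exiting")
```

In this excerpt every restart dies after roughly two seconds, which is consistent with a crash at the same point early in each run.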

Has anyone faced this kind of issue with gmetad in particular? Appreciate any pointers.

1 answer

Answered by Rishi

I was able to identify the issue and resolve it.

Some key steps/findings -

  1. Change 'debug_level' to > 1 in gmetad.conf to run gmetad in the foreground and print a verbose log of what it is doing.
  2. I found that the gmetad process was getting killed at exactly the same point: when it was trying to process a file for a particular node of a particular data_source.
  3. You can comment out all the other 'data_source' entries in gmetad.conf to isolate which data_source -> node is problematic.
  4. After identifying the problematic node, I just deleted /path/to/rrd/node_dir/file_with_issue (or the entire directory). This loses data, so a better way is needed.
  5. Change debug_level back and restart gmetad.
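
As a complement to the manual isolation in the steps above, a script can walk the RRD directory and flag files that look damaged. This is a rough sketch under stated assumptions: it treats a zero-byte file, or one that `rrdtool info` cannot read, as suspect, and assumes the `rrdtool` binary is on the PATH. A file that crashes gmetad might still pass this check, so it narrows the search rather than replacing the debug run. The function names are illustrative.

```python
import subprocess
from pathlib import Path


def rrd_is_readable(path):
    """Return True if `rrdtool info` exits cleanly for this file.

    Raises FileNotFoundError if the rrdtool binary is not installed.
    """
    try:
        result = subprocess.run(
            ["rrdtool", "info", str(path)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except subprocess.TimeoutExpired:
        # A hang while reading the file is treated as unreadable.
        return False
    return result.returncode == 0


def find_suspect_rrds(rrd_root, is_readable=rrd_is_readable):
    """Return .rrd files under rrd_root that are empty or fail the check."""
    suspects = []
    for path in sorted(Path(rrd_root).rglob("*.rrd")):
        if path.stat().st_size == 0 or not is_readable(path):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    import sys

    # Usage: python find_suspect_rrds.py /path/to/ganglia/rrds
    for suspect in find_suspect_rrds(sys.argv[1]):
        print(suspect)
```

The `is_readable` parameter exists so a different check can be swapped in if `rrdtool info` turns out not to catch the kind of corruption involved.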

In my case, the root cause of the SIGSEGV was a file named 'part_max_used.rrd' under /path/to/ganglia/rrds/node_name.
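
Deleting the offending file loses its history. One gentler option is to move it aside so it can be inspected or restored later. The sketch below assumes gmetad is stopped while the file is moved. The `quarantine_rrd` helper and the quarantine directory are hypothetical names, not Ganglia features.

```python
import shutil
import time
from pathlib import Path


def quarantine_rrd(rrd_file, quarantine_dir):
    """Move rrd_file into quarantine_dir/<node_dir>/ and return the new path.

    A Unix timestamp is appended to the file name so that repeated
    quarantines of the same metric do not overwrite each other.
    """
    src = Path(rrd_file)
    dest_dir = Path(quarantine_dir) / src.parent.name
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.name}.{int(time.time())}"
    shutil.move(str(src), str(dest))
    return dest


if __name__ == "__main__":
    import sys

    # Usage: python quarantine_rrd.py <rrd_file> <quarantine_dir>
    print(quarantine_rrd(sys.argv[1], sys.argv[2]))
```

Keeping the node directory name in the quarantine path records where the file came from, which helps if it later needs to be put back.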

Hope this helps :)