I have 3 machines in my HDFS cluster: pr2 is the meta node, and pr1 and pr3 are the other 2 nodes. I want to monitor them with ganglia web. To that end, Ganglia (with both gmetad and gmond) has been installed on all 3 nodes, and ganglia web has been installed on the meta node, pr2.
I am running the following:
On pr2:
service gmetad start
service gmond start
service httpd start
On pr1 and pr3:
service gmond start
I expect to see information about all 3 nodes in ganglia web. But this is what I see: only 1 node seems to be running, which is pr2.
What could be the reason?
I have explored the following possibilities when configuring the setup:
- In the gmond configuration file, there is a place to specify the "host machine". What should the "host" be here? For example, for node pr1, should it be the local machine (pr1) or the meta node (pr2)? I have tried both pr1 and pr2 separately, and I still can't see info for pr1 or pr3 on ganglia web. Also, do I need to add anything more in this section?
[root@pr1]# nano /usr/local/etc/gmond.conf
/* The host section describes attributes of the host, like the location */
host {
location = "unspecified" ### Should I replace "unspecified" with pr2 (the meta node), or with pr1? ###
}
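While rereading the sample file, I got the impression from its comments that location is meant to describe the machine's physical position (e.g. rack coordinates), not to name another node. This is only my reading, not something I've verified against the docs:

```
/* my understanding: location is a physical-position label, not a hostname,
   so on pr1 I would leave it descriptive, e.g. rack 0, position 1 */
host {
  location = "rack0,1"
}
```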
- Also, in the same configuration file, there is a section for configuring udp_send_channel. There is a parameter mcast_join, which I think is meant to specify the meta node, so I changed the IP to pr2's. But since ganglia web is not showing info for pr1 and pr3, I tried changing it to pr1 and pr3 respectively, but to no avail.
[root@pr1]# nano /usr/local/etc/gmond.conf
udp_send_channel {
#bind_hostname = yes
mcast_join = 239.2.11.71 ### Should I change this IP to pr2 or pr1? ###
port = 8649
ttl = 1
}
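In case it matters: while searching, I also came across a unicast setup in which mcast_join is replaced by a host directive pointing straight at the collector. A sketch of what I understand that would look like on pr1 and pr3 (assuming pr2 is the collector; I haven't confirmed this fixes my problem):

```
/* unicast alternative: send metrics directly to the meta node instead of
   a multicast group (used on pr1 and pr3) */
udp_send_channel {
  host = pr2
  port = 8649
  ttl = 1
}
```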
- Again, in the same file, there's a section for configuring udp_recv_channel. I have commented out the mcast_join and bind here, since I thought the pr1 and pr3 nodes do not need to receive any info from anything; they only need to send info to the meta node, pr2. Am I wrong?
[root@pr1]# nano /usr/local/etc/gmond.conf
udp_recv_channel {
#mcast_join = 239.2.11.71
port = 8649
#bind = 239.2.11.71
retry_bind = true
# Size of the UDP buffer. If you are handling lots of metrics you really
# should bump it up to e.g. 10MB or even higher.
# buffer = 10485760
}
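My understanding (which may be wrong) is that whichever gmond gmetad polls must still be able to receive the other nodes' packets and answer gmetad's TCP requests, so on pr2 I left those sections enabled:

```
/* on pr2 only: keep receiving metrics and answering XML requests */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
tcp_accept_channel {
  port = 8649   /* gmetad connects here to pull the cluster XML */
}
```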
- Then I run the following smoke test:
gmond -d 5 -c /usr/local/etc/gmond.conf
and it returns the following (showing the tail part of the long output):
...
...
Counting device /dev/mapper/centos-root (2.69 %)
Counting device /dev/mapper/centos-home (0.02 %)
Counting device /dev/sda1 (18.49 %)
For all disks: 206.174 GB total, 204.498 GB free for users.
metric 'part_max_used' has value_threshold 1.000000
sent message 'heartbeat' of length 48 with 0 errors
Processing a metric value message from pr1.embs
Got a heartbeat message 1511171541
This means that gmond is working correctly, am I right?
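To double-check beyond the debug output, I also wrote a small script that connects to a gmond's TCP port and lists the HOST entries in the XML it returns (the helper names are my own, and I'm assuming the default XML port 8649):

```python
import socket
import xml.etree.ElementTree as ET

def hosts_from_gmond_xml(xml_text):
    """Return the NAME attribute of every <HOST> element in a gmond XML dump."""
    root = ET.fromstring(xml_text)
    return [host.get("NAME") for host in root.iter("HOST")]

def fetch_gmond_xml(host, port=8649):
    """Connect to gmond's tcp_accept_channel and read the full XML dump."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# Usage on the meta node (I would expect all 3 hosts to be listed):
#   print(hosts_from_gmond_xml(fetch_gmond_xml("pr2")))
```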
- Then I configure the gmond init file (for calling the command) and make the following 2 changes: (i) point GMOND to the correct path, and (ii) point the daemon to the correct path of the gmond.conf file. Could I possibly be doing anything wrong here?
[root@pr1]# nano /etc/init.d/gmond
### Change here ###
#GMOND=/usr/sbin/gmond
GMOND=/usr/local/sbin/gmond
# TMPDIR set to SysV IPC ramdrive to avoid include processing failures
TMPDIR=/dev/shm
export TMPDIR
. /etc/rc.d/init.d/functions
start() {
[ -x $GMOND ] || exit 5
[ -f /etc/ganglia/gmond.conf ] || exit 6
echo -n "Starting GANGLIA gmond: "
### Change here ###
#daemon $GMOND
daemon $GMOND -c /usr/local/etc/gmond.conf
- Then I modify the gmetad.conf file to change the data_source and add the cluster name and the user name. Here, should I change the data_source to the local machine (pr1 or pr3), or to the meta node (pr2)?
[root@pr1]# nano /usr/local/etc/gmetad.conf
===(change)
data_source "edhcluster" pr1 ### Should I change this to pr2, the meta node?
===
===(add)
gridname "edhcluster"
===
===(add)
setuid_username "ganglia"
===
Or, an even bigger question: should I leave this file unchanged on the nodes pr1 and pr3 (as only pr2 is the meta node that will monitor everything), and instead change only the gmetad.conf file of the meta node, pr2, so that its data_source includes all 3 nodes, like so:
[root@pr2]# nano /usr/local/etc/gmetad.conf
data_source "edhcluster" pr1 pr2 pr3
But then, while browsing to solve this problem, I read that writing the node names like this means gmetad will first check pr1; if it is down, it'll check pr2; and if that is down too, it'll check pr3. So, if we want it to collect data from all 3 machines, we should write a separate line for each node, like so:
[root@pr2]# nano /usr/local/etc/gmetad.conf
data_source "edhcluster" pr1
data_source "edhcluster" pr2
data_source "edhcluster" pr3
I tried this as well, and then restarted all the services on all the nodes (gmetad, gmond and httpd on pr2, and gmond on pr1 and pr3), but ganglia web still doesn't show anything for the other 2 nodes.
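For reference, the format I pieced together from the comments in gmetad.conf is the following; the 15-second polling interval is just the documented default, and I haven't verified that any of this changes what I see:

```
# data_source "clustername" [polling interval] host1[:port] host2[:port] ...
# my reading: extra hosts on one line are failover sources for the SAME
# cluster, not additional machines to merge in
data_source "edhcluster" 15 pr2:8649
```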
- Then I modified the gmetad init file (for calling the command) on pr1 to make the GMETAD daemon point to the correct path of the gmetad.conf file. However, I suspect this might be wrong, as the gmetad of pr2 should handle this. So, am I correct in identifying this as the mistake? If yes, how do I make the gmetad daemon point to the path of the config file on another node, the meta node, pr2?
[root@pr1]# nano /etc/init.d/gmetad
### Change here ###
#GMETAD=/usr/sbin/gmetad
GMETAD=/usr/local/sbin/gmetad
. /etc/rc.d/init.d/functions
test -f /etc/sysconfig/gmetad && . /etc/sysconfig/gmetad
export RRDCACHED_ADDRESS
start() {
[ -x $GMETAD ] || exit 5
[ -f /etc/ganglia/gmetad.conf ] || exit 6
echo -n "Starting GANGLIA gmetad: "
### Change here ###
#daemon $GMETAD
daemon $GMETAD -c /usr/local/etc/gmetad.conf
### If this means that the gmetad of the local system
### will monitor this node, how do I make the daemon
### point to the gmetad.conf file of the meta node, pr2?
RETVAL=$?
echo
[ $RETVAL -eq 0 ] && touch /var/lock/subsys/gmetad
return $RETVAL
}
Thanks for going through all the details of the configuration steps. If I missed or overlooked any other possible causes, please feel free to point them out.