Node state=down with TORQUE v6.1.0 on a Workstation


I am installing Torque 6.1.0 on an Ubuntu 16.04 workstation, but the installation does not seem to detect how many cores and threads the machine has. The only node I set up shows "state=down", and every job fails with the error "not enough of the right type of nodes". The workstation has 2 processors with 28 physical cores (56 threads) in total, and I want to use only 27 cores (54 threads) for shared computing jobs. I suspect this is related to the cgroup or NUMA configuration introduced in Torque 6.0, which I am not sure I set up correctly during installation. I did enable cgroups, but I am not sure whether I also need to enable NUMA-aware support. Below are some outputs from the current configuration. What should I do? Thanks.

$ pbsnodes
node1
 state = down
 power_state = Running
 np = 54
 ntype = cluster
 mom_service_port = 15002
 mom_manager_port = 15003
 total_sockets = 0
 total_numa_nodes = 0
 total_cores = 0
 total_threads = 0
 dedicated_sockets = 0
 dedicated_numa_nodes = 0
 dedicated_cores = 0
 dedicated_threads = 0


$ lssubsys -am
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
blkio /sys/fs/cgroup/blkio
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
pids /sys/fs/cgroup/pids

Something else looks suspicious: the server does not seem to recognize the node I have already defined in the server's configuration file. This shows up in the log under /var/spool/torque/server_logs:

12/27/2016 15:48:33.147;01;PBS_Server.2692;Svr;PBS_Server;LOG_ERROR::get_node_from_str, Node node1 is reporting on node NapaValley, which pbs_server doesn't know about
12/27/2016 15:49:18.232;01;PBS_Server.2692;Svr;PBS_Server;LOG_ERROR::get_node_from_str, Node node1 is reporting on node NapaValley, which pbs_server doesn't know about
12/27/2016 15:49:25.491;08;PBS_Server.2696;Job;0.NapaValley;Job deleted at request of cquic@localhost
12/27/2016 15:49:27.023;08;PBS_Server.2657;Job;0.NapaValley;on_job_exit valid pjob: 0.NapaValley (substate=59)
12/27/2016 15:49:32.996;256;PBS_Server.2657;Job;0.NapaValley;dequeuing from batch, state COMPLETE
12/27/2016 15:49:59.722;256;PBS_Server.2696;Job;1.NapaValley;enqueuing into batch, state 1 hop 1
12/27/2016 15:49:59.722;08;PBS_Server.2696;Job;perform_commit_work;job_id: 1.NapaValley
12/27/2016 15:49:59.722;02;PBS_Server.2696;node;close_conn;Closing connection 9 and calling its accompanying function on close
12/27/2016 15:49:59.795;64;PBS_Server.2692;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
12/27/2016 15:49:59.796;08;PBS_Server.2692;Job;1.NapaValley;Job Modified at request of root@localhost
12/27/2016 15:50:03.312;01;PBS_Server.2696;Svr;PBS_Server;LOG_ERROR::get_node_from_str, Node node1 is reporting on node NapaValley, which pbs_server doesn't know about

In my /etc/hosts, I have

127.0.0.1 localhost node1
127.0.0.1 NapaValley

PS: I have tried mounting cpu and the other cgroup subsystems under the /var/spool/torque/cgroup directories, but lssubsys -am still shows the same output as above. I assume they should have been mounted there?


1 Answer

clusterdude

A node reports to the server under the name returned by the gethostbyname call. Based on the log lines you posted, the server and the node don't agree on that name: the server knows the node as node1, while the reports name NapaValley. You can have pbs_mom return a different name by starting it with the -H option:

http://docs.adaptivecomputing.com/torque/6-0-2/adminGuide/help.htm#topics/torque/commands/pbs_mom.htm#-h

"-H hostname Sets the MOM's hostname. This can be useful on multi-homed networks."

This is equivalent to setting $mom_host node1 in /var/spool/torque/mom_priv/config.
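
To confirm that the name mismatch is what keeps the node down, it can help to pull the two names out of the server log and compare them with what the machine calls itself. Below is a minimal Python sketch of that check. The function name find_name_mismatches and the regular expression are my own, written only against the error text quoted in the question; which of the two names comes from the nodes file and which from the MOM is my reading of that message, not something taken from the Torque documentation.

```python
import re
import socket

# Matches the pbs_server error quoted in the question. The group names
# ("known", "reported") reflect an assumed reading of the message.
MISMATCH = re.compile(
    r"Node (?P<known>[^\s,]+) is reporting on node (?P<reported>[^\s,]+), "
    r"which pbs_server doesn't know about"
)


def find_name_mismatches(log_text):
    """Return the set of (known, reported) name pairs found in a server log."""
    return {
        (m.group("known"), m.group("reported"))
        for m in MISMATCH.finditer(log_text)
    }


if __name__ == "__main__":
    sample = (
        "12/27/2016 15:48:33.147;01;PBS_Server.2692;Svr;PBS_Server;"
        "LOG_ERROR::get_node_from_str, Node node1 is reporting on node "
        "NapaValley, which pbs_server doesn't know about"
    )
    for known, reported in sorted(find_name_mismatches(sample)):
        print(f"server knows {known!r}, report names {reported!r}")
    # The name this machine gives itself, for comparison with the two above.
    print("local hostname:", socket.gethostname())
```

If the local hostname matches the reported name rather than the one in the server's nodes file, then either the -H option or the $mom_host setting above should bring the two into agreement.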