slurmd.service is Failed & there is no PID file /var/run/slurmd.pid


I am trying to start slurmd.service with the commands below, but it does not stay running. I would be grateful if you could help me resolve this issue!

systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle

This is the slurmd.service unit file:

$ cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity


[Install]
WantedBy=multi-user.target
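
Because the unit is Type=forking with PIDFile=/var/run/slurmd.pid, systemd waits for that PID file to appear and fails the start if it never does. One way to see why slurmd does not get that far is to run it in the foreground with extra verbosity, outside of systemd (a diagnostic step only; -D keeps slurmd in the foreground and -v can be repeated for more detail):

$ sudo systemctl stop slurmd
$ sudo /usr/sbin/slurmd -D -vvvv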

and this is the status of the node:

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpucompute*    up   infinite      1  drain fwb-lab-tesla1

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       root      2020-09-28T16:46:28 fwb-lab-tesla1

$ sinfo -Nl
Thu Oct  1 14:00:10 2020
NODELIST        NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
fwb-lab-tesla1      1 gpucompute*     drained   32   32:1:1  64000        0      1   (null) Low RealMemory  
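
Since the drain reason is "Low RealMemory", it may help to compare what the node actually detects against the RealMemory value configured for it in slurm.conf (shown below). slurmd -C prints the hardware configuration slurmd sees, and free shows the OS view; values below 64000 MB would explain the drain:

$ slurmd -C
$ free -m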

Here are the contents of slurm.conf:

$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Prevent very long time waits for mix serial/parallel in multi node environment 
SchedulerParameters=pack_serial_at_end
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
# Need slurmdbd for gres functionality
#AccountingStorageTRES=CPU,Mem,gres/gpu,gres/gpu:Titan
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
#NodeName=fwb-lab-tesla[1-32] Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=fwb-lab-tesla[1-32] Default=YES MaxTime=INFINITE State=UP
#NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 CPUs=32 State=UNKNOWN
PartitionName=gpucompute Nodes=fwb-lab-tesla1 Default=YES MaxTime=INFINITE State=UP
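
Note that with FastSchedule=1 the controller trusts this NodeName line, and it will typically drain the node with "Low RealMemory" if the node registers with less memory than RealMemory=64000. The controller's current view of the node, including the configured memory and the drain reason, can be checked with scontrol:

$ scontrol show node fwb-lab-tesla1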

There is no slurmd.pid in the path below. It appears there briefly after the system starts, but it is gone again after a few minutes.

$ ls /var/run/
abrt          cryptsetup         gdm            lvm             openvpn-server  slurmctld.pid   tuned
alsactl.pid   cups               gssproxy.pid   lvmetad.pid     plymouth        sm-notify.pid   udev
atd.pid       dbus               gssproxy.sock  mariadb         ppp             spice-vdagentd  user
auditd.pid    dhclient-eno2.pid  httpd          mdadm           rpcbind         sshd.pid        utmp
avahi-daemon  dhclient.pid       initramfs      media           rpcbind.sock    sudo            vpnc
certmonger    dmeventd-client    ipmievd.pid    mount           samba           svnserve        xl2tpd
chrony        dmeventd-server    lightdm        munge           screen          sysconfig       xrdp
console       ebtables.lock      lock           netreport       sepermit        syslogd.pid     xtables.lock
crond.pid     faillock           log            NetworkManager  setrans         systemd
cron.reboot   firewalld          lsm            openvpn-client  setroubleshoot  tmpfiles.d
[shirin@FWB-Lab-Tesla Seq2KMR33]$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-09-28 15:41:25 BST; 2 days ago
 Main PID: 1492 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─1492 /usr/sbin/slurmctld

Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Starting Slurm controller daemon...
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Started Slurm controller daemon.

I try to start slurmd.service, but it returns to the failed state after a few minutes:

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Tue 2020-09-29 18:11:25 BST; 1 day 19h ago
  Process: 25650 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/slurmd.service
           └─2986 /usr/sbin/slurmd

Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Starting Slurm node daemon...
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?) after start: No ...ctory
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service start operation timed out. Terminating.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Failed to start Slurm node daemon.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Unit slurmd.service entered failed state.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
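
The ellipsized "No ...ctory" line can be read in full from the journal, for example:

$ systemctl status slurmd -l
$ journalctl -u slurmd --no-pager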

Log output of starting slurmd:

[2020-09-29T18:09:55.074] Message aggregation disabled
[2020-09-29T18:09:55.075] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2020-09-29T18:09:55.075] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2020-09-29T18:09:55.075] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2020-09-29T18:09:55.075] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2020-09-29T18:09:55.095] slurmd version 17.11.7 started
[2020-09-29T18:09:55.096] error: Error binding slurm stream socket: Address already in use
[2020-09-29T18:09:55.096] error: Unable to bind listen port (*:6818): Address already in use
1 Answer

Accepted answer by Marcus Boden:

The log file states that it cannot bind to the standard slurmd port 6818, because something else is already using that address.

Do you have another slurmd running on this node? Or something else listening there? Try netstat -tulpen | grep 6818 to see what is using the address.
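
For example, ss can be used instead of netstat, and pgrep shows whether an old slurmd process (such as PID 2986 from the status output above) is still around; a sketch of the checks, adjusting the PID to whatever the tools report:

$ ss -tlnp | grep 6818
$ pgrep -a slurmd

If an orphaned slurmd turns out to be holding port 6818, stopping it and restarting the service should let the new daemon bind the port and write /var/run/slurmd.pid:

$ sudo kill 2986          # use the PID reported by ss/pgrep
$ sudo systemctl restart slurmd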