Hi Support,

We have configured slurmctld with ReturnToService=2 as we want nodes that regain a valid configuration to return to service, e.g.:

cm01:~ # grep -i return /etc/slurm/slurm.conf
ReturnToService=2

Our GPU nodes are, correctly I think, flagged as drained immediately after boot (Bright Cluster Manager generates /etc/slurm/gres.conf a touch too late for slurmd, I believe) but never return to service despite regaining a valid configuration shortly after. Given we've configured ReturnToService=2, I'm surprised by this behaviour; am I missing something? e.g.:

g001:~ # date
Thu Jun 30 11:33:51 AEST 2016
g001:~ # cat /etc/slurm/gres.conf
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
Name=gpu Count=0
Name=mic Count=0
# END AUTOGENERATED SECTION -- DO NOT REMOVE

... Bright generates gres.conf ...

g001:~ # date
Thu Jun 30 11:34:52 AEST 2016
g001:~ # cat /etc/slurm/gres.conf
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=mic Count=0
# END AUTOGENERATED SECTION -- DO NOT REMOVE

... Node drained ...

cm01:~ # date ; sinfo -lN | grep g001
Thu Jun 30 11:40:11 AEST 2016
g001    1    gpu    drained    16    2:8:1    129162    4086    1    gpu_ex,g    gres/gpu count too l

... node now looks "happy" ...
g001:~ # tail /var/log/slurmd
[2016-06-30T11:34:07.325] Slurmd shutdown completing
[2016-06-30T11:34:07.353] Message aggregation disabled
[2016-06-30T11:34:07.354] gpu 0 is device number 0
[2016-06-30T11:34:07.354] gpu 1 is device number 1
[2016-06-30T11:34:07.354] gpu 2 is device number 2
[2016-06-30T11:34:07.355] error: _cpu_freq_cpu_avail: Could not open /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
[2016-06-30T11:34:07.355] Resource spec: Reserved system memory limit not configured for this node
[2016-06-30T11:34:07.366] slurmd version 15.08.6 started
[2016-06-30T11:34:07.367] slurmd started on Thu, 30 Jun 2016 11:34:07 +1000
[2016-06-30T11:34:07.367] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=129162 TmpDisk=4086 Uptime=209 CPUSpecList=(null)

... stays drained though ...

cm01:~ # date ; sinfo -lN | grep g001
Thu Jun 30 11:52:40 AEST 2016
g001    1    gpu    drained    16    2:8:1    129162    4086    1    gpu_ex,g    gres/gpu count too l

I've waited much longer to see if it resumes service, but no luck. (Manually undraining the node is successful and is our current workaround.)

cm01:~ # scontrol update NodeName=g001 State=RESUME
cm01:~ # date ; sinfo -lN | grep g001
Thu Jun 30 11:53:15 AEST 2016
g001    1    gpu    idle    16    2:8:1    129162    4086    1    gpu_ex,g    none

Cheers,
James
Forgot to show how the GPU node is defined in slurm.conf:

cm01:~ # grep g001 /etc/slurm/slurm.conf
NodeName=g001 CoresPerSocket=8 Sockets=2 ThreadsPerCore=1 Gres=gpu:3 Feature=gpu_ex,gpu_pro,gpu_ex_pro,gpu_def
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=00:10:00 MaxTime=7-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=512 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO State=UP Nodes=g001
If I understand correctly, slurmd is starting up with a gres.conf that is missing the devices? slurmd does not re-read that file at any point until it receives a SIGHUP (or an 'scontrol reconfigure' for the cluster, though I don't believe that would work while the node is still marked as down). In that case the node never registers properly, and ReturnToService will not allow a node missing resources to start running jobs. It sounds like the real fix would be to make the slurmd service on the node depend on Bright finishing all of its setup tasks.
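If the node runs slurmd under systemd, one way to express that ordering would be a drop-in for the slurmd unit; this is only a sketch, and the unit names here ("slurmd.service", and "cmd.service" for the Bright CMDaemon that writes gres.conf) are assumptions about your setup, so adjust them to match whatever actually generates the file on your nodes:

```ini
# /etc/systemd/system/slurmd.service.d/after-cmd.conf
# Hypothetical drop-in: delay slurmd until the service that writes
# /etc/slurm/gres.conf (assumed here to be cmd.service) has started.
[Unit]
After=cmd.service
Wants=cmd.service
```

After adding the drop-in you would run 'systemctl daemon-reload' for it to take effect. If Bright populates gres.conf asynchronously after its service reports started, a stronger guard (e.g. an ExecStartPre check that the GPU device lines are present) may be needed instead.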
Hi Tim,

Thanks for clarifying how ReturnToService works in this instance. I'll pursue starting the slurmd service later as the solution. Please consider this ticket closed.

Cheers,
James