Ticket 2869

Summary: nodes staying offline despite "ReturnToService=2"
Product: Slurm Reporter: James Powell <James.Powell>
Component: slurmctld    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Version: 15.08.6   
Hardware: Linux   
OS: Linux   
Site: CSIRO

Description James Powell 2016-06-29 19:54:47 MDT
Hi Support,

We have configured slurmctld with ReturnToService=2 as we want nodes that regain a valid configuration to return to service:

e.g.
cm01:~ # grep -i return /etc/slurm/slurm.conf 
ReturnToService=2

Our GPU nodes are, correctly I think, flagged as drained immediately after boot (Bright Cluster Manager generates /etc/slurm/gres.conf a touch too late for slurmd, I believe), but they never return to service despite gaining a valid configuration shortly afterwards. Given we've configured ReturnToService=2, I'm surprised by this behaviour; am I missing something?

e.g.
g001:~ # date
Thu Jun 30 11:33:51 AEST 2016
g001:~ # cat /etc/slurm/gres.conf 
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
Name=gpu Count=0
Name=mic Count=0
# END AUTOGENERATED SECTION   -- DO NOT REMOVE

...
Bright generates gres.conf
...

g001:~ # date
Thu Jun 30 11:34:52 AEST 2016
g001:~ # cat /etc/slurm/gres.conf 
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=mic Count=0
# END AUTOGENERATED SECTION   -- DO NOT REMOVE


...
Node drained
...

cm01:~ # date ; sinfo -lN | grep g001
Thu Jun 30 11:40:11 AEST 2016
g001            1       gpu     drained   16    2:8:1 129162     4086      1 gpu_ex,g gres/gpu count too l

...
node now looks "happy"
...

g001:~ # tail /var/log/slurmd 
[2016-06-30T11:34:07.325] Slurmd shutdown completing
[2016-06-30T11:34:07.353] Message aggregation disabled
[2016-06-30T11:34:07.354] gpu 0 is device number 0
[2016-06-30T11:34:07.354] gpu 1 is device number 1
[2016-06-30T11:34:07.354] gpu 2 is device number 2
[2016-06-30T11:34:07.355] error: _cpu_freq_cpu_avail: Could not open /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
[2016-06-30T11:34:07.355] Resource spec: Reserved system memory limit not configured for this node
[2016-06-30T11:34:07.366] slurmd version 15.08.6 started
[2016-06-30T11:34:07.367] slurmd started on Thu, 30 Jun 2016 11:34:07 +1000
[2016-06-30T11:34:07.367] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=129162 TmpDisk=4086 Uptime=209 CPUSpecList=(null)

...
stays drained though
...

cm01:~ # date ; sinfo -lN | grep g001
Thu Jun 30 11:52:40 AEST 2016
g001            1       gpu     drained   16    2:8:1 129162     4086      1 gpu_ex,g gres/gpu count too l


I've waited much longer to see if it resumes service, but no luck. (Manually undraining the node succeeds and is our current workaround.)

cm01:~ # scontrol update NodeName=g001 State=RESUME
cm01:~ # date ; sinfo -lN | grep g001
Thu Jun 30 11:53:15 AEST 2016
g001            1       gpu        idle   16    2:8:1 129162     4086      1 gpu_ex,g none       

Cheers
James
Comment 1 James Powell 2016-06-29 20:03:48 MDT
Forgot to show how the GPU node is defined in slurm.conf:

cm01:~ # grep g001 /etc/slurm/slurm.conf 
NodeName=g001  CoresPerSocket=8 Sockets=2 ThreadsPerCore=1 Gres=gpu:3 Feature=gpu_ex,gpu_pro,gpu_ex_pro,gpu_def
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=00:10:00 MaxTime=7-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=512 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO State=UP Nodes=g001
Comment 2 Tim Wickberg 2016-06-29 20:55:17 MDT
If I understand correctly, slurmd is starting up with a gres.conf missing the devices? slurmd does not re-read the file at any point until a SIGHUP is sent to it. (Or 'scontrol reconfigure' for the cluster, but I don't believe this would work when the node is still marked as down.)
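
e.g. once Bright has finished writing the full gres.conf, a re-read could be forced manually (a sketch; killing by name assumes slurmd runs under its default process name):

g001:~ # pkill -HUP slurmd

or, cluster-wide, though as noted this may not help while the node is marked down:

cm01:~ # scontrol reconfigure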

In that case the node will not register properly; ReturnToService will not allow a node that is missing resources to start running jobs.

It sounds like the real fix would be to make the slurmd service on the node depend on Bright finishing up all of its setup tasks.
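
e.g. one way to express that dependency (a sketch, assuming systemd, and using "the autogenerated gres.conf lists GPU devices" as a proxy for "Bright has finished its setup tasks"; the drop-in path and grep pattern are illustrative):

g001:~ # cat /etc/systemd/system/slurmd.service.d/wait-for-gres.conf
[Service]
# Illustrative: block slurmd startup until gres.conf contains real GPU device entries
ExecStartPre=/bin/sh -c 'until grep -q "Name=gpu File=" /etc/slurm/gres.conf; do sleep 1; done'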
Comment 3 James Powell 2016-06-29 21:46:36 MDT
Hi Tim,

Thanks for clarifying how ReturnToService works in this instance. I'll pursue starting the slurmd service later in the boot sequence as the solution.

Please consider this ticket closed.

Cheers

James