Ticket 2193

Summary: Slurm allowing job with unavailable configuration after slurmctld restart
Product: Slurm Reporter: Akmal Madzlan <akmalm>
Component: slurmctld Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 14.11.10   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: scontrol show config output

Description Akmal Madzlan 2015-11-24 16:31:48 MST
[akmalm@hud10 ~]$ rjs test.job queue=teamtim gpu=1 mem=150G
sbatch: error: Batch job submission failed: Requested node configuration is not available

*slurmctld restarted*

[akmalm@hud10 ~]$ rjs test.job queue=teamtim gpu=1 mem=150G
830050

After a while (or maybe after an scontrol reconfigure?)
it went back to normal.

is it partly caused by FastSchedule=0?
Comment 1 Akmal Madzlan 2015-11-24 16:52:32 MST
using srun, same result

[akmalm@hud10 ~]$ srun -pteamtim --mem=2000000000 --constraint=gpu hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

*slurmctld restarted*

[akmalm@hud10 ~]$ srun -pteamtim --mem=2000000000 --constraint=gpu hostname
srun: job 830063 queued and waiting for resources
Comment 2 David Bigagli 2015-11-24 20:22:44 MST
I cannot reproduce this. Is it possible that some hosts became available
after the restart with the requested configuration?

David
Comment 3 Akmal Madzlan 2015-11-25 11:22:20 MST
> I cannot reproduce this. Is it possible that some hosts became available
> after the restart with the requested configuration?

I don't think so.
I'm able to reproduce this on a different cluster and on my multiple-slurmd test setup.

Have you tried with a large number of nodes and FastSchedule=0?
Comment 4 Akmal Madzlan 2015-11-25 17:53:54 MST
Created attachment 2452 [details]
scontrol show config output

scontrol show config output
Comment 5 Moe Jette 2015-11-26 02:17:28 MST
(In reply to Akmal Madzlan from comment #0)
> [akmalm@hud10 ~]$ rjs test.job queue=teamtim gpu=1 mem=150G
> sbatch: error: Batch job submission failed: Requested node configuration is
> not available
> 
> * slurmctld restarted*
> 
> [akmalm@hud10 ~]$ rjs test.job queue=teamtim gpu=1 mem=150G
> 830050
> 
> after a while, (or maybe after scontrol reconfigure?)
> it went back to normal
> 
> is it partly caused by FastSchedule=0?

What is your memory size (e.g., "RealMemory=x") for the nodes defined in slurm.conf?

If not specified, I believe it defaults to 1 MB, and until the compute nodes register with their actual size, anything requesting more memory gets the error indicating that no nodes exist with the specified size. The reasoning is that it is better to reject a job at submit time that will never run than to let it sit in the queue indefinitely. You should set FastSchedule=1 and define a minimum memory size for the nodes in slurm.conf, such that if a node reports less memory it is set DOWN, and if it reports more, the size is raised to the reported value.
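A minimal slurm.conf sketch of that suggestion; the node name, CPU count, and memory size here are illustrative, not taken from the reporter's actual config:

```
# Trust the configured values at submit time instead of waiting for
# slurmd registration (FastSchedule=1).
FastSchedule=1
# RealMemory is the minimum expected memory (in MB): a node registering
# with less is set DOWN; a node registering with more has its size raised.
NodeName=kud13 CPUs=8 RealMemory=15947 State=UNKNOWN
```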
Comment 6 Akmal Madzlan 2015-11-26 13:21:15 MST
> What is your memory size (e.g.. "RealMemory=x") for the nodes defined in slurm.conf?

We specify RealMemory for each node with the real value, or close to it.

> until the compute nodes register with their actual size, anything requesting more memory would get the error indicating that no nodes exist with the specified size.

I don't think this is the behaviour with FastSchedule=0.

> You should set FastSchedule=1 and define a minimum memory size for the nodes in slurm.conf

We'd rather keep FastSchedule=0: we don't want a node to be drained if it loses a DIMM, but at the same time we don't want the node to be over-allocated.
Comment 7 Moe Jette 2015-11-27 03:56:05 MST
(In reply to Akmal Madzlan from comment #6)

> > until the compute nodes register with their actual size, anything requesting more memory would get the error indicating that no nodes exist with the specified size.
> 
> I dont think this is the behaviour with FastSchedule=0

That is exactly how it works, but if you define a reasonable memory size in slurm.conf then it works fine. If you specify less than 150G, the submission fails as you report, as shown below:

Slurmctld restart, slurmd not up yet:
$ scontrol show node
NodeName=jette CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 CPULoad=N/A Features=(null)
   Gres=(null)
   NodeAddr=jette-desktop NodeHostName=jette-desktop Version=(null)
   RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=4 Boards=1
   State=UNKNOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=None SlurmdStartTime=None
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ sbatch -N1 --mem=1g tmp
sbatch: error: Batch job submission failed: Requested node configuration is not available
===============================================

Slurmd started/responds:
$ sbatch -N1 --mem=1g tmp
Submitted batch job 96464

$ scontrol show node
NodeName=jette Arch=i686 CoresPerSocket=1
   CPUAlloc=1 CPUErr=0 CPUTot=4 CPULoad=0.77 Features=(null)
   Gres=(null)
   NodeAddr=jette-desktop NodeHostName=jette-desktop Version=15.08
   OS=Linux RealMemory=3774 AllocMem=1024 FreeMem=1312 Sockets=4 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=466925 Weight=1 Owner=N/A
   BootTime=2015-11-27T09:07:44 SlurmdStartTime=2015-11-27T09:49:45
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Comment 8 Moe Jette 2015-11-27 04:12:04 MST
I forgot to say that in my example there was no RealMemory value for the node in slurm.conf, so it defaulted to 1 MB. Whatever is in slurm.conf is taken as the size until the slurmd on a compute node reports a different value to the slurmctld on the head node.
Comment 9 Akmal Madzlan 2015-11-27 10:39:25 MST
---------------------------------------
[root@kque0001 ~]# service slurm restart && scontrol show node kud13 && scontrol show partition kud13 && sbatch -N1 --mem=99999g --partition=kud13 --wrap="hostname"
stopping slurmctld:                                        [  OK  ]
slurmctld is stopped
slurmctld is stopped
starting slurmctld:                                        [  OK  ]
NodeName=kud13 CoresPerSocket=2
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=N/A Features=localdisk,nogpu,intel
   Gres=(null)
   NodeAddr=kud13 NodeHostName=kud13 Version=(null)
   RealMemory=1 AllocMem=0 Sockets=2 Boards=1
   State=UNKNOWN ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

PartitionName=kud13
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=kud13
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=8 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Submitted batch job 3452375
-----------------------------------------

Job is successfully submitted

--------------------------------
[root@kque0001 ~]# scontrol show jobs 3452375
JobId=3452375 JobName=wrap
   UserId=root(0) GroupId=root(0)
   Priority=100 Nice=0 Account=root QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2015-11-28T08:33:08 EligibleTime=2015-11-28T08:33:08
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=kud13 AllocNode:Sid=kque0001:7143
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=97.66T MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/root
   StdErr=/root/slurm-3452375.out
   StdIn=/dev/null
   StdOut=/root/slurm-3452375.out
------------------------------

A few seconds after scontrol reconfigure, it seems to behave well again

-------------------------------
[root@kque0001 ~]# scontrol reconfigure
[root@kque0001 ~]# srun -pkud13 --mem=9999999999 hostname
srun: job 3452377 queued and waiting for resources
^Csrun: Job allocation 3452377 has been revoked
srun: Force Terminated job 3452377
[root@kque0001 ~]# srun -pkud13 --mem=9999999999 hostname
srun: job 3452378 queued and waiting for resources
^Csrun: Job allocation 3452378 has been revoked
[root@kque0001 ~]# srun -pkud13 --mem=9999999999 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
[root@kque0001 ~]# sbatch -N1 --mem=99999g --partition=kud13 --wrap="hostname"
sbatch: error: Batch job submission failed: Requested node configuration is not available
-----------------------------

Can you explain this behaviour?
Maybe I missed some configuration somewhere
Comment 10 Tim Wickberg 2015-11-27 11:43:51 MST
I'll look into this further next week, but I assume what you're seeing
is that Slurm doesn't know the maximum memory on the nodes until they've
checked in after the restart; this is an asynchronous process so that
slurmctld startup is not delayed indefinitely.

Until kud13 reports in with an accurate memory count, Slurm assumes the
node may have sufficient memory to satisfy the job's request. Once the
slurmd on the node has reported in to the slurmctld, the controller
knows such a job will never run on the current hardware, so it rejects
new requests.

As Moe has previously mentioned, setting a RealMemory value in
slurm.conf for your nodes should prevent this.

- Tim

Comment 13 Alejandro Sanchez 2015-12-10 00:23:39 MST
This behavior is also reproducible on 15.08.

I've also seen that if the slurmds are initially down and the controller is restarted, a job submitted with --mem=9999g is accepted, and after a while further submissions are rejected. squeue shows the pending job's reason as (Resources), but once the slurmds are started, the reason changes to (BadConstraint).
Comment 18 Alejandro Sanchez 2015-12-10 02:23:20 MST
Akmal,

With FastSchedule=0, Slurm doesn't have enough information to determine a node's memory size until its slurmd registers and reports the real memory. So Slurm allows jobs to be submitted until it knows for certain that they will never run, and from that point onwards it rejects them.

So we suggest configuring a reasonable memory size on the nodes if you use FastSchedule=0.

Please let us know if this makes sense to you and I'll close the ticket. Otherwise, tell us your concerns and we'll try to resolve/explain them in more detail.

Alex
Comment 19 Akmal Madzlan 2015-12-10 11:21:23 MST
> So we suggest to configure some reasonable memory size on the nodes if you use FastSchedule=0.

But the thing is, setting RealMemory to a reasonable value doesn't prevent the jobs from being submitted:

[root@klugy ~]# service slurm restart && scontrol show node kud13 && scontrol show partition kud13 && sbatch -N1 --mem=99999g --partition=kud13 --wrap="hostname"

# Slurm restarted
stopping slurmctld:                                        [  OK  ]
starting slurmctld:                                        [  OK  ]

# RealMemory is set to 15947
NodeName=kud13 CoresPerSocket=2
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=N/A Features=localdisk,nogpu,intel
   Gres=(null)
   NodeAddr=kud13 NodeHostName=kud13 Version=(null)
   RealMemory=15947 AllocMem=0 Sockets=2 Boards=1
   State=UNKNOWN ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   
# This partition only contain this node
PartitionName=kud13
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=kud13
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=8 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

# Job submitted successfully
Submitted batch job 41207

# After a while, job is blocked
[root@klugy ~]# sbatch -N1 --mem=99999g --partition=kud13 --wrap="hostname"
sbatch: error: Batch job submission failed: Requested node configuration is not available
[root@klugy ~]# sbatch -N1 --mem=99999g --partition=kud13 --wrap="hostname"
sbatch: error: Batch job submission failed: Requested node configuration is not available
Comment 20 Alejandro Sanchez 2015-12-11 02:06:37 MST
Akmal,

Because the RealMemory option tells Slurm the _minimum_ memory size on the node, and Slurm is not designed to reject jobs for bad constraints unless it knows for sure (from slurmd's registration) that the constraints really cannot be met. That is why the controller allows job submission until it has talked with the node. Does that make sense?
Comment 22 Alejandro Sanchez 2015-12-11 06:39:40 MST
Akmal,

Another option, if you want to reject jobs requesting too much memory before Slurm knows how much memory is on each node, is to use a job_submit plugin. It would look at each job's memory request and reject those over some limit with an appropriate error code.

I don't know whether you have implemented a job_submit plugin before. Here's some information: http://slurm.schedmd.com/job_submit_plugins.html
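For what it's worth, a rough sketch of such a check using the job_submit/lua plugin. The limit, the job_desc field name, and the return codes are illustrative; check them against the linked documentation for your Slurm version before using anything like this:

```lua
-- Sketch only: verify field names (e.g. pn_min_memory, which may carry a
-- per-CPU flag in its high bit) and return codes against your version's
-- job_submit/lua API before deploying.
local MAX_MEM_MB = 512 * 1024  -- hypothetical: largest node in the cluster

function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.pn_min_memory ~= nil and
      job_desc.pn_min_memory > MAX_MEM_MB then
      slurm.log_user("requested memory exceeds any node in this cluster")
      return slurm.ERROR
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end
```

This runs at submit time in the controller, so it rejects over-sized requests even before any slurmd has registered.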

Also if you need any help, please let us know. I think we can close this ticket if there aren't more questions.
Comment 23 Alejandro Sanchez 2015-12-13 21:11:38 MST
Closing the ticket. Please, reopen if any more issues are found.
Comment 24 Akmal Madzlan 2015-12-16 11:35:05 MST
Thanks Alejandro :D