Ticket 599

Summary: Exceeding MaxCPUs is indicated by an inappropriate message
Product: Slurm
Reporter: Puenlap Lee <puen-lap.lee>
Component: Scheduling
Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: da
Version: 2.6.x
Hardware: Linux
OS: Linux
Site: Atos/Eviden Sites

Description Puenlap Lee 2014-02-26 04:10:21 MST
One of our testers was misled into thinking that the requested node configuration is not available when it is available.  Here is his test case.

I use the QOS GJ, created like this:
sacctmgr -i create qos name=GJ MaxCPUs=4 MaxJobs=4 MaxNodes=2 MaxNodesPerUser=5 flags=DenyOnLimit

Six nodes are available in the default partition:
[jouvin@valx0 slurmtest]$ sinfo -p mpi
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
mpi*         up   infinite      6   idle valx[31-34,41-42]

But when I request 2 nodes with --exclusive, srun reports that the requested node configuration is not available:
[jouvin@valx0 slurmtest]$ srun -J GJ1 -N2 --exclusive --qos=gj hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

This message is inappropriate: the nodes are available. The real problem is the number of CPUs, because with --exclusive the job takes 2 * 8 = 16 CPUs, which exceeds MaxCPUs=4.
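
As a rough check of that arithmetic, here is a minimal sketch. It assumes 8 CPUs per node, as the "(2 * 8 cpus)" figure in the report implies, and that an exclusive job counts every CPU of each allocated node against the limit. The variable names are illustrative and are not Slurm identifiers.

```python
# Illustrative arithmetic only; these names are hypothetical, not Slurm APIs.
nodes_requested = 2   # from -N2
cpus_per_node = 8     # assumed from "(2 * 8 cpus)" in the report
max_cpus = 4          # MaxCPUs set on the QOS

# Assumption: with --exclusive, every CPU on each allocated node is counted.
cpus_allocated = nodes_requested * cpus_per_node
exceeds_limit = cpus_allocated > max_cpus

print(cpus_allocated, exceeds_limit)  # 16 True
```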

Requesting two nodes is allowed:
[jouvin@valx0 slurmtest]$ srun -J GJ2 -N2 --qos=gj hostname
valx32
valx31

But requesting more than 4 CPUs (MaxCPUs) violates the policy:
[jouvin@valx0 slurmtest]$ srun -J GJ3 -n8 --qos=gj hostname
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Here the message is clear; it is the one that should be issued upon submission of the first job (GJ1).
Comment 1 David Bigagli 2014-02-28 06:40:13 MST
Hi,
   this depends on your configuration. The messages differ because, in the case of 'srun -J GJ1 -N2 --exclusive --qos=gj hostname', the select plugin must run to determine that there are no hosts that can be used, and that is why we see the 'Requested node configuration is not available' error message. On the other hand, if your cluster had nodes with one or two CPUs, the job could have run. A job request such as 'srun -J GJ3 -n8 --qos=gj hostname' is much easier to account against the QOS limit, and thus the message is more direct.
We may consider this as a future enhancement.

David
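
The dependence on node size described in Comment 1 can be illustrated with a toy model. This is only a sketch under stated assumptions, not Slurm's select plugin: `exclusive_job_fits` is a hypothetical helper that asks whether any set of whole nodes stays within the CPU limit, and the 2-CPU cluster is invented for comparison.

```python
def exclusive_job_fits(node_cpus, nodes_requested, max_cpus):
    """Toy check: can some set of whole nodes stay within the CPU limit?

    node_cpus is a list with the CPU count of each available node.
    """
    if nodes_requested > len(node_cpus):
        return False
    # The cheapest choice is the nodes with the fewest CPUs.
    smallest = sorted(node_cpus)[:nodes_requested]
    return sum(smallest) <= max_cpus


# Six 8-CPU nodes, as in the report: 2 exclusive nodes need 16 CPUs > 4.
print(exclusive_job_fits([8] * 6, nodes_requested=2, max_cpus=4))  # False

# Hypothetical cluster of 2-CPU nodes: 2 exclusive nodes need 4 CPUs <= 4.
print(exclusive_job_fits([2] * 6, nodes_requested=2, max_cpus=4))  # True
```

By contrast, a request such as -n8 states its CPU count directly, so it can be compared with MaxCPUs without looking at the nodes at all.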