Ticket 599 - Exceeding MaxCPUs is indicated by an inappropriate message
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 2.6.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
 
Reported: 2014-02-26 04:10 MST by Puenlap Lee
Modified: 2014-02-28 06:40 MST
Site: Atos/Eviden Sites


Description Puenlap Lee 2014-02-26 04:10:21 MST
One of our testers was misled into thinking that the requested node configuration is not available when it is available.  Here is his test case.

I use the QOS GJ, created like this:
sacctmgr -i create qos name=GJ MaxCPUs=4 MaxJobs=4 MaxNodes=2 MaxNodesPerUser=5 flags=DenyOnLimit

6 nodes are available in the default partition:
[jouvin@valx0 slurmtest]$ sinfo -p mpi
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
mpi*         up   infinite      6   idle valx[31-34,41-42]

But when I request 2 nodes, I am told that the requested node configuration is not available:
[jouvin@valx0 slurmtest]$ srun -J GJ1 -N2 --exclusive --qos=gj hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

This message is inappropriate: the nodes are available. The real problem is the number of CPUs, because with --exclusive the job gets 2 * 8 CPUs, which is more than MaxCPUs.

Requesting two nodes is allowed:
[jouvin@valx0 slurmtest]$ srun -J GJ2 -N2 --qos=gj hostname
valx32
valx31

But requesting more than 4 CPUs (MaxCPUs) violates the policy:
[jouvin@valx0 slurmtest]$ srun -J GJ3 -n8 --qos=gj hostname
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Here the message is clear. This is the message that should be issued on submission of the first job (GJ1).
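
The arithmetic behind the three test jobs can be sketched as follows. This is only an illustrative model, not Slurm code: the function name cpus_charged is hypothetical, the 8 CPUs per node figure is taken from the "2 * 8 cpus" remark above, and the non-exclusive case assumes one CPU per task with at least one task per node.

```python
def cpus_charged(nodes, cpus_per_node, tasks=None, exclusive=False):
    """CPUs a job would count against MaxCPUs in this simplified model."""
    if exclusive:
        # Whole nodes are allocated, so every CPU on each node counts.
        return nodes * cpus_per_node
    # Assumption: one CPU per task, and at least one task per node.
    return max(tasks or nodes, nodes)


MAX_CPUS = 4        # MaxCPUs=4 from the QOS definition
CPUS_PER_NODE = 8   # implied by "2 * 8 cpus" in the report

jobs = {
    "GJ1": cpus_charged(2, CPUS_PER_NODE, exclusive=True),  # -N2 --exclusive
    "GJ2": cpus_charged(2, CPUS_PER_NODE),                  # -N2
    "GJ3": cpus_charged(1, CPUS_PER_NODE, tasks=8),         # -n8
}

for name, cpus in jobs.items():
    verdict = "over limit" if cpus > MAX_CPUS else "within limit"
    print(name, cpus, verdict)
```

Under these assumptions GJ1 and GJ3 both exceed MaxCPUs, even though only GJ3 is rejected with the QOS policy message.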
Comment 1 David Bigagli 2014-02-28 06:40:13 MST
Hi, 
   this depends on your configuration. The reason for the different message is that in the case of 'srun -J GJ1 -N2 --exclusive --qos=gj hostname' the select plugin must run to determine that there are no hosts that can be used, and that is why we see the 'Requested node configuration is not available' error message. On the other hand, if your cluster had nodes with one or two CPUs, the job could have run. A job request of type 'srun -J GJ3 -n8 --qos=gj hostname' is much easier to account against the QOS limit, and thus the message is more direct.
We may consider this as a future enhancement.

David
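
The point that the outcome depends on the node sizes can be illustrated with a small sketch. It is an assumption-laden model, not the select plugin's actual logic: the function exclusive_job_can_fit and the example node lists are hypothetical, and it only checks whether the smallest available nodes stay within MaxCPUs when allocated whole.

```python
def exclusive_job_can_fit(node_cpus, nodes_requested, max_cpus):
    """Return True if some set of whole nodes stays within max_cpus.

    node_cpus is a list with the CPU count of each available node.
    """
    if nodes_requested > len(node_cpus):
        return False
    # The cheapest possible exclusive allocation uses the smallest nodes.
    smallest = sorted(node_cpus)[:nodes_requested]
    return sum(smallest) <= max_cpus


# Six 8-CPU nodes, as in the reported cluster: 2 * 8 > 4, so no fit.
print(exclusive_job_can_fit([8] * 6, nodes_requested=2, max_cpus=4))

# Hypothetical cluster that also has two 2-CPU nodes: 2 + 2 <= 4, so it fits.
print(exclusive_job_can_fit([8, 8, 2, 2], nodes_requested=2, max_cpus=4))
```

The answer cannot be read off the request alone, as it can for -n8: it requires looking at the actual nodes.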