Ticket 207

Summary: QoS limits enforcement: 0-valued per-user used limits do not prevent pending jobs from being executed
Product: Slurm Reporter: Puenlap Lee <puen-lap.lee>
Component: slurmctld    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: da
Version: 2.4.x   
Hardware: Linux   
OS: Linux   
Site: CEA - TGCC
Attachments: See the comments in the enclosed patch file.

Description Puenlap Lee 2013-01-15 02:11:00 MST
Created attachment 175 [details]
See the comments in the enclosed patch file.

Bug report by the CEA

It appears that pending jobs can be executed on the cluster even when their associated QoS is configured with 0-valued per-user "used limits".

This problem can be reproduced with slurm-2.4.3 (cons_res + backfilling) by modifying the limits with a command such as:

sacctmgr -i update qos where name=normal set maxjobs=0 maxsubmit=0

and then :

- ensure that pending jobs have a reason of "(QOSResourceLimit)"
- stop slurmdbd using "service slurmdbd stop"
- stop slurmctld using "service slurm stop" on the controller node
- start slurmctld using "service slurm start" on the controller node
- wait a little (about 10 seconds) so that the controller runs without DBD support for a while
- start slurmdbd
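The reproduction steps above can be sketched as a shell sequence. This is a sketch only: it assumes a disposable Slurm 2.4.x test cluster and the init-script names "slurm" and "slurmdbd" used in the report; the squeue format codes shown (%i job id, %u user, %r reason) are standard but worth checking against your version's man page.

```shell
# Reproduction sketch for the 0-valued QoS limit bypass (assumes a test
# cluster running slurm-2.4.3 with cons_res + backfill).

# Set per-user limits to 0 so no job should ever be submitted or started.
sacctmgr -i update qos where name=normal set maxjobs=0 maxsubmit=0

# Pending jobs should now show the reason "(QOSResourceLimit)".
squeue --states=PENDING --format="%i %u %r"

# Restart slurmctld while slurmdbd is down.
service slurmdbd stop
service slurm stop        # on the controller node
service slurm start       # on the controller node
sleep 10                  # let the controller run without DBD support
service slurmdbd start

# After the controller re-registers with the DBD, the backfill logic
# incorrectly starts one pending job per user despite maxjobs=0.
```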

After that, the controller registers with the DBD. After a while, the backfill logic is triggered and starts one pending job per pending user.

If you repeat the same protocol, a new batch of pending jobs is started regardless of the associated QoS limits.
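A quick way to confirm the incorrect behaviour (hypothetical check commands, assuming the same test cluster; the %q format code for QoS is an assumption to verify against your squeue man page) is to show that jobs are running even though the limits are still zero:

```shell
# The limits should still read 0 for the "normal" QoS...
sacctmgr show qos normal format=Name,MaxJobs,MaxSubmitJobs

# ...yet running jobs appear, which MaxJobs=0 should forbid.
squeue --states=RUNNING --format="%i %u %q %r"
```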


Enclosed is a proposed patch written by Matthieu Hautreux; could you take a look to see if it is OK?

Thanks.
Comment 1 Moe Jette 2013-01-15 03:01:05 MST
I have confirmed the problem, and Matthieu's patch is good. We will probably not be releasing another 2.4.x version of Slurm, but I have applied the patch to v2.5.2 (the logic in the new patch is unchanged, but the surrounding code moved around a bit in v2.5). The commit for v2.5 is here:

https://github.com/SchedMD/slurm/commit/4136520d9481b1df7739d894319eddf50e68530e

Thanks to Matthieu for the patch.