Ticket 12779

Summary: Partition MaxCpusPerNode broken with select/cons_tres
Product: Slurm    Reporter: Marshall Garey <marshall>
Component: Limits    Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---    CC: nick
Version: 21.08.0
Hardware: Linux
OS: Linux
Site: SchedMD
Version Fixed: 21.08.4 22.05.0pre1

Description Marshall Garey 2021-10-28 14:49:32 MDT
slurm.conf:

* SelectType=select/cons_tres
* MaxCpusPerNode limit on a partition.
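A minimal slurm.conf sketch matching the setup above (the node and partition names here are hypothetical, not taken from the ticket):

```
# Hypothetical minimal configuration reproducing the report
SelectType=select/cons_tres
NodeName=node1 CPUs=8
PartitionName=debug Nodes=node1 MaxCpusPerNode=4 Default=YES
```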

Easy reproducer:

* Set MaxCpusPerNode=4.
* Submit 4 single-core jobs (assuming 1 thread per core) to a specific node (-w flag).
* Three jobs run. The fourth job pends with reason RESOURCES because slurmctld incorrectly decides the limit has been reached.

If only one CPU remains before the MaxCpusPerNode limit is reached, slurmctld refuses to run the job: the limit check is off by one, rejecting a job that would exactly reach the limit.
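The failure mode is a classic off-by-one in the limit comparison. A minimal sketch of the buggy versus corrected check (hypothetical helper names, not the actual cons_tres code):

```python
MAX_CPUS_PER_NODE = 4  # partition limit from the reproducer above

def can_start_buggy(cpus_in_use: int, cpus_requested: int) -> bool:
    # Buggy check: treats exactly reaching the limit as exceeding it,
    # so the last CPU under MaxCpusPerNode can never be allocated.
    return cpus_in_use + cpus_requested < MAX_CPUS_PER_NODE

def can_start_fixed(cpus_in_use: int, cpus_requested: int) -> bool:
    # Corrected check: a job may use CPUs up to and including the limit.
    return cpus_in_use + cpus_requested <= MAX_CPUS_PER_NODE

# With 3 single-core jobs running, a 4th single-core job should start:
print(can_start_buggy(3, 1))  # False: the bug leaves the job pending
print(can_start_fixed(3, 1))  # True: the limit is exactly reached, not exceeded
```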

This bug exists only in select/cons_tres.

I have a patch and will submit it to the review queue.
Comment 16 Marshall Garey 2021-11-12 15:31:05 MST
We fixed this in 21.08 in commit 288631a9cf and we did a little bit of extra cleanup in master (for 22.05).

Closing this as fixed.