Ticket 12779

Summary: Partition MaxCpusPerNode broken with select/cons_tres
Product: Slurm    Reporter: Marshall Garey <marshall>
Component: Limits    Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---    CC: nick
Version: 21.08.0
Hardware: Linux
OS: Linux
Site: SchedMD
Version Fixed: 21.08.4 22.05.0pre1

Description Marshall Garey 2021-10-28 14:49:32 MDT
slurm.conf:

* SelectType=select/cons_tres
* MaxCpusPerNode limit on a partition.
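A minimal slurm.conf sketch matching the setup above (the node and partition names here are hypothetical, not taken from the ticket):

```
# Hypothetical minimal configuration reproducing the report
SelectType=select/cons_tres
NodeName=node1 CPUs=8
PartitionName=debug Nodes=node1 MaxCpusPerNode=4 Default=YES
```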

Easy reproducer:

* Set MaxCpusPerNode=4.
* Submit 4 single-core jobs (assuming 1 thread per core) to a specific node (-w flag).
* Three jobs run. The fourth job pends with reason RESOURCES because slurmctld incorrectly decides the limit has been reached.

If only one CPU remains before the MaxCpusPerNode limit is reached, slurmctld refuses to run the job: the limit check is off by one, rejecting a job that would exactly reach the limit.
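The failure mode is a classic off-by-one in the limit comparison. A minimal sketch of the buggy versus corrected check (hypothetical helper names, not the actual cons_tres code):

```python
MAX_CPUS_PER_NODE = 4  # partition limit from the reproducer above

def can_start_buggy(cpus_in_use: int, cpus_requested: int) -> bool:
    # Buggy check: treats exactly reaching the limit as exceeding it,
    # so the last CPU under MaxCpusPerNode can never be allocated.
    return cpus_in_use + cpus_requested < MAX_CPUS_PER_NODE

def can_start_fixed(cpus_in_use: int, cpus_requested: int) -> bool:
    # Corrected check: a job may use CPUs up to and including the limit.
    return cpus_in_use + cpus_requested <= MAX_CPUS_PER_NODE

# With 3 single-core jobs running, a 4th single-core job should start:
print(can_start_buggy(3, 1))  # False: the bug leaves the job pending
print(can_start_fixed(3, 1))  # True: the limit is exactly reached, not exceeded
```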

This bug exists only in select/cons_tres.

I have a patch and will submit it to the review queue.
Comment 16 Marshall Garey 2021-11-12 15:31:05 MST
We fixed this in 21.08 in commit 288631a9cf and we did a little bit of extra cleanup in master (for 22.05).

Closing this as fixed.