Ticket 12779 - Partition MaxCpusPerNode broken with select/cons_tres
Summary: Partition MaxCpusPerNode broken with select/cons_tres
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 21.08.0
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-10-28 14:49 MDT by Marshall Garey
Modified: 2021-11-12 15:31 MST

See Also:
Site: SchedMD
Version Fixed: 21.08.4, 22.05.0pre1


Description Marshall Garey 2021-10-28 14:49:32 MDT
slurm.conf:

* SelectType=select/cons_tres
* A partition with a MaxCpusPerNode limit set.
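A minimal slurm.conf fragment matching this setup might look like the following sketch (the node and partition names, and the CPU count, are made-up values for illustration):

```
SelectType=select/cons_tres
NodeName=node01 CPUs=8 State=UNKNOWN
PartitionName=debug Nodes=node01 MaxCpusPerNode=4 Default=YES State=UP
```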

Easy reproducer:

* Set MaxCpusPerNode=4.
* Submit 4 single-core jobs (assuming 1 thread per core) to a specific node (-w flag).
* Only three jobs run. The fourth pends with reason Resources because slurmctld incorrectly concludes the limit has been reached.
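The steps above can be sketched as a shell transcript (the node name node01 and the sleep payload are assumptions; this requires a live cluster with the configuration above):

```shell
# Submit four single-core jobs pinned to one node (MaxCpusPerNode=4).
for i in 1 2 3 4; do
    sbatch -n1 -w node01 --wrap="sleep 300"
done

# With the bug, only three jobs start; the fourth pends with reason Resources.
squeue -w node01 -o "%i %t %r"
```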

More generally: whenever a node has exactly one CPU left before reaching the MaxCpusPerNode limit, slurmctld refuses to start a job on that node, even though the job would fit within the limit.

This bug exists only in select/cons_tres.

I have a patch and will submit it to the review queue.
Comment 16 Marshall Garey 2021-11-12 15:31:05 MST
We fixed this in 21.08 in commit 288631a9cf and we did a little bit of extra cleanup in master (for 22.05).

Closing this as fixed.