Ticket 18251

Summary: Change in behavior with 23.02 and Slurm not assuming >1 nodes for ntasks and ntasks-per-node
Product: Slurm
Reporter: Trey Dockendorf <tdockendorf>
Component: Scheduling
Assignee: Tyler Connel <tyler>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: troy, tyler
Version: 23.02.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=18217
Site: Ohio State OSC
Attachments: slurm.conf

Description Trey Dockendorf 2023-11-21 08:38:03 MST
Created attachment 33404 [details]
slurm.conf

We are testing 23.02.6 after running 22.05.x for a while and have noticed that "--ntasks=4 --ntasks-per-node=2" no longer requests a 2-node job, which is causing issues on partitions where we have MinNodes=2.

Example:


$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash
salloc: error: Job submit/allocate failed: Node count specification invalid

The debug log:

Nov 21 10:33:18 owens-slurm01-test slurmctld[82880]: debug2: _part_access_check: Job requested for nodes (1) smaller than partition parallel(2) min nodes

The partition:

PartitionName=parallel DefaultTime=01:00:00 DefMemPerCPU=4315 DenyAccounts=<OMIT LONG LIST> MaxCPUsPerNode=28 MaxMemPerCPU=4315 MaxNodes=81 MaxTime=4-00:00:00 MinNodes=2 Nodes=cpu OverSubscribe=EXCLUSIVE PriorityJobFactor=2000 State=UP
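
For context on the mismatch: with --ntasks=4 and --ntasks-per-node=2 we'd expect the job to need 4 / 2 = 2 nodes, which would satisfy MinNodes=2, yet the log above shows the request being evaluated as a 1-node job.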

Is this change in behavior expected? I wasn't sure whether we'd run into a bug or whether this is an intentional change. I did a quick look through the release notes and nothing jumped out at me.
Comment 1 Tyler Connel 2023-11-21 15:00:01 MST
Hello Trey,

I suspect the behavior you're seeing is the same issue reported in the linked ticket (18217).

The issue is that when --ntasks-per-node is provided, it recalculates and supersedes the value passed via --ntasks. In that ticket, the issue was also found on 23.02.
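
As a rough illustration (untested on my end; the command below simply reuses the account and partition from your example): as you describe, 22.05 derived a 2-node request from the two options (4 tasks / 2 tasks per node = 2 nodes), which is why the MinNodes=2 check used to pass. Until you're on a release with the fix, explicitly requesting the node count should sidestep the partition check:

$ salloc --nodes=2 --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash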

I'll resolve this as a duplicate for now, but please do reach out if the fix provided through 18217 does not resolve your issue as well.

Best,
Tyler Connel

*** This ticket has been marked as a duplicate of ticket 18217 ***