| Summary: | Change in behavior with 23.02 and Slurm not assuming >1 nodes for ntasks and ntasks-per-node | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Trey Dockendorf <tdockendorf> |
| Component: | Scheduling | Assignee: | Tyler Connel <tyler> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | troy, tyler |
| Version: | 23.02.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=18217 | | |
| Site: | Ohio State OSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm.conf | | |
Hello Trey,

I suspect the behavior you're seeing is a duplicate of the linked ticket (18217): when --ntasks-per-node is provided, Slurm recalculates the node count and supersedes the value implied by --ntasks. In that ticket the issue was also found on 23.02.

I'll resolve this as a duplicate for now, but please do reach out if the fix provided through 18217 does not resolve your issue as well.

Best,
Tyler Connel

*** This ticket has been marked as a duplicate of ticket 18217 ***
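For reference, the node count the reporter expects follows from simple ceiling division: at most --ntasks-per-node tasks fit on each node, so --ntasks=4 with --ntasks-per-node=2 implies at least 2 nodes, which would satisfy the partition's MinNodes=2. A minimal sketch of that arithmetic (illustrative only, not Slurm source):

```sh
# Sketch: minimum node count implied by a task layout.
# ceil(ntasks / ntasks_per_node) via integer arithmetic.
ntasks=4
ntasks_per_node=2
min_nodes=$(( (ntasks + ntasks_per_node - 1) / ntasks_per_node ))
echo "minimum nodes implied: ${min_nodes}"   # prints 2
```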
Created attachment 33404 [details]
slurm.conf

We are testing 23.02.6 after running 22.05.x for a while and have noticed that "--ntasks=4 --ntasks-per-node=2" no longer requests a 2-node job, which is causing issues on partitions where we have MinNodes=2.

Example:

```
$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash
salloc: error: Job submit/allocate failed: Node count specification invalid
```

The debug log:

```
Nov 21 10:33:18 owens-slurm01-test slurmctld[82880]: debug2: _part_access_check: Job requested for nodes (1) smaller than partition parallel(2) min nodes
```

The partition:

```
PartitionName=parallel DefaultTime=01:00:00 DefMemPerCPU=4315 DenyAccounts=<OMIT LONG LIST> MaxCPUsPerNode=28 MaxMemPerCPU=4315 MaxNodes=81 MaxTime=4-00:00:00 MinNodes=2 Nodes=cpu OverSubscribe=EXCLUSIVE PriorityJobFactor=2000 State=UP
```

Is this change in behavior expected? I wasn't sure whether we've run into a bug or a deliberate change in behavior. A quick look through the release notes didn't turn up anything.
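A possible interim workaround, untested and based only on the command shown above, would be to state the node count explicitly with --nodes (a standard salloc option) so the controller does not have to derive it from the task options; whether this sidesteps the recalculation on 23.02.6 would need to be confirmed:

```sh
# Workaround sketch: request the node count explicitly instead of relying on
# salloc to derive it from --ntasks / --ntasks-per-node.
salloc --nodes=2 --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash
```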