| Summary: | Change in behavior with 23.02 and Slurm not assuming >1 nodes for ntasks and ntasks-per-node | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Trey Dockendorf <tdockendorf> |
| Component: | Scheduling | Assignee: | Tyler Connel <tyler> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | troy, tyler |
| Version: | 23.02.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=18217 | | |
| Site: | Ohio State OSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm.conf | | |
Hello Trey,

I suspect the behavior you're seeing is a duplicate of the linked ticket (18217): when --ntasks-per-node is provided, Slurm recalculates the node count and supersedes the value implied by --ntasks. In that ticket the issue was also found on 23.02.

I'll resolve this as a duplicate for now, but please do reach out if the fix provided through 18217 does not resolve your issue as well.

Best,
Tyler Connel

*** This ticket has been marked as a duplicate of ticket 18217 ***
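For reference, the node count the reporter expects follows from simple ceiling division: at most --ntasks-per-node tasks fit on each node, so --ntasks=4 with --ntasks-per-node=2 implies at least 2 nodes, which would satisfy the partition's MinNodes=2. A minimal sketch of that arithmetic (illustrative only, not Slurm source):

```sh
# Sketch: minimum node count implied by a task layout.
# ceil(ntasks / ntasks_per_node) via integer arithmetic.
ntasks=4
ntasks_per_node=2
min_nodes=$(( (ntasks + ntasks_per_node - 1) / ntasks_per_node ))
echo "minimum nodes implied: ${min_nodes}"   # prints 2
```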
Created attachment 33404 [details]
slurm.conf

We are testing 23.02.6 after running 22.05.x for a while and have noticed that "--ntasks=4 --ntasks-per-node=2" no longer requests a 2-node job, which is causing issues on partitions where we have MinNodes=2.

Example:

```
$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash
salloc: error: Job submit/allocate failed: Node count specification invalid
```

The debug log:

```
Nov 21 10:33:18 owens-slurm01-test slurmctld[82880]: debug2: _part_access_check: Job requested for nodes (1) smaller than partition parallel(2) min nodes
```

The partition:

```
PartitionName=parallel DefaultTime=01:00:00 DefMemPerCPU=4315 DenyAccounts=<OMIT LONG LIST> MaxCPUsPerNode=28 MaxMemPerCPU=4315 MaxNodes=81 MaxTime=4-00:00:00 MinNodes=2 Nodes=cpu OverSubscribe=EXCLUSIVE PriorityJobFactor=2000 State=UP
```

Is this change in behavior expected? I wasn't sure whether we've run into a bug or a deliberate change in behavior. A quick look through the release notes didn't turn up anything.
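A possible interim workaround, untested and based only on the command shown above, would be to state the node count explicitly with --nodes (a standard salloc option) so the controller does not have to derive it from the task options; whether this sidesteps the recalculation on 23.02.6 would need to be confirmed:

```sh
# Workaround sketch: request the node count explicitly instead of relying on
# salloc to derive it from --ntasks / --ntasks-per-node.
salloc --nodes=2 --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash
```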