Ticket 10948

Summary: Jobs occasionally get a random number of nodes
Product: Slurm
Reporter: CSC sysadmins <csc-slurm-tickets>
Component: Scheduling
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Priority: ---
CC: bart
Version: 20.02.6
Hardware: Linux
OS: Linux
Site: CSC - IT Center for Science
Attachments: Array job file used for job 5014488
slurm.conf
scheduler log
job_submit.lua
job submit util file

Description CSC sysadmins 2021-02-25 01:07:44 MST
Created attachment 18115 [details]
Array job file used for job 5014488

Hi,

Sometimes jobs fail with the following error:

srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1

srun: error: Unable to create step for job 5014497: More processors requested than permitted

This was an array job in which 8 of 10 steps failed and 2 of 10 ran as they should.

Job 5014488 settings:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6

but the scheduler allocates a random number of nodes/CPUs:

[2021-02-24T18:03:43.950] _slurm_rpc_submit_batch_job: JobId=5014488 InitPrio=537 usec=5301
[2021-02-24T18:03:44.953] sched: Allocate JobId=5014488_1(5014489) NodeList=r13g[07-08] #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.956] sched: Allocate JobId=5014488_2(5014490) NodeList=r04g01,r13g01 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.959] sched: Allocate JobId=5014488_3(5014491) NodeList=r04g04,r13g04 #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.962] sched: Allocate JobId=5014488_4(5014492) NodeList=r04g06,r13g04 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.965] sched: Allocate JobId=5014488_5(5014493) NodeList=r13g07 #CPUs=6 Partition=gpu
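For illustration, the mismatch is easy to spot mechanically: every "sched: Allocate" line carries a #CPUs= field, and any value above the 6 CPUs the job requested is suspect. A hedged sketch (the threshold of 6 and the embedded sample lines, copied from this ticket, are assumptions; point the input at a real slurmctld log instead):

```shell
# Scan "sched: Allocate" lines and flag any allocation whose CPU count
# exceeds the 6 CPUs the job requested (--ntasks=1 --cpus-per-task=6).
awk -F'#CPUs=' '/sched: Allocate/ {
    split($2, a, " ")                 # a[1] holds the CPU count
    if (a[1] + 0 > 6)
        print "over-allocated:", $0
}' <<'EOF'
[2021-02-24T18:03:44.953] sched: Allocate JobId=5014488_1(5014489) NodeList=r13g[07-08] #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.965] sched: Allocate JobId=5014488_5(5014493) NodeList=r13g07 #CPUs=6 Partition=gpu
EOF
```

Here the first sample line is flagged (9 CPUs granted) while the second (exactly 6) is not.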
Comment 1 CSC sysadmins 2021-02-25 01:14:01 MST
Created attachment 18117 [details]
slurm.conf
Comment 2 CSC sysadmins 2021-02-25 01:14:19 MST
Created attachment 18118 [details]
scheduler log
Comment 3 CSC sysadmins 2021-02-25 01:15:40 MST
Created attachment 18119 [details]
job_submit.lua
Comment 4 CSC sysadmins 2021-02-25 01:16:12 MST
Created attachment 18120 [details]
job submit util file
Comment 5 Marcin Stolarek 2021-02-26 02:55:47 MST
Tommi,

You may be hitting an issue related to Bug 10474. Unfortunately, the fix landed only on the 20.11 branch; barring security issues, we won't release another minor version of 20.02.

You can try fixing it locally by backporting 5c7d4471133[1] into your build - it's a simple one-line commit.

Looking forward to hearing back from you.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/5c7d447113310984ca431c782ccd82b2f3200eda
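For sites applying the fix locally, a hedged sketch of one way to backport that commit into a 20.02 source tree before rebuilding (the source directory name and build layout are assumptions; the patch URL is GitHub's standard per-commit patch endpoint):

```shell
# Fetch the upstream fix as a git-format patch and apply it to a local
# 20.02.6 source tree. Adjust the directory name to your build layout.
curl -LO https://github.com/SchedMD/slurm/commit/5c7d447113310984ca431c782ccd82b2f3200eda.patch
cd slurm-20.02.6
patch -p1 < ../5c7d447113310984ca431c782ccd82b2f3200eda.patch
# ...then rebuild and restart slurmctld as usual for your deployment.
```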
Comment 6 CSC sysadmins 2021-02-26 06:02:49 MST
> You may be hitting an issue related to Bug 10474. Unfortunately, the fix
> landed only on the 20.11 branch; barring security issues, we won't release
> another minor version of 20.02.
> 
> You can try fixing it locally by backporting 5c7d4471133[1] into your
> build - it's a simple one-line commit.
> 
> Looking forward to hearing back from you.

Hi,

Indeed, it looks like the issue we're facing; I'll patch our slurmctld next week. The 20.02 situation is quite annoying - sites need to pick up multiple patches like these:

https://bugs.schedmd.com/show_bug.cgi?id=10474 
https://bugs.schedmd.com/show_bug.cgi?id=9670
https://bugs.schedmd.com/show_bug.cgi?id=9724

Best Regards,
Tommi Tervo
Comment 7 Marcin Stolarek 2021-03-01 03:05:48 MST
Tommi,

I'm marking this ticket as a duplicate of Bug 10474.

> 20.02 situation is quite annoying, sites need to pick multiple patches like these:
I understand your frustration and hope to clarify how bug fixes, crashes, and security issues are handled, so that future interactions are less frustrating.
As soon as a new release occurs (every 9 months), the previous version (20.02) enters legacy support. This means that older versions (up to the past two major releases: 20.02 and 19.05) receive only crash and security fixes. There are occasional exceptions, hinging on whether a change can be added without destabilizing that release.

If you still notice the issue with the patch applied, please don't hesitate to reopen.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 10474 ***