Ticket 10948 - Jobs occasionally get a random number of nodes
Summary: Jobs occasionally get a random number of nodes
Status: RESOLVED DUPLICATE of ticket 10474
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
 
Reported: 2021-02-25 01:07 MST by CSC sysadmins
Modified: 2021-03-01 03:05 MST

See Also:
Site: CSC - IT Center for Science


Attachments
Array job file used for job 5014488 (545 bytes, application/x-shellscript)
2021-02-25 01:07 MST, CSC sysadmins
Details
slurm.conf (11.81 KB, text/plain)
2021-02-25 01:14 MST, CSC sysadmins
Details
scheduler log (6.90 KB, text/x-log)
2021-02-25 01:14 MST, CSC sysadmins
Details
job_submit.lua (1.87 KB, text/x-lua)
2021-02-25 01:15 MST, CSC sysadmins
Details
job submit util file (4.09 KB, text/x-lua)
2021-02-25 01:16 MST, CSC sysadmins
Details

Description CSC sysadmins 2021-02-25 01:07:44 MST
Created attachment 18115 [details]
Array job file used for job 5014488

Hi,

Sometimes jobs fail with the following errors:

srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1

srun: error: Unable to create step for job 5014497: More processors requested than permitted

This was an array job where 8/10 tasks failed and 2/10 ran as they should.

Job 5014488 settings:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6

but the scheduler allocates a random number of nodes/CPUs:

[2021-02-24T18:03:43.950] _slurm_rpc_submit_batch_job: JobId=5014488 InitPrio=537 usec=5301
[2021-02-24T18:03:44.953] sched: Allocate JobId=5014488_1(5014489) NodeList=r13g[07-08] #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.956] sched: Allocate JobId=5014488_2(5014490) NodeList=r04g01,r13g01 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.959] sched: Allocate JobId=5014488_3(5014491) NodeList=r04g04,r13g04 #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.962] sched: Allocate JobId=5014488_4(5014492) NodeList=r04g06,r13g04 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.965] sched: Allocate JobId=5014488_5(5014493) NodeList=r13g07 #CPUs=6 Partition=gpu
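For context, a minimal array-job skeleton matching the settings reported above might look like this (a hypothetical sketch based on the reported #SBATCH options and the gpu partition seen in the log, not a reconstruction of attachment 18115):

```shell
#!/bin/bash
#SBATCH --ntasks=1          # one task per array element
#SBATCH --cpus-per-task=6   # six CPUs for that single task
#SBATCH --partition=gpu     # partition seen in the scheduler log
#SBATCH --array=1-10        # ten array tasks (8/10 reportedly failed)

# With --ntasks=1, each step should run on exactly one node;
# allocations spanning two nodes trigger the srun error above.
srun my_program             # placeholder for the actual workload
```

With these settings the expected allocation is one node and six CPUs per array task, which is what tasks _5 (#CPUs=6, single node) received; the two-node allocations for tasks _1 through _4 are the anomaly.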
Comment 1 CSC sysadmins 2021-02-25 01:14:01 MST
Created attachment 18117 [details]
slurm.conf
Comment 2 CSC sysadmins 2021-02-25 01:14:19 MST
Created attachment 18118 [details]
scheduler log
Comment 3 CSC sysadmins 2021-02-25 01:15:40 MST
Created attachment 18119 [details]
job_submit.lua
Comment 4 CSC sysadmins 2021-02-25 01:16:12 MST
Created attachment 18120 [details]
job submit util file
Comment 5 Marcin Stolarek 2021-02-26 02:55:47 MST
Tommi,

You may be hitting an issue related to Bug 10474. Unfortunately, this was fixed on the 20.11 branch only; barring security issues, we won't release another minor version of 20.02.

You can try fixing it locally by backporting 5c7d4471133[1] to your build - it's an easy one-line commit.

Looking forward to hearing back from you.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/5c7d447113310984ca431c782ccd82b2f3200eda
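The backport could be sketched roughly as follows (an assumed workflow: the commit hash comes from the link above, and the `slurm-20.02` branch name and rebuild steps are assumptions about the site's build process):

```shell
# Check out the 20.02 source tree and cherry-pick the one-line fix
# referenced in comment 5, then rebuild slurmctld.
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-20.02
git cherry-pick 5c7d447113310984ca431c782ccd82b2f3200eda

# Rebuild and reinstall as usual for a local build (paths are
# site-specific placeholders).
./configure --prefix=/opt/slurm && make -j && make install
```

Only slurmctld needs to be restarted for a scheduler-side fix like this; compute nodes can keep running the stock 20.02.6 daemons.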
Comment 6 CSC sysadmins 2021-02-26 06:02:49 MST
> You may be hitting an issue related to Bug 10474. Unfortunately, this was
> fixed on the 20.11 branch only; barring security issues, we won't release
> another minor version of 20.02.
> 
> You can try fixing it locally by backporting 5c7d4471133[1] to your build -
> it's an easy one-line commit.
> 
> Looking forward to hearing back from you.

Hi,

Indeed, it looks like the issue we're facing; I'll patch our slurmctld next week. The 20.02 situation is quite annoying: sites need to pick multiple patches like these:

https://bugs.schedmd.com/show_bug.cgi?id=10474 
https://bugs.schedmd.com/show_bug.cgi?id=9670
https://bugs.schedmd.com/show_bug.cgi?id=9724

Best Regards,
Tommi Tervo
Comment 7 Marcin Stolarek 2021-03-01 03:05:48 MST
Tommi,

 I'm marking this case as a duplicate of bug#10474. 

> The 20.02 situation is quite annoying: sites need to pick multiple patches like these:
I understand your frustration, and I hope that clarifying how bug, security, and crash fixes are handled will make future interactions less frustrating.
As soon as a new release occurs (every 9 months), the previous version (20.02) enters legacy support. This means that the older versions (up to the past two major releases, i.e. 20.02 and 19.05) will only receive crash and security fixes, with some exceptions hinging on the feasibility of adding those changes without breaking that release.

If you notice the issue again with the patch applied, please don't hesitate to reopen.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 10474 ***