Created attachment 18115 [details]
Array job file used for job 5014488

Hi,

Sometimes jobs fail with the following error:

srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: error: Unable to create step for job 5014497: More processors requested than permitted

This was an array job in which 8 of 10 array tasks failed and 2 of 10 ran as they should.

Job 5014488 settings:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6

but the scheduler allocates a seemingly random number of nodes/CPUs:

[2021-02-24T18:03:43.950] _slurm_rpc_submit_batch_job: JobId=5014488 InitPrio=537 usec=5301
[2021-02-24T18:03:44.953] sched: Allocate JobId=5014488_1(5014489) NodeList=r13g[07-08] #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.956] sched: Allocate JobId=5014488_2(5014490) NodeList=r04g01,r13g01 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.959] sched: Allocate JobId=5014488_3(5014491) NodeList=r04g04,r13g04 #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.962] sched: Allocate JobId=5014488_4(5014492) NodeList=r04g06,r13g04 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.965] sched: Allocate JobId=5014488_5(5014493) NodeList=r13g07 #CPUs=6 Partition=gpu
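For anyone triaging the same symptom, here is a minimal sketch (not part of the original report) that scans slurmctld log lines for allocations whose CPU count differs from what the job requested. The regular expression is based only on the "sched: Allocate" lines quoted above, and the function name find_overallocated and the SAMPLE list are hypothetical names chosen for illustration; other Slurm versions or log settings may format these lines differently.

```python
import re

# Matches "sched: Allocate" lines in the format seen in the log excerpt above.
# Assumption: other Slurm versions may format this line differently.
ALLOC_RE = re.compile(
    r"sched: Allocate JobId=(?P<job>\S+) NodeList=(?P<nodes>\S+) #CPUs=(?P<cpus>\d+)"
)


def find_overallocated(log_lines, expected_cpus):
    """Return (job, nodelist, cpus) for allocations whose CPU count != expected_cpus."""
    hits = []
    for line in log_lines:
        match = ALLOC_RE.search(line)
        if match and int(match.group("cpus")) != expected_cpus:
            hits.append(
                (match.group("job"), match.group("nodes"), int(match.group("cpus")))
            )
    return hits


# Log lines copied from the report above.
SAMPLE = [
    "[2021-02-24T18:03:43.950] _slurm_rpc_submit_batch_job: JobId=5014488 InitPrio=537 usec=5301",
    "[2021-02-24T18:03:44.953] sched: Allocate JobId=5014488_1(5014489) NodeList=r13g[07-08] #CPUs=9 Partition=gpu",
    "[2021-02-24T18:03:44.956] sched: Allocate JobId=5014488_2(5014490) NodeList=r04g01,r13g01 #CPUs=8 Partition=gpu",
    "[2021-02-24T18:03:44.959] sched: Allocate JobId=5014488_3(5014491) NodeList=r04g04,r13g04 #CPUs=9 Partition=gpu",
    "[2021-02-24T18:03:44.962] sched: Allocate JobId=5014488_4(5014492) NodeList=r04g06,r13g04 #CPUs=8 Partition=gpu",
    "[2021-02-24T18:03:44.965] sched: Allocate JobId=5014488_5(5014493) NodeList=r13g07 #CPUs=6 Partition=gpu",
]

if __name__ == "__main__":
    # The job requested --ntasks=1 --cpus-per-task=6, so 6 CPUs are expected.
    for job, nodes, cpus in find_overallocated(SAMPLE, expected_cpus=6):
        print(f"{job}: {cpus} CPUs on {nodes}")
```

Run against the full scheduler log, this would list every array task that received an allocation other than the 6 CPUs requested, which makes it easier to match failed tasks to oversized allocations.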
Created attachment 18117 [details] slurm.conf
Created attachment 18118 [details] scheduler log
Created attachment 18119 [details] job_submit.lua
Created attachment 18120 [details] job submit util file
Tommi,

You may be hitting an issue related to Bug 10474. Unfortunately, this was fixed on the 20.11 branch only since, barring security issues, we won't release another minor version of 20.02.

You can try fixing it locally by backporting commit 5c7d4471133[1] to your build; it's an easy one-line commit.

Looking forward to hearing back from you.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/5c7d447113310984ca431c782ccd82b2f3200eda
> You may be hitting an issue related to Bug 10474. Unfortunately, this got
> fixed on 20.11 branch only, since barring security issues we won't release
> the next minor version of 20.02.
>
> You can try fixing it locally backporting: 5c7d4471133[1] to your build -
> it's an easy one-line commit.
>
> Looking forward to hearing back from you.

Hi,

Indeed, it looks like the issue we're facing. I'll patch our slurmctld next week.

The 20.02 situation is quite annoying; sites need to pick multiple patches like these:
https://bugs.schedmd.com/show_bug.cgi?id=10474
https://bugs.schedmd.com/show_bug.cgi?id=9670
https://bugs.schedmd.com/show_bug.cgi?id=9724

Best Regards,
Tommi Tervo
Tommi,

I'm marking this case as a duplicate of bug#10474.

> 20.02 situation is quite annoying, sites need to pick multiple patches like these:

I understand your frustration, and I hope to clarify how bug fixes, security issues, and crashes are handled. I also hope to make future interactions less frustrating.

As soon as a new release occurs (every 9 months), the previous version (20.02) enters legacy support. This means that older versions (the past two major releases, 20.02 and 19.05) receive only crash and security fixes. There are some exceptions, which hinge on the feasibility of adding those changes without breaking that release.

If you still notice the issue with the patch applied, please don't hesitate to reopen.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 10474 ***