| Summary: | Jobs will get random amount of nodes occasionally | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | CSC sysadmins <csc-slurm-tickets> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bart |
| Version: | 20.02.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CSC - IT Center for Science | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Array job file used for job 5014488, slurm.conf, scheduler log, job_submit.lua, job submit util file | | |
Created attachment 18117 [details]
slurm.conf
Created attachment 18118 [details]
scheduler log
Created attachment 18119 [details]
job_submit.lua
Created attachment 18120 [details]
job submit util file
Tommi,

You may be hitting an issue related to Bug 10474. Unfortunately, the fix landed on the 20.11 branch only, since, barring security issues, we won't release another minor version of 20.02.

You can try fixing it locally by backporting 5c7d4471133[1] to your build - it's an easy one-line commit.

Looking forward to hearing back from you.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/5c7d447113310984ca431c782ccd82b2f3200eda

> You may be hitting an issue related to Bug 10474. Unfortunately, this got
> fixed on 20.11 branch only, since barring security issues we won't release
> the next minor version of 20.02.
>
> You can try fixing it locally backporting: 5c7d4471133[1] to your build -
> it's an easy one-line commit.
>
> Looking forward to hearing back from you.

Hi,

Indeed, it looks like the issue we're facing; I'll patch our slurmctld next week. The 20.02 situation is quite annoying - sites need to pick multiple patches like these:

https://bugs.schedmd.com/show_bug.cgi?id=10474
https://bugs.schedmd.com/show_bug.cgi?id=9670
https://bugs.schedmd.com/show_bug.cgi?id=9724

Best Regards,
Tommi Tervo

Tommi,

I'm marking this case as a duplicate of bug#10474.

> 20.02 situation is quite annoying, sites need to pick multiple patches like these:

I understand your frustration, and I hope to clarify how bug fixes, security issues, and crashes are handled, and to make future interactions less frustrating. As soon as a new release occurs (every 9 months), the previous version (20.02) enters legacy support. This means that older versions (up to the past two major releases: 20.02 and 19.05) will only receive crash and security fixes, with some exceptions hinging on the feasibility of adding those changes without breaking that release.

If you still notice the issue with the patch applied, please don't hesitate to reopen.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 10474 ***
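For sites following the suggested route, the backport amounts to exporting the linked commit as a unified diff and applying it at the top of the 20.02.6 source tree before rebuilding slurmctld. A minimal sketch of the mechanics using a dummy tree (the file names and patch contents below are illustrative stand-ins, not the real fix; for the real patch, GitHub serves the commit as a diff when ".patch" is appended to the commit URL):

```shell
# Stand-in demonstration of applying a one-line patch with patch -p1.
# For the actual backport you would fetch the linked commit as a .patch
# file and apply it from the top of the slurm-20.02.6 source tree.
set -e
mkdir -p demo/src
printf 'int fix = 0;\n' > demo/src/sched.c

# A one-line unified diff, analogous in shape to the upstream commit:
cat > demo/fix.patch <<'EOF'
--- a/src/sched.c
+++ b/src/sched.c
@@ -1 +1 @@
-int fix = 0;
+int fix = 1;
EOF

# -p1 strips the leading a/ and b/ path components; -d runs in the tree.
patch -p1 -d demo < demo/fix.patch
grep 'int fix = 1;' demo/src/sched.c
```

After the real patch applies cleanly, the usual rebuild and a slurmctld restart would pick up the change.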
Created attachment 18115 [details]
Array job file used for job 5014488

Hi,

Sometimes jobs fail with the following error:

srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: error: Unable to create step for job 5014497: More processors requested than permitted

This was an array job where 8 of 10 tasks failed and 2 of 10 ran as they should. Job 5014488 settings:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6

but the scheduler allocates a random number of nodes/CPUs:

[2021-02-24T18:03:43.950] _slurm_rpc_submit_batch_job: JobId=5014488 InitPrio=537 usec=5301
[2021-02-24T18:03:44.953] sched: Allocate JobId=5014488_1(5014489) NodeList=r13g[07-08] #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.956] sched: Allocate JobId=5014488_2(5014490) NodeList=r04g01,r13g01 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.959] sched: Allocate JobId=5014488_3(5014491) NodeList=r04g04,r13g04 #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.962] sched: Allocate JobId=5014488_4(5014492) NodeList=r04g06,r13g04 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.965] sched: Allocate JobId=5014488_5(5014493) NodeList=r13g07 #CPUs=6 Partition=gpu
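The mismatch is visible directly in the scheduler log excerpt: a job that asked for one task gets NodeLists naming two nodes. A quick sketch for spotting affected array tasks in a slurmctld log (the log file name and the exact "sched: Allocate ... NodeList=..." line format are assumptions based on the excerpt; here the excerpt itself is used as sample input):

```shell
# Flag array-task allocations that span more than one node, assuming the
# "sched: Allocate ... NodeList=..." log format shown in the excerpt above.
cat > sched.log <<'EOF'
[2021-02-24T18:03:44.953] sched: Allocate JobId=5014488_1(5014489) NodeList=r13g[07-08] #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.956] sched: Allocate JobId=5014488_2(5014490) NodeList=r04g01,r13g01 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.959] sched: Allocate JobId=5014488_3(5014491) NodeList=r04g04,r13g04 #CPUs=9 Partition=gpu
[2021-02-24T18:03:44.962] sched: Allocate JobId=5014488_4(5014492) NodeList=r04g06,r13g04 #CPUs=8 Partition=gpu
[2021-02-24T18:03:44.965] sched: Allocate JobId=5014488_5(5014493) NodeList=r13g07 #CPUs=6 Partition=gpu
EOF

awk '/sched: Allocate JobId=5014488_/ {
    split($0, a, "NodeList=")   # a[2] = everything after NodeList=
    split(a[2], b, " ")         # b[1] = the node list itself
    # A comma-separated list or a bracketed range means multiple nodes here.
    if (b[1] ~ /,/ || b[1] ~ /\[/)
        print "multi-node:", b[1]
}' sched.log > multi.txt
cat multi.txt
```

On the excerpt this flags four of the five array tasks; only task 5 (r13g07, 6 CPUs) received the single node the job actually requested.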