Problem Description: Nodes sometimes finish running jobs but their TRES remain allocated. Restarting slurmd, slurmstepd, or the whole compute node does not resolve the issue; restarting slurmctld does, but that is too big a hammer to use routinely whenever this occurs.

This is what such a node looks like:

(screen) [23-10-15 15:13:08] root@ncn-m001: /home/juhaj/scsd
# scontrol show node nid006612
NodeName=nid006612 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUEfctv=112 CPUTot=128 CPULoad=0.07
   AvailableFeatures=AMD_EPYC_7A53,x1301
   ActiveFeatures=AMD_EPYC_7A53,x1301
   Gres=gpu:mi250:8(S:0)
   NodeAddr=nid006612 NodeHostName=nid006612 Version=22.05.8
   OS=Linux 5.14.21-150400.24.46_12.0.73-cray_shasta_c #1 SMP Thu May 18 23:03:34 UTC 2023 (9c4698c)
   RealMemory=491520 AllocMem=0 FreeMem=489214 Sockets=8 Boards=1
   CoreSpecCount=8 CPUSpecList=0-1,16-17,32-33,48-49,64-65,80-81,96-97,112-113
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=standard-g,bardpeak,bench
   BootTime=2023-08-22T02:05:59 SlurmdStartTime=2023-10-15T18:01:33
   LastBusyTime=2023-10-15T18:13:08
   CfgTRES=cpu=112,mem=480G,billing=112,gres/gpu:mi250=8
   AllocTRES=gres/gpu:mi250=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Note that the node is IDLE yet AllocTRES still shows gres/gpu:mi250=8. But curiously, it looks like slurmstepd has not reported that the job has finished:

nid006612:~ # systemctl status slurmstepd.scope
● slurmstepd.scope
     Loaded: loaded (/run/systemd/transient/slurmstepd.scope; transient)
  Transient: yes
     Active: active (abandoned) since Tue 2023-08-22 02:23:35 EEST; 1 month 24 days ago
      Tasks: 1
        CPU: 1y 10month 2w 3d 17h 52min 10.441s
     CGroup: /system.slice/slurmstepd.scope
             └─system
               └─123291 /usr/sbin/slurmstepd infinity

Oct 13 22:50:33 nid006612 slurmstepd[82499]: task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:50:33 nid006612 slurmstepd[82499]: task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:50:39 nid006612 slurmstepd[82499]: get_exit_code task 0 died by signal: 15
Oct 13 22:50:42 nid006612 slurmstepd[82499]: done with job
Oct 13 22:56:57 nid006612 slurmstepd[84884]: task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:56:57 nid006612 slurmstepd[84884]: task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:56:58 nid006612 slurmstepd[84884]: get_exit_code task 0 died by signal: 15
Oct 13 22:57:01 nid006612 slurmstepd[84884]: done with job
Oct 13 23:02:02 nid006612 slurmstepd[87015]: task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 23:02:02 nid006612 slurmstepd[87015]: task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
nid006612:~ #
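A quick way to sweep a cluster for nodes in this state is to look for the contradiction itself: a node reporting IDLE while its AllocTRES is non-empty. The sketch below is our own, not from the ticket; it runs the filter over canned sample lines modeled on the transcript above. On a live system you would feed it from `scontrol show node --oneliner` instead of the printf.

```shell
# Sketch (assumption, not from the ticket): flag nodes that are IDLE yet
# still show allocated TRES. On a real cluster, replace the printf with:
#   scontrol show node --oneliner | awk '...'
printf '%s\n' \
  'NodeName=nid006612 State=IDLE+RESERVED AllocTRES=gres/gpu:mi250=8' \
  'NodeName=nid000001 State=IDLE AllocTRES= CfgTRES=cpu=112' |
awk '/State=IDLE/ && /AllocTRES=[^ ]/ {print $1}'
# prints: NodeName=nid006612
```

The `AllocTRES=[^ ]` pattern only matches when something immediately follows the equals sign, so healthy idle nodes with an empty AllocTRES field are skipped.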
Brian, I believe that this issue has been fixed in 23.02.2 with bug 16121. -Scott *** This ticket has been marked as a duplicate of ticket 16121 ***
Can the fix be applied to 22.05?
Brian, We only push bug fixes to the current major version of Slurm. Would applying commit 5c3a7f6aaf work on 22.05? Probably. But it would be an unsupported install; upgrading to 23.02 is our official recommendation. -Scott
From the customer Slurm expert: I just read commit 5c3a7f6aaf. It looks like we have a workaround available: turn off requeueing of jobs (set JobRequeue=0 and use a submit plugin to remove any requeue options from submissions - not sure if this is enough to stop "scontrol update job=xxx requeue=1" though). Can you confirm that this would work as a workaround?
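For the record, the workaround described above would look roughly like this. This is a sketch based on our reading of the slurm.conf and job_submit/lua documentation, not configuration from the ticket; verify the parameter names against your Slurm version before deploying, and note that (as discussed in this ticket) scontrol may still be able to bypass it.

```
# slurm.conf (sketch): default new jobs to no-requeue and enable the
# Lua submit plugin
JobRequeue=0
JobSubmitPlugins=lua

-- job_submit.lua (sketch): force requeue off at submit and modify time
function slurm_job_submit(job_desc, part_list, submit_uid)
    job_desc.requeue = 0
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    job_desc.requeue = 0
    return slurm.SUCCESS
end
```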
Brian, I think that workaround would work, but as you surmised, scontrol could bypass it. I can't think of any way to completely forbid requeue right now. But if that is your best option, I would rather you just apply 5c3a7f6aaf to 22.05. 5c3a7f6aaf is a very simple commit that doesn't rely on any recent changes, so I am quite confident it would work. -Scott
I took a look, and we did rename gres_ctld_job_clear() in 23.02, so the diff should look like this for you:

diff --git a/src/slurmctld/node_scheduler.c b/src/slurmctld/node_scheduler.c
index 60abfb9137..c268aad893 100644
--- a/src/slurmctld/node_scheduler.c
+++ b/src/slurmctld/node_scheduler.c
@@ -2228,7 +2228,7 @@ static void _end_null_job(job_record_t *job_ptr)
 	job_ptr->exit_code = 0;
 	gres_ctld_job_clear(job_ptr->gres_list_req);
 	gres_ctld_job_clear(job_ptr->gres_list_req_accum);
-	gres_ctld_job_clear(job_ptr->gres_list_alloc);
+	FREE_NULL_LIST(job_ptr->gres_list_alloc);
 	job_ptr->job_state = JOB_RUNNING;
 	job_ptr->bit_flags |= JOB_WAS_RUNNING;
 	FREE_NULL_BITMAP(job_ptr->node_bitmap);
@@ -2652,7 +2652,7 @@ extern int select_nodes(job_record_t *job_ptr, bool test_only,
 	job_ptr->exit_code = 0;
 	gres_ctld_job_clear(job_ptr->gres_list_req);
 	gres_ctld_job_clear(job_ptr->gres_list_req_accum);
-	gres_ctld_job_clear(job_ptr->gres_list_alloc);
+	FREE_NULL_LIST(job_ptr->gres_list_alloc);
 	if (!job_ptr->step_list)
 		job_ptr->step_list = list_create(free_step_record);
Thanks. Changing the version of Slurm at this point isn't an option.
This is from the user in response to a question of why can't they test the controller restart. "There is no opportunity to test it: we need the system 100% available from Friday evening, hence the urgency of this case and we have no reproducer for the issue. (Without the urgency, this would be a minor issue and no rush: if we lose a few dozen nodes a week and need to restart slurmctld during office hours, that's fine - it's just the critical acceptance weekend which is more complicated.) Also, as I mentioned in my previous comment, we'll go with the workaround untested, would just like to have a confirmation from SchedMD that we've understood the issue correctly and the workaround will indeed protect us. If not, it'll be a potential slurmctld restart in the middle of the night, which no one is comfortable with but we will do it - and ring the hotline, waking everyone up if the ctld does not come up as expected (I'm pretty confident it will, but it hasn't always done so)."
Brian, I guess, for now, stopping jobs from requeuing is probably their best option. Let me know if they have other questions on this topic. -Scott
Thanks. The customer got through the testing, so this bug can be closed now.
Closing *** This ticket has been marked as a duplicate of ticket 16121 ***