Ticket 17917

Summary: [CAST-34339] Slurmctld ghost AllocTRES
Product: Slurm Reporter: Brian F Gilmer <brian.gilmer>
Component: GPU Assignee: Scott Hilton <scott>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 22.05.9   
Hardware: Cray Shasta   
OS: Linux   
Site: CRAY Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: CSC COMPUTER SCIENCES LTD DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Brian F Gilmer 2023-10-16 10:17:12 MDT
Problem Description: Nodes sometimes finish running jobs, but their TRES remain allocated. Restarting slurmd, slurmstepd, or even the whole compute node does not resolve the issue; restarting slurmctld does, but that is too big a hammer to use routinely every time this occurs.

This is what such a node looks like:

(screen) [23-10-15 15:13:08] root@ncn-m001: /home/juhaj/scsd # scontrol show node nid006612
NodeName=nid006612 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUEfctv=112 CPUTot=128 CPULoad=0.07
   AvailableFeatures=AMD_EPYC_7A53,x1301
   ActiveFeatures=AMD_EPYC_7A53,x1301
   Gres=gpu:mi250:8(S:0)
   NodeAddr=nid006612 NodeHostName=nid006612 Version=22.05.8
   OS=Linux 5.14.21-150400.24.46_12.0.73-cray_shasta_c #1 SMP Thu May 18 23:03:34 UTC 2023 (9c4698c)
   RealMemory=491520 AllocMem=0 FreeMem=489214 Sockets=8 Boards=1
   CoreSpecCount=8 CPUSpecList=0-1,16-17,32-33,48-49,64-65,80-81,96-97,112-113
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=standard-g,bardpeak,bench
   BootTime=2023-08-22T02:05:59 SlurmdStartTime=2023-10-15T18:01:33
   LastBusyTime=2023-10-15T18:13:08
   CfgTRES=cpu=112,mem=480G,billing=112,gres/gpu:mi250=8
   AllocTRES=gres/gpu:mi250=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

but curiously, it looks like slurmstepd has not reported that the job has finished:

nid006612:~ # systemctl status slurmstepd.scope
● slurmstepd.scope
     Loaded: loaded (/run/systemd/transient/slurmstepd.scope; transient)
  Transient: yes
     Active: active (abandoned) since Tue 2023-08-22 02:23:35 EEST; 1 month 24 days ago
      Tasks: 1
        CPU: 1y 10month 2w 3d 17h 52min 10.441s
     CGroup: /system.slice/slurmstepd.scope
             └─system
               └─ 123291 /usr/sbin/slurmstepd infinity

Oct 13 22:50:33 nid006612 slurmstepd[82499]: task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:50:33 nid006612 slurmstepd[82499]: task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:50:39 nid006612 slurmstepd[82499]: get_exit_code task 0 died by signal: 15
Oct 13 22:50:42 nid006612 slurmstepd[82499]: done with job
Oct 13 22:56:57 nid006612 slurmstepd[84884]: task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:56:57 nid006612 slurmstepd[84884]: task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 22:56:58 nid006612 slurmstepd[84884]: get_exit_code task 0 died by signal: 15
Oct 13 22:57:01 nid006612 slurmstepd[84884]: done with job
Oct 13 23:02:02 nid006612 slurmstepd[87015]: task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
Oct 13 23:02:02 nid006612 slurmstepd[87015]: task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=515269MB memsw.limit=unlimited
nid006612:~ #
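A quick way to spot such "ghost" nodes is to scan `scontrol show node` output for nodes whose AllocTRES is still set while CPUAlloc is zero (a sketch based on the field names shown above; since jobs always allocate CPUs, that combination should only appear when TRES are stuck):

```shell
# ghost_nodes: read `scontrol show node` output on stdin and print any node
# that still reports a non-empty AllocTRES while CPUAlloc=0 (sketch; field
# names assumed from the output shown above).
ghost_nodes() {
    awk '
        /^ *NodeName=/ { sub(/^NodeName=/, "", $1); node = $1; idle = 0 }
        /CPUAlloc=0 /  { idle = 1 }
        /AllocTRES=./  { if (idle) { sub(/^ */, ""); print node ": " $0 } }
    '
}

# Typical use on a live system:
#   scontrol show node | ghost_nodes
```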
Problem Analysis: (same `scontrol show node` and slurmstepd output as in the description above)
Comment 1 Scott Hilton 2023-10-16 13:46:17 MDT
Brian,

I believe that this issue has been fixed in 23.02.2 with bug 16121.

-Scott

*** This ticket has been marked as a duplicate of ticket 16121 ***
Comment 2 Brian F Gilmer 2023-10-16 13:48:48 MDT
Can the fix be applied to 22.05?
Comment 3 Scott Hilton 2023-10-16 14:00:13 MDT
Brian,

We only push bug fixes to the current major version of Slurm.

Would applying commit 5c3a7f6aaf work on 22.05? Probably. But it would be an unsupported install. Upgrading to 23.02 is our official recommendation. 

-Scott
Comment 4 Brian F Gilmer 2023-10-17 07:13:52 MDT
From the customer Slurm expert:
 
I just read the commit 5c3a7f6aaf.
Looks like we have a workaround available: turn off job requeueing (set JobRequeue=0 and use a submit plugin to remove any requeue options from submissions - not sure if this is enough to stop "scontrol update job=xxx requeue=1" though).

Can we confirm that this would work as a workaround?
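For concreteness, that workaround might look like the following in slurm.conf (a sketch: JobRequeue and JobSubmitPlugins are standard parameters, but the plugin body that strips user-supplied requeue flags is site-specific and not shown):

```
# slurm.conf (sketch) - disable automatic requeue by default
JobRequeue=0
# Load a job_submit plugin so any --requeue option can be cleared at
# submission time (plugin implementation is site-specific).
JobSubmitPlugins=lua
```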
Comment 5 Scott Hilton 2023-10-18 10:16:18 MDT
Brian,

I think that workaround would work, as you surmised, but scontrol could bypass it. I can't think of any way to completely forbid requeue right now.

But if that is your best option, I would rather you just apply 5c3a7f6aaf to 22.05. It is a very simple commit that doesn't rely on any recent changes, and I am quite confident it should work.

-Scott
Comment 6 Scott Hilton 2023-10-18 10:18:46 MDT
I took a look and we did rename gres_ctld_job_clear() in 23.02. So the diff should look like this for you.

diff --git a/src/slurmctld/node_scheduler.c b/src/slurmctld/node_scheduler.c
index 60abfb9137..c268aad893 100644
--- a/src/slurmctld/node_scheduler.c
+++ b/src/slurmctld/node_scheduler.c
@@ -2228,7 +2228,7 @@ static void _end_null_job(job_record_t *job_ptr)
        job_ptr->exit_code = 0;
        gres_ctld_job_clear(job_ptr->gres_list_req);
        gres_ctld_job_clear(job_ptr->gres_list_req_accum);
-       gres_ctld_job_clear(job_ptr->gres_list_alloc);
+       FREE_NULL_LIST(job_ptr->gres_list_alloc);
        job_ptr->job_state = JOB_RUNNING;
        job_ptr->bit_flags |= JOB_WAS_RUNNING;
        FREE_NULL_BITMAP(job_ptr->node_bitmap);
@@ -2652,7 +2652,7 @@ extern int select_nodes(job_record_t *job_ptr, bool test_only,
        job_ptr->exit_code = 0;
        gres_ctld_job_clear(job_ptr->gres_list_req);
        gres_ctld_job_clear(job_ptr->gres_list_req_accum);
-       gres_ctld_job_clear(job_ptr->gres_list_alloc);
+       FREE_NULL_LIST(job_ptr->gres_list_alloc);
        if (!job_ptr->step_list)
                job_ptr->step_list = list_create(free_step_record);
Comment 7 Brian F Gilmer 2023-10-18 10:31:56 MDT
Thanks,

Changing the version of Slurm at this point isn't an option.
Comment 8 Brian F Gilmer 2023-10-18 10:33:47 MDT
This is from the user in response to a question of why can't they test the controller restart.

"There is no opportunity to test it: we need the system 100% available from Friday evening, hence the urgency of this case and we have no reproducer for the issue. (Without the urgency, this would be a minor issue and no rush: if we lose a few dozen nodes a week and need to restart slurmctld during office hours, that's fine - it's just the critical acceptance weekend which is more complicated.)

Also, as I mentioned in my previous comment, we'll go with the workaround untested, would just like to have a confirmation from SchedMD that we've understood the issue correctly and the workaround will indeed protect us. If not, it'll be a potential slurmctld restart in the middle of the night, which no one is comfortable with but we will do it - and ring the hotline, waking everyone up if the ctld does not come up as expected (I'm pretty confident it will, but it hasn't always done so)."
Comment 9 Scott Hilton 2023-10-23 15:38:35 MDT
Brian,

I guess, for now, stopping jobs from requeuing is probably their best option. Let me know if they have other questions on this topic.

-Scott
Comment 10 Brian F Gilmer 2023-10-24 06:50:19 MDT
Thanks, 

The customer got through the testing, so this bug can be closed now.
Comment 11 Scott Hilton 2023-10-24 09:16:21 MDT
Closing

*** This ticket has been marked as a duplicate of ticket 16121 ***