Ticket 13600

Summary: GPUs listed as allocated despite no jobs running on nodes
Product: Slurm Reporter: Steve Ford <fordste5>
Component: Scheduling Assignee: Director of Support <support>
Status: RESOLVED CANNOTREPRODUCE
Severity: 4 - Minor Issue
Version: 21.08.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=13215
Site: MSU
Attachments: Configuration files and logs
lac-087 log files

Description Steve Ford 2022-03-10 13:23:56 MST
Hello SchedMD,

Today we noticed a strange issue with some of our GPU nodes. The nodes are listed as idle, but Slurm reports their GPUs as allocated.

Here is sinfo output for nodes in our GPU partition:

$ sinfo -p general-long-gpu -O nodehost,statelong,gres,gresused
HOSTNAMES           STATE               GRES                GRES_USED           
csn-021             mixed               gpu:k20:2(S:0-1)    gpu:k20:2(IDX:0-1)  
csn-022             mixed               gpu:k20:2(S:0-1)    gpu:k20:2(IDX:0-1)  
csn-023             mixed               gpu:k20:2(S:0-1)    gpu:k20:2(IDX:0-1)  
lac-030             mixed               gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-195             mixed               gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-290             mixed               gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-291             mixed               gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-293             mixed               gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
nvf-018             mixed               gpu:v100s:4(S:0-1)  gpu:v100s:4(IDX:0-3)
nvf-019             mixed               gpu:v100s:4(S:0-1)  gpu:v100s:1(IDX:1)  
nvf-020             mixed               gpu:v100s:4(S:0-1)  gpu:v100s:4(IDX:0-3)
nvl-005             mixed               gpu:v100:8(S:0-1)   gpu:v100:6(IDX:0-2,4
nvl-006             mixed               gpu:v100:8(S:0-1)   gpu:v100:8(IDX:0-7) 
csn-001             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-002             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-003             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-004             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-005             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-006             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-007             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-008             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-009             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-010             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-011             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-013             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-014             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-015             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-016             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-017             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-018             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-019             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-024             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-025             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-026             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-027             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-028             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-029             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-030             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-031             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-032             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-033             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-034             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-035             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
csn-036             idle                gpu:k20:2(S:0-1)    gpu:k20:0(IDX:N/A)  
lac-087             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-143             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-196             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-198             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-199             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-288             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-289             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-292             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-344             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
lac-348             idle                gpu:k80:8(S:0)      gpu:k80:8(IDX:0-7)  
nal-000             idle                gpu:a100:4(S:1,3,5,7gpu:a100:0(IDX:N/A) 
nal-001             idle                gpu:a100:4(S:1,3,5,7gpu:a100:0(IDX:N/A) 
nvl-007             idle                gpu:v100:8(S:0-1)   gpu:v100:8(IDX:0-7) 

This output shows all GPUs marked as allocated on ten idle lac* nodes as well as nvl-007, even though no jobs are running on them. scontrol also shows the GPUs as allocated:

$ scontrol show node lac-087
NodeName=lac-087 Arch=x86_64 CoresPerSocket=14 
   CPUAlloc=0 CPUTot=28 CPULoad=0.42
   AvailableFeatures=lac,gbe,intel16,ib,edr16,k80,gpgpu
   ActiveFeatures=lac,gbe,intel16,ib,edr16,k80,gpgpu
   Gres=gpu:k80:8(S:0)
   NodeAddr=lac-087 NodeHostName=lac-087 Version=20.11.8
   OS=Linux 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021 
   RealMemory=246640 AllocMem=0 FreeMem=232252 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=174080 Weight=2000 Owner=N/A MCS_label=N/A
   Partitions=iceradmin,scavenger,general-short,general-long-gpu,christlibuyin-gpu,cmich-gpu,cmse-gpu,cvmaccess-gpu,davidroy-gpu,deyoungbuyin-gpu,eisenlohr-gpu,guowei-search-gpu,hmakmm-gpu,merzjrke-gpu,midi_lab-gpu,multiscaleml-gpu,piermaro-gpu,scbbuyin-gpu,planets-gpu,vermaaslab-gpu,cvl-hpcc-gpu 
   BootTime=2022-01-05T10:06:00 SlurmdStartTime=2022-02-16T10:44:00
   CfgTRES=cpu=28,mem=246640M,billing=37489,gres/gpu=8
   AllocTRES=gres/gpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
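
For reference, a quick one-liner (just a sketch; adjust the partition name as needed) that filters the same sinfo fields down to the idle nodes that still report GPUs in use:

$ sinfo -p general-long-gpu -h -O nodehost,statelong,gresused | awk '$2 == "idle" && $3 !~ /IDX:N\/A/'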

We recently updated our slurmctld to version 21.08.5. Any insight you can provide is much appreciated.

Thanks,
Steve
Comment 1 Jason Booth 2022-03-10 13:31:47 MST
Would you attach your slurm.conf, gres.conf, and logs? This should include the slurmctld.log and slurmd.log from one of those nodes.

Also, are you able to duplicate this with some consistency?
Comment 4 Steve Ford 2022-03-14 11:20:00 MDT
Created attachment 23851 [details]
Configuration files and logs
Comment 5 Steve Ford 2022-03-14 11:33:23 MDT
Jason,

This node state clears out when slurmctld is restarted, but it does come back. We have not discerned a specific type of job that leaves nodes in this state.

Thanks,
Steve
Comment 7 Michael Hinton 2022-03-14 12:12:05 MDT
Hi Steve,

Can you detail how you update your cluster after you make a change, e.g., to the number of GPUs on a node?

Thanks,
-Michael
Comment 8 Michael Hinton 2022-03-14 12:30:50 MDT
Can you attach the slurmd logs of lac-087 as well?
Comment 9 Steve Ford 2022-03-14 12:44:21 MDT
Created attachment 23853 [details]
lac-087 log files
Comment 10 Steve Ford 2022-03-14 12:49:16 MDT
Michael,

We have not needed to change the number of GPUs on a node, but our process when we add new nodes or remove nodes from the cluster is to shut down all slurmds and the slurmctld, update the configs, start the slurmctld, then start the slurmds.

Thanks,
Steve
Comment 11 Michael Hinton 2022-03-14 13:17:51 MDT
Can you attach the output of the following command?

sacct -p -D -j 47895082,47895015,47895061,47531304,47895067,47531368 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline,DBIndex,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask

There are a few types of errors here that are concerning. The first type looks like the following:

[2022-03-12T19:01:44.667] error: gres/gpu: job 47531368 dealloc node lac-087 type k80 gres count underflow (0 1)

I'm not 100% sure what is causing these errors, but sometimes they can occur after updating the cluster if GRES on a node is changed or when nodes are added or removed from a configuration.

At what times did you update the cluster configuration in the last few days?

For what it's worth, we recommend updating the cluster in the following way:

* Stop the slurmctld daemon
* Update the slurm.conf file on all nodes in the cluster
* Restart the slurmd daemons on all nodes
* Restart the slurmctld daemon

This may help reduce these types of dealloc/underflow errors. See https://slurm.schedmd.com/faq.html#add_nodes.
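
As a rough illustration only (assuming systemd units and a ClusterShell "clush" fan-out; the config distribution step is site-specific), that sequence might look like:

    systemctl stop slurmctld                # on the controller
    # push the updated slurm.conf/gres.conf to every node here
    # (config management, shared filesystem, etc.)
    clush -a 'systemctl restart slurmd'     # restart slurmd on all nodes
    systemctl start slurmctld               # back on the controller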

The second type of error that I see is the following:

[2022-03-14T05:22:10.745] error: gres/gpu: job 47895082 dealloc of node lac-293 bad node_offset 0 count is 0

In this second type, I think there must be something fishy going on when these jobs get requeued.
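
Both patterns are easy to pull out of the controller log with a grep like the following (the path is an assumption; use whatever SlurmctldLogFile points to):

    grep -E 'gres count underflow|bad node_offset' /var/log/slurmctld.log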

I'll keep investigating the logs.

Thanks!
-Michael
Comment 12 Michael Hinton 2022-03-14 13:33:51 MDT
(In reply to Michael Hinton from comment #11)
> At what times did you update the cluster configuration in the last few days?
And for each change in configuration, can you supply a diff? I want to see how the node record list was "shifted", which may help explain why some of these nodes are confused.
Comment 13 Steve Ford 2022-03-15 07:34:09 MDT
Michael,

sacct complained about an invalid option, "submitline"; here is the output without that parameter:

$ sacct -p -D -j 47895082,47895015,47895061,47531304,47895067,47531368 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,DBIndex,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
JobID|JobIDRaw|JobName|Partition|NodeList|NNodes|NTasks|ReqCPUS|AllocCPUS|NCPUS|ReqNodes|AllocNodes|ReqMem|ReqTRES|AllocTRES|State|ExitCode|DerivedExitCode|Reason|Elapsed|ElapsedRaw|Submit|Eligible|Start|End|Suspended|DBIndex|Timelimit|TimelimitRaw|Flags|TotalCPU|MinCPU|MinCPUNode|MinCPUTask|
47531304|47531304|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:01|2022-03-11T12:28:01|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00|1149982473|10:00:00|600|SchedBackfill|1-21:13:00||||
47531304.batch|47531304.batch|batch||lac-087|1|1|5|5|5|1|1|8Gc||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:33|00:00:00|1149982473||||1-21:13:00|1-21:12:59|lac-087|0|
47531304.extern|47531304.extern|extern||lac-087|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00|1149982473||||00:00.001|00:00:00|lac-087|0|
47531368|47531368|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:04|2022-03-11T12:28:04|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00|1149982537|10:00:00|600|SchedBackfill|1-18:08:07||||
47531368.batch|47531368.batch|batch||lac-087|1|1|5|5|5|1|1|8Gc||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:22|00:00:00|1149982537||||1-18:08:07|1-18:08:08|lac-087|0|
47531368.extern|47531368.extern|extern||lac-087|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00|1149982537||||00:00.001|00:00:00|lac-087|0|
47895015|47895015|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:20|2022-03-13T21:59:20|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|1150426539|14:00:00|840|SchedBackfill|00:34.583||||
47895015.batch|47895015.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00|1150426539||||00:34.582|00:00:34|lac-293|0|
47895015.extern|47895015.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00|1150426539||||00:00:00|00:00:00|lac-293|0|
47895061|47895061|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:13:11|791|2022-03-13T21:59:21|2022-03-13T21:59:21|2022-03-14T05:20:48|2022-03-14T05:33:59|00:00:00|1150426604|14:00:00|840|SchedBackfill|21:11.834||||
47895061.batch|47895061.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00|1150426604||||21:11.833|00:21:11|lac-293|0|
47895061.extern|47895061.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00|1150426604||||00:00.001|00:00:00|lac-293|0|
47895067|47895067|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:15:30|930|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:36:18|00:00:00|1150426610|14:00:00|840|SchedBackfill|22:58.819||||
47895067.batch|47895067.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00|1150426610||||22:58.818|00:22:58|lac-293|0|
47895067.extern|47895067.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00|1150426610||||00:00.001|00:00:00|lac-293|0|
47895082|47895082|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|1150426625|14:00:00|840|SchedBackfill|00:36.564||||
47895082.batch|47895082.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00|1150426625||||00:36.563|00:00:36|lac-293|0|
47895082.extern|47895082.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00|1150426625||||00:00.001|00:00:00|lac-293|0|

Thanks,
Steve
Comment 14 Michael Hinton 2022-03-15 09:32:13 MDT
(In reply to Steve Ford from comment #13)
> sacct complained about an invalid option, "submitline"; here is the output
> without that parameter:
Are all your Slurm daemons on 21.08.5? What version is slurmdbd running?
Comment 15 Michael Hinton 2022-03-15 09:35:08 MDT
I think that error means that sacct is still at an old version. Can you run this as well?

    sacct --version
Comment 16 Steve Ford 2022-03-15 12:18:28 MDT
Michael,

You were correct; the Slurm client version I was using was behind (20.11.4). Our slurmds, slurmctld, and slurmdbd are all on 21.08.5. Here is the output after updating the client:

sacct -p -D -j 47895082,47895015,47895061,47531304,47895067,47531368 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline,DBIndex,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
JobID|JobIDRaw|JobName|Partition|NodeList|NNodes|NTasks|ReqCPUS|AllocCPUS|NCPUS|ReqNodes|AllocNodes|ReqMem|ReqTRES|AllocTRES|State|ExitCode|DerivedExitCode|Reason|Elapsed|ElapsedRaw|Submit|Eligible|Start|End|Suspended|SubmitLine|DBIndex|Timelimit|TimelimitRaw|Flags|TotalCPU|MinCPU|MinCPUNode|MinCPUTask|
47531304|47531304|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:01|2022-03-11T12:28:01|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00||1149982473|10:00:00|600|SchedBackfill|1-21:13:00||||
47531304.batch|47531304.batch|batch||lac-087|1|1|5|5|5|1|1|||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:33|00:00:00||1149982473||||1-21:13:00|1-21:12:59|lac-087|0|
47531304.extern|47531304.extern|extern||lac-087|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00||1149982473||||00:00.001|00:00:00|lac-087|0|
47531368|47531368|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:04|2022-03-11T12:28:04|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00||1149982537|10:00:00|600|SchedBackfill|1-18:08:07||||
47531368.batch|47531368.batch|batch||lac-087|1|1|5|5|5|1|1|||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:22|00:00:00||1149982537||||1-18:08:07|1-18:08:08|lac-087|0|
47531368.extern|47531368.extern|extern||lac-087|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00||1149982537||||00:00.001|00:00:00|lac-087|0|
47895015|47895015|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:20|2022-03-13T21:59:20|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|sbatch submit_full.sb False 85 312 256 True 512 3 mean True|1150426539|14:00:00|840|SchedBackfill|00:34.583||||
47895015.batch|47895015.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00||1150426539||||00:34.582|00:00:34|lac-293|0|
47895015.extern|47895015.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00||1150426539||||00:00:00|00:00:00|lac-293|0|
47895061|47895061|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:13:11|791|2022-03-13T21:59:21|2022-03-13T21:59:21|2022-03-14T05:20:48|2022-03-14T05:33:59|00:00:00|sbatch submit_full.sb False 85 312 400 True 512 3 sum False|1150426604|14:00:00|840|SchedBackfill|21:11.834||||
47895061.batch|47895061.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00||1150426604||||21:11.833|00:21:11|lac-293|0|
47895061.extern|47895061.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00||1150426604||||00:00.001|00:00:00|lac-293|0|
47895067|47895067|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:15:30|930|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:36:18|00:00:00|sbatch submit_full.sb False 85 312 400 True 512 3 mean False|1150426610|14:00:00|840|SchedBackfill|22:58.819||||
47895067.batch|47895067.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00||1150426610||||22:58.818|00:22:58|lac-293|0|
47895067.extern|47895067.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00||1150426610||||00:00.001|00:00:00|lac-293|0|
47895082|47895082|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|sbatch submit_full.sb False 85 312 400 False 512 2 mean True|1150426625|14:00:00|840|SchedBackfill|00:36.564||||
47895082.batch|47895082.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00||1150426625||||00:36.563|00:00:36|lac-293|0|
47895082.extern|47895082.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00||1150426625||||00:00.001|00:00:00|lac-293|0|

Thanks,
Steve
Comment 17 Michael Hinton 2022-03-15 12:48:30 MDT
(In reply to Michael Hinton from comment #11)
> At what times did you update the cluster configuration in the last few days?
Steve, can you give a brief summary of when you last upgraded Slurm and when you last changed the config in any way? Did these issues appear out of the blue on a stable system, or could they be related to recent config changes?
Comment 18 Michael Hinton 2022-03-16 10:17:01 MDT
As it is now, it is very tricky to understand how your system is getting into this state based solely on the logs provided. If you are able to reproduce this on a test cluster, though, that would help us a lot.

How often are you hitting these issues? Did they go away after your upgrade to 21.08, or are they persisting?
Comment 19 Michael Hinton 2022-03-17 10:10:37 MDT
Reducing to severity 3. Feel free to respond to my last comments and bump this back to a sev 2 as needed.
Comment 20 Steve Ford 2022-03-21 09:59:02 MDT
Hello Michael,

Sorry for the delay. This node state has tapered off completely over the last week, and I have not seen a node in this state since last Wednesday. I think this may have been related to the upgrade, perhaps to jobs that started before the nodes were updated and completed after they were updated. This issue is no longer a priority for us. I will let you know if it resurfaces.

Thanks,
Steve
Comment 21 Michael Hinton 2022-03-21 10:12:47 MDT
(In reply to Steve Ford from comment #20)
> I think this may have been related to the upgrade, perhaps to jobs that
> started before the nodes were updated and completed after they were updated.
Ok, good to know.

After looking back at comment 0, I noticed this:

(In reply to Steve Ford from comment #0)
> $ scontrol show node lac-087
> NodeName=lac-087 Arch=x86_64 CoresPerSocket=14 
> ...
>    Gres=gpu:k80:8(S:0)
>    NodeAddr=lac-087 NodeHostName=lac-087 Version=20.11.8
> ...
If `scontrol show node` is to be believed, then this slurmd was still on 20.11.8. But later you said your slurmds were all at 21.08.5:

(In reply to Steve Ford from comment #16)
> You were correct; the Slurm client version I was using was behind (20.11.4).
> Our slurmds, slurmctld, and slurmdbd are all on 21.08.5.
Perhaps your upgrade process did not restart all slurmds properly.
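
One way to spot stragglers (a quick sketch) is to summarize the slurmd version the controller has recorded for each node:

    scontrol show node | grep -o 'Version=[^ ]*' | sort | uniq -c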

I think we can chalk this up to the ctld being on 21.08, the job starting in 20.11, and the slurmd still being on 20.11. However, Slurm should still be able to handle this case, so that is something we will try to look into. I'll go ahead and reduce the severity accordingly.

Thanks!
-Michael
Comment 22 Michael Hinton 2022-04-14 16:32:37 MDT
Hey Steve,

Since the problem has gone away, and since it's not clear how to reproduce this, I'm going to go ahead and mark this as resolved. But feel free to reopen if you can provide a reproducer.

Thanks!
-Michael