Hello SchedMD,

Today we noticed a strange issue with some of our GPU nodes. The nodes are listed as idle, but Slurm is reporting that their GPUs are allocated. Here is sinfo output for nodes in our GPU partition:

$ sinfo -p general-long-gpu -O nodehost,statelong,gres,gresused
HOSTNAMES STATE GRES GRES_USED
csn-021 mixed gpu:k20:2(S:0-1) gpu:k20:2(IDX:0-1)
csn-022 mixed gpu:k20:2(S:0-1) gpu:k20:2(IDX:0-1)
csn-023 mixed gpu:k20:2(S:0-1) gpu:k20:2(IDX:0-1)
lac-030 mixed gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-195 mixed gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-290 mixed gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-291 mixed gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-293 mixed gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
nvf-018 mixed gpu:v100s:4(S:0-1) gpu:v100s:4(IDX:0-3)
nvf-019 mixed gpu:v100s:4(S:0-1) gpu:v100s:1(IDX:1)
nvf-020 mixed gpu:v100s:4(S:0-1) gpu:v100s:4(IDX:0-3)
nvl-005 mixed gpu:v100:8(S:0-1) gpu:v100:6(IDX:0-2,4
nvl-006 mixed gpu:v100:8(S:0-1) gpu:v100:8(IDX:0-7)
csn-001 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-002 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-003 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-004 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-005 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-006 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-007 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-008 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-009 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-010 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-011 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-013 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-014 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-015 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-016 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-017 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-018 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-019 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-024 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-025 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-026 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-027 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-028 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-029 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-030 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-031 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-032 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-033 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-034 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-035 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
csn-036 idle gpu:k20:2(S:0-1) gpu:k20:0(IDX:N/A)
lac-087 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-143 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-196 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-198 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-199 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-288 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-289 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-292 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-344 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
lac-348 idle gpu:k80:8(S:0) gpu:k80:8(IDX:0-7)
nal-000 idle gpu:a100:4(S:1,3,5,7gpu:a100:0(IDX:N/A)
nal-001 idle gpu:a100:4(S:1,3,5,7gpu:a100:0(IDX:N/A)
nvl-007 idle gpu:v100:8(S:0-1) gpu:v100:8(IDX:0-7)

This output shows all GPUs reported as allocated on the ten idle lac* nodes as well as nvl-007, even though those nodes are marked idle.
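For reference, here is a one-liner that pulls out just the suspicious nodes, i.e. idle nodes whose GRES_USED is not zero. This is only a rough sketch that assumes the field order above; explicit widths such as nodehost:25 may be needed so the columns don't run together, as happened for the nal-* rows.

$ sinfo -p general-long-gpu -h -O nodehost,statelong,gresused | awk '$2 == "idle" && $3 !~ /IDX:N\/A/'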
scontrol also shows the GPUs as allocated:

$ scontrol show node lac-087
NodeName=lac-087 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=0 CPUTot=28 CPULoad=0.42
   AvailableFeatures=lac,gbe,intel16,ib,edr16,k80,gpgpu
   ActiveFeatures=lac,gbe,intel16,ib,edr16,k80,gpgpu
   Gres=gpu:k80:8(S:0)
   NodeAddr=lac-087 NodeHostName=lac-087 Version=20.11.8
   OS=Linux 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021
   RealMemory=246640 AllocMem=0 FreeMem=232252 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=174080 Weight=2000 Owner=N/A MCS_label=N/A
   Partitions=iceradmin,scavenger,general-short,general-long-gpu,christlibuyin-gpu,cmich-gpu,cmse-gpu,cvmaccess-gpu,davidroy-gpu,deyoungbuyin-gpu,eisenlohr-gpu,guowei-search-gpu,hmakmm-gpu,merzjrke-gpu,midi_lab-gpu,multiscaleml-gpu,piermaro-gpu,scbbuyin-gpu,planets-gpu,vermaaslab-gpu,cvl-hpcc-gpu
   BootTime=2022-01-05T10:06:00 SlurmdStartTime=2022-02-16T10:44:00
   CfgTRES=cpu=28,mem=246640M,billing=37489,gres/gpu=8
   AllocTRES=gres/gpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

Note State=IDLE alongside AllocTRES=gres/gpu=8.

We recently updated our slurmctld to version 21.08.5. Any insight you can provide is much appreciated.

Thanks,
Steve
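P.S. To scan the whole cluster for idle nodes that still show allocated GPUs in one pass, something like the following should also work (a sketch using scontrol's one-line-per-node output):

$ scontrol -o show nodes | grep 'State=IDLE' | grep 'AllocTRES=[^ ]*gres/gpu'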
Would you attach your slurm.conf, gres.conf, and logs? This should include the slurmctld.log and the slurmd.log from one of the affected nodes. Also, are you able to reproduce this with any consistency?
Created attachment 23851 [details] Configuration files and logs
Jason,

This node state clears out when slurmctld is restarted, but it does come back. We have not discerned a specific type of job that leaves nodes in this state.

Thanks,
Steve
Hi Steve,

Can you detail how you update your cluster after you make a change to, e.g., the number of GPUs on a node?

Thanks,
-Michael
Can you attach the slurmd logs of lac-087 as well?
Created attachment 23853 [details] lac-087 log files
Michael,

We have not needed to change the number of GPUs on a node, but our process when we add or remove nodes from the cluster is to shut down all slurmds and the slurmctld, update the configs, start the slurmctld, and then start the slurmds.

Thanks,
Steve
Can you attach the output of the following command?

$ sacct -p -D -j 47895082,47895015,47895061,47531304,47895067,47531368 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline,DBIndex,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask

There are a few types of errors that are concerning. The first type looks like the following:

[2022-03-12T19:01:44.667] error: gres/gpu: job 47531368 dealloc node lac-087 type k80 gres count underflow (0 1)

I'm not 100% sure what is causing these errors, but they can sometimes occur after updating the cluster if the GRES on a node is changed, or when nodes are added to or removed from a configuration. At what times did you update the cluster configuration in the last few days?

For what it's worth, we recommend updating the cluster in the following way:

* Stop the slurmctld daemon
* Update the slurm.conf file on all nodes in the cluster
* Restart the slurmd daemons on all nodes
* Restart the slurmctld daemon

This may help reduce these types of dealloc/underflow errors. See https://slurm.schedmd.com/faq.html#add_nodes.

The second type of error that I see is the following:

[2022-03-14T05:22:10.745] error: gres/gpu: job 47895082 dealloc of node lac-293 bad node_offset 0 count is 0

For this second type, I think there must be something fishy going on when these jobs get requeued. I'll keep investigating the logs.

Thanks!
-Michael
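P.S. For illustration, that update order might look like the following on a systemd-based cluster. This is only a sketch: the pdsh group name and the mechanism for distributing slurm.conf are placeholders for whatever your site uses.

# 1. Stop the controller first so it cannot act on a half-updated cluster
systemctl stop slurmctld
# 2. Distribute the updated slurm.conf to every node (site-specific:
#    config management, rsync, a shared filesystem, etc.)
# 3. Restart every slurmd ("cluster" is a hypothetical pdsh group)
pdsh -g cluster systemctl restart slurmd
# 4. Start the controller last so it comes up against updated slurmds
systemctl start slurmctld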
(In reply to Michael Hinton from comment #11)
> At what times did you update the cluster configuration in the last few days?

And for each change in configuration, can you supply a diff? I want to see how the node record list was "shifted" to see if this can help explain why some of these nodes are confused.
Michael,

sacct complained about an invalid option, "submitline"; here is the output without that parameter:

$ sacct -p -D -j 47895082,47895015,47895061,47531304,47895067,47531368 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,DBIndex,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
JobID|JobIDRaw|JobName|Partition|NodeList|NNodes|NTasks|ReqCPUS|AllocCPUS|NCPUS|ReqNodes|AllocNodes|ReqMem|ReqTRES|AllocTRES|State|ExitCode|DerivedExitCode|Reason|Elapsed|ElapsedRaw|Submit|Eligible|Start|End|Suspended|DBIndex|Timelimit|TimelimitRaw|Flags|TotalCPU|MinCPU|MinCPUNode|MinCPUTask|
47531304|47531304|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:01|2022-03-11T12:28:01|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00|1149982473|10:00:00|600|SchedBackfill|1-21:13:00||||
47531304.batch|47531304.batch|batch||lac-087|1|1|5|5|5|1|1|8Gc||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:33|00:00:00|1149982473||||1-21:13:00|1-21:12:59|lac-087|0|
47531304.extern|47531304.extern|extern||lac-087|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00|1149982473||||00:00.001|00:00:00|lac-087|0|
47531368|47531368|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:04|2022-03-11T12:28:04|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00|1149982537|10:00:00|600|SchedBackfill|1-18:08:07||||
47531368.batch|47531368.batch|batch||lac-087|1|1|5|5|5|1|1|8Gc||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:22|00:00:00|1149982537||||1-18:08:07|1-18:08:08|lac-087|0|
47531368.extern|47531368.extern|extern||lac-087|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00|1149982537||||00:00.001|00:00:00|lac-087|0|
47895015|47895015|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:20|2022-03-13T21:59:20|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|1150426539|14:00:00|840|SchedBackfill|00:34.583||||
47895015.batch|47895015.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00|1150426539||||00:34.582|00:00:34|lac-293|0|
47895015.extern|47895015.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00|1150426539||||00:00:00|00:00:00|lac-293|0|
47895061|47895061|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:13:11|791|2022-03-13T21:59:21|2022-03-13T21:59:21|2022-03-14T05:20:48|2022-03-14T05:33:59|00:00:00|1150426604|14:00:00|840|SchedBackfill|21:11.834||||
47895061.batch|47895061.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00|1150426604||||21:11.833|00:21:11|lac-293|0|
47895061.extern|47895061.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00|1150426604||||00:00.001|00:00:00|lac-293|0|
47895067|47895067|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:15:30|930|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:36:18|00:00:00|1150426610|14:00:00|840|SchedBackfill|22:58.819||||
47895067.batch|47895067.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00|1150426610||||22:58.818|00:22:58|lac-293|0|
47895067.extern|47895067.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00|1150426610||||00:00.001|00:00:00|lac-293|0|
47895082|47895082|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|8Gc|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|1150426625|14:00:00|840|SchedBackfill|00:36.564||||
47895082.batch|47895082.batch|batch||lac-293|1|1|5|5|5|1|1|8Gc||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00|1150426625||||00:36.563|00:00:36|lac-293|0|
47895082.extern|47895082.extern|extern||lac-293|1|1|5|5|5|1|1|8Gc||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00|1150426625||||00:00.001|00:00:00|lac-293|0|

Thanks,
Steve
(In reply to Steve Ford from comment #13)
> sacct complained about an invalid option, "submitline"; here is the output
> without that parameter:

Are all your Slurm daemons on 21.08.5? What version is slurmdbd running?
I think that error means that sacct is still at an old version. Can you run this as well?

$ sacct --version
Michael,

You were correct; the Slurm client version I was using was behind (20.11.4). Our slurmds, slurmctld, and slurmdbd are all on 21.08.5. Here is the output after updating the client:

$ sacct -p -D -j 47895082,47895015,47895061,47531304,47895067,47531368 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline,DBIndex,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
JobID|JobIDRaw|JobName|Partition|NodeList|NNodes|NTasks|ReqCPUS|AllocCPUS|NCPUS|ReqNodes|AllocNodes|ReqMem|ReqTRES|AllocTRES|State|ExitCode|DerivedExitCode|Reason|Elapsed|ElapsedRaw|Submit|Eligible|Start|End|Suspended|SubmitLine|DBIndex|Timelimit|TimelimitRaw|Flags|TotalCPU|MinCPU|MinCPUNode|MinCPUTask|
47531304|47531304|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:01|2022-03-11T12:28:01|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00||1149982473|10:00:00|600|SchedBackfill|1-21:13:00||||
47531304.batch|47531304.batch|batch||lac-087|1|1|5|5|5|1|1|||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:33|00:00:00||1149982473||||1-21:13:00|1-21:12:59|lac-087|0|
47531304.extern|47531304.extern|extern||lac-087|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T11:28:03|2022-03-12T21:28:32|00:00:00||1149982473||||00:00.001|00:00:00|lac-087|0|
47531368|47531368|dyltask|general-long-gpu|lac-087|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|TIMEOUT|0:0|0:0|ReqNodeNotAvail|10:00:29|36029|2022-03-11T12:28:04|2022-03-11T12:28:04|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00||1149982537|10:00:00|600|SchedBackfill|1-18:08:07||||
47531368.batch|47531368.batch|batch||lac-087|1|1|5|5|5|1|1|||cpu=5,gres/gpu=1,mem=40G,node=1|CANCELLED|0:15|||10:00:30|36030|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:22|00:00:00||1149982537||||1-18:08:07|1-18:08:08|lac-087|0|
47531368.extern|47531368.extern|extern||lac-087|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||10:00:29|36029|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T13:03:52|2022-03-12T23:04:21|00:00:00||1149982537||||00:00.001|00:00:00|lac-087|0|
47895015|47895015|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:20|2022-03-13T21:59:20|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|sbatch submit_full.sb False 85 312 256 True 512 3 mean True|1150426539|14:00:00|840|SchedBackfill|00:34.583||||
47895015.batch|47895015.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00||1150426539||||00:34.582|00:00:34|lac-293|0|
47895015.extern|47895015.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||01:36:36|5796|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T03:45:34|2022-03-14T05:22:10|00:00:00||1150426539||||00:00:00|00:00:00|lac-293|0|
47895061|47895061|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:13:11|791|2022-03-13T21:59:21|2022-03-13T21:59:21|2022-03-14T05:20:48|2022-03-14T05:33:59|00:00:00|sbatch submit_full.sb False 85 312 400 True 512 3 sum False|1150426604|14:00:00|840|SchedBackfill|21:11.834||||
47895061.batch|47895061.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00||1150426604||||21:11.833|00:21:11|lac-293|0|
47895061.extern|47895061.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:22|3262|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T04:39:37|2022-03-14T05:33:59|00:00:00||1150426604||||00:00.001|00:00:00|lac-293|0|
47895067|47895067|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:15:30|930|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:36:18|00:00:00|sbatch submit_full.sb False 85 312 400 True 512 3 mean False|1150426610|14:00:00|840|SchedBackfill|22:58.819||||
47895067.batch|47895067.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00||1150426610||||22:58.818|00:22:58|lac-293|0|
47895067.extern|47895067.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:54:27|3267|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T04:41:51|2022-03-14T05:36:18|00:00:00||1150426610||||00:00.001|00:00:00|lac-293|0|
47895082|47895082|dyltask|general-long-gpu|lac-293|1||5|5|5|1|1|40G|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|0:0|ReqNodeNotAvail|00:01:22|82|2022-03-13T21:59:22|2022-03-13T21:59:22|2022-03-14T05:20:48|2022-03-14T05:22:10|00:00:00|sbatch submit_full.sb False 85 312 400 False 512 2 mean True|1150426625|14:00:00|840|SchedBackfill|00:36.564||||
47895082.batch|47895082.batch|batch||lac-293|1|1|5|5|5|1|1|||cpu=5,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00||1150426625||||00:36.563|00:00:36|lac-293|0|
47895082.extern|47895082.extern|extern||lac-293|1|1|5|5|5|1|1|||billing=6225,cpu=5,gres/gpu=1,mem=40G,node=1|COMPLETED|0:0|||00:17:15|1035|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:04:55|2022-03-14T05:22:10|00:00:00||1150426625||||00:00.001|00:00:00|lac-293|0|

Thanks,
Steve
(In reply to Michael Hinton from comment #11)
> At what times did you update the cluster configuration in the last few days?

Steve, can you give a brief summary of when you last upgraded Slurm, and also when you last changed the config in any way? Did these issues appear out of the blue on a stable system, or do you think they could be related to recent config changes?
As it is now, it is very tricky to understand how your system is getting into this state based only on the logs provided. If you are able to reproduce this on a test cluster, though, that would help us a lot. How often are you hitting these issues? Are the issues going away after your upgrade to 21.08, or are they persisting?
Reducing to severity 3. Feel free to respond to my last comments and bump this back to a sev 2 as needed.
Hello Michael,

Sorry for the delay. This node state has tapered off completely over the last week, and I have not seen a node in this state since last Wednesday. I think this may have been related to the upgrade, perhaps to jobs that started before the nodes were updated and completed after they were updated. This issue is no longer a priority for us. I will let you know if it resurfaces.

Thanks,
Steve
(In reply to Steve Ford from comment #20)
> I think this may have been related to the upgrade, perhaps to jobs that
> started before the nodes were updated and completed after they were updated.

Ok, good to know.

After looking back at comment 0, I noticed this:

(In reply to Steve Ford from comment #0)
> $ scontrol show node lac-087
> NodeName=lac-087 Arch=x86_64 CoresPerSocket=14
> ...
> Gres=gpu:k80:8(S:0)
> NodeAddr=lac-087 NodeHostName=lac-087 Version=20.11.8
> ...

If `scontrol show node` is to be believed, then this slurmd was still on 20.11.8. But later you said your slurmds were all at 21.08.5:

(In reply to Steve Ford from comment #16)
> You were correct; the Slurm client version I was using was behind (20.11.4).
> Our slurmds, slurmctld, and slurmdbd are all on 21.08.5.

Perhaps your upgrade process did not restart all slurmds properly. I think we can chalk this up to the ctld being on 21.08, the job starting in 20.11, and the slurmd still being on 20.11. However, Slurm should still be able to handle this case, so that will be something we will try to look into. I'll go ahead and reduce the severity accordingly.

Thanks!
-Michael
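P.S. After future upgrades, a quick way to catch slurmds that were not restarted is to compare the daemon versions reported per node. This is only a sketch: it assumes the "version" output field of sinfo -O and an expected version of 21.08.5.

# Show the distinct slurmd versions present across the cluster
$ sinfo -N -h -O nodehost,version | sort -u
# Or list only the stragglers not yet on the expected version
$ sinfo -N -h -O nodehost,version | awk '$2 != "21.08.5"'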
Hey Steve,

Since the problem has gone away, and since it's not clear how to reproduce this, I'm going to go ahead and mark this as resolved. But feel free to reopen if you can provide a reproducer.

Thanks!
-Michael