Our issue appears to be the same as #15891 and #13600, where the GPUs are listed as allocated even though the node itself is idle. Subsequent attempts to use such a node pend forever with reason (Resources).

$ sinfo -p general -O nodehost,statelong,gres,gresused
<snipped, but provided in full as attachment>
g004 idle gpu:a100:4 gpu:a100:4(IDX:0-3)
g007 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g011 idle gpu:a100:4 gpu:a100:4(IDX:0-3)
g012 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g014 idle gpu:a100:4 gpu:a100:4(IDX:0-3)
g015 idle gpu:a100:4 gpu:a100:4(IDX:0-3)
g016 idle gpu:a100:4 gpu:a100:4(IDX:0-3)
g017 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g025 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g026 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g035 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g036 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g037 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g038 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g039 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g040 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g041 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g042 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g043 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g044 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g045 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g046 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g047 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g048 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g049 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g050 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g051 idle gpu:a100:4 gpu:a100:0(IDX:N/A)
g230 idle gpu:a30:3 gpu:a30:0(IDX:N/A)
g231 idle gpu:a30:3 gpu:a30:0(IDX:N/A)
g232 idle gpu:a30:3 gpu:a30:0(IDX:N/A)
g233 idle gpu:a30:3 gpu:a30:0(IDX:N/A)
g234 idle gpu:a30:3 gpu:a30:0(IDX:N/A)
g239 idle gpu:a100:4 gpu:a100:4(IDX:0-3)

scontrol output for an affected node looks like the others experiencing this issue. Note that the node is State=IDLE yet AllocTRES still shows gres/gpu=4:

$ scontrol show node g014
NodeName=g014 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.00
   AvailableFeatures=public,debug,long,epyc
   ActiveFeatures=public,debug,long,epyc
   Gres=gpu:a100:4
   NodeAddr=g014 NodeHostName=g014 Version=22.05.7
   OS=Linux 4.18.0-348.el8.0.2.x86_64 #1 SMP Sun Nov 14 00:51:12 UTC 2021
   RealMemory=515300 AllocMem=0 FreeMem=506165 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=4096 Weight=2500 Owner=N/A MCS_label=N/A
   Partitions=general,htc
   BootTime=2023-02-13T10:57:24 SlurmdStartTime=2023-02-21T10:32:26
   LastBusyTime=2023-02-23T15:52:47
   CfgTRES=cpu=48,mem=515300M,billing=353,gres/gpu=4,gres/gpu:a100=4
   AllocTRES=gres/gpu=4,gres/gpu:a100=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

We have been using 22.05.7 since its release, but we have added 51 GPU nodes (g0[01-51]) since then. All slurmd versions are the same across the cluster.

Our slurmctld.log is enormous, so here are some snippets:

[root@slurm01 slurm]# grep -rn dealloc
slurmctld.log:28348:[2022-05-06T11:06:09.909] error: deallocate_nodes: JobId=661 allocated no nodes to be killed on
slurmctld.log:14474110:[2023-01-05T19:22:08.797] error: gres/gpu: job 1614821 dealloc of node g106 bad node_offset 0 count is 0
slurmctld.log:14523432:[2023-01-06T22:29:25.758] error: gres/gpu: job 1616654 dealloc of node g106 bad node_offset 0 count is 0
slurmctld.log:14566623:[2023-01-07T21:07:31.705] error: gres/gpu: job 1617927 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:16393192:[2023-02-02T11:14:04.003] error: gres/gpu: job 1696811 dealloc of node cg002 bad node_offset 0 count is 0
slurmctld.log:16413210:[2023-02-02T16:37:29.709] error: gres/gpu: job 1699088 dealloc of node cg004 bad node_offset 0 count is 0
slurmctld.log:16449401:[2023-02-03T02:31:29.049] error: gres/gpu: job 1702489 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:16469242:[2023-02-03T08:36:01.366] error: gres/gpu: job 1699011 dealloc of node cg002 bad node_offset 0 count is 0
slurmctld.log:16679082:[2023-02-06T12:39:41.415] error: gres/gpu: job 1721794 dealloc of node cg001 bad node_offset 0 count is 0
slurmctld.log:16679118:[2023-02-06T12:40:33.241] error: gres/gpu: job 1721595 dealloc of node cg001 bad node_offset 0 count is 0
slurmctld.log:16679121:[2023-02-06T12:40:33.242] error: gres/gpu: job 1721802 dealloc of node cg001 bad node_offset 0 count is 0
slurmctld.log:16840647:[2023-02-08T10:43:08.467] error: gres/gpu: job 1719682 dealloc of node g237 bad node_offset 0 count is 0
slurmctld.log:16994250:[2023-02-11T01:53:18.555] error: gres/gpu: job 1703873 dealloc of node cg004 bad node_offset 0 count is 0
slurmctld.log:17272435:[2023-02-14T08:44:06.037] error: gres/gpu: job 1742877 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:17638218:[2023-02-19T01:22:05.837] error: gres/gpu: job 1774139 dealloc of node g009 bad node_offset 0 count is 0
slurmctld.log:17640942:[2023-02-19T03:11:40.629] error: gres/gpu: job 1807480 dealloc of node g005 bad node_offset 0 count is 0
slurmctld.log:17721724:[2023-02-21T10:50:25.242] error: gres/gpu: job 1779081 dealloc of node g235 bad node_offset 0 count is 0
slurmctld.log:17755597:[2023-02-22T05:06:58.882] error: gres/gpu: job 1826767 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:17763523:[2023-02-22T12:02:42.020] error: gres/gpu: job 1829514 dealloc of node g016 bad node_offset 0 count is 0
slurmctld.log:17786124:[2023-02-22T21:44:10.586] error: gres/gpu: job 1831266 dealloc of node g011 bad node_offset 0 count is 0
slurmctld.log:17791561:[2023-02-23T01:04:05.048] error: gres/gpu: job 1831638 dealloc of node g004 bad node_offset 0 count is 0
slurmctld.log:17793734:[2023-02-23T01:56:25.739] error: gres/gpu: job 1832132 dealloc of node g015 bad node_offset 0 count is 0
slurmctld.log:17816174:[2023-02-23T13:00:03.628] error: gres/gpu: job 1831900 dealloc of node g033 bad node_offset 0 count is 0
slurmctld.log:17832248:[2023-02-23T15:52:47.141] error: gres/gpu: job 1833842 dealloc of node g014 bad node_offset 0 count is 0

These GPU hosts were added over a week ago (~14th), and the issue first came up on a few nodes then.
scontrol reconfigure brought them back online, but the issue then returned. No other nodes have been added since, and there have been no gres.conf changes, slurm.conf changes, or hardware changes since that initial occurrence. Also attached are three files corresponding to three jobs that were using the GPUs at the time of the failure; each job failed for a different reason. The jobs ended and the accounting appears finalized, but again, the GPUs never returned to a usable state. These files contain the relevant slurmctld logs, slurmd logs, and job script specifics.
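For anyone else triaging the same symptom, a quick way to spot affected nodes is to filter the sinfo output above for nodes that are idle yet still report one or more GPUs in GresUsed. This is only a sketch built on that output format: the partition name is from our cluster, `-h` suppresses the sinfo header, and the awk regex assumes GRES names of the form gpu:<type>:<count>.

```shell
# Sketch: list nodes that sinfo reports as idle while GresUsed still shows
# one or more GPUs allocated (the stuck-GRES symptom described above).
# Partition name "general" is specific to this cluster; adjust as needed.
sinfo -h -p general -O nodehost,statelong,gres,gresused \
  | awk '$2 == "idle" && $4 ~ /gpu:[a-z0-9]+:[1-9]/ {print $1}'
```

The awk filter keys on field 2 (state) and field 4 (GresUsed), so healthy idle nodes like g007 above, whose GresUsed is gpu:a100:0(IDX:N/A), are excluded.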
Created attachment 29041 [details] sinfo output
Created attachment 29042 [details] gres.conf
Created attachment 29043 [details] slurm.conf
Created attachment 29044 [details] job_example_1
Created attachment 29045 [details] job_example_2
Created attachment 29046 [details] job_example_3
This error message is from a bug that we fixed in 23.02 (which will be released this month):

>slurmctld.log:14474110:[2023-01-05T19:22:08.797] error: gres/gpu: job 1614821 dealloc of node g106 bad node_offset 0 count is 0

I don't know for sure whether your "GRES in use" issue is associated with that bug, though it does look related. Has this issue occurred more than once? Do you have a way of reproducing it? If you restart the slurmctld and slurmds, are you able to use those nodes again?

commit 56bbd738f77fe7fbfc856d0247aa1d9429ec2b1c
Author: Scott Hilton <scott@schedmd.com>
Date:   Wed Dec 7 16:32:53 2022 -0700

    Remove unused parameted from job_res_rm_job()

    Bug 15145

commit d3053cf0d66732fb13ad047eb32f983ca6dc2f38
Author: Scott Hilton <scott@schedmd.com>
Date:   Wed Dec 7 15:44:02 2022 -0700

    Use gres_list_alloc for gres_ctld_job_dealloc()

    When testing for backfill or preempt. This is ok because
    job_gres_list is not altered in gres_ctld_job_dealloc()
    unless resize is true (which it isn't in select_g_job_test()).
    Furthermore, if there is gres to dealloc the gres_list_alloc
    must already exist.

    Bug 15145
Yes, every node in the sinfo output showing IDLE with gpu:a100:4(IDX:0-3) is impacted by this; that means at least six nodes exhibited this behavior within the last 24 hours or so. scontrol reconfig fixes the issue for all nodes, as it did the first time (~a week ago), but the issue reemerged. I have provided the sbatch scripts for three of the jobs that caused this: all three requested GPUs, failed to some degree, and then left the GPUs stuck.
Created attachment 29074 [details]
test patch

I have so far been unable to reproduce this issue. This patch may fix it, but I am unsure. The change was added to the 23.02 release and should be stable, but we didn't add it to 22.05 because we didn't see any immediate bugs it fixed. If you choose to test it, let me know if it seems to stop the issue. Otherwise, a procedure for reliably reproducing this behavior would be very useful.

-Scott
William,

Did you try testing with the patch? Have you seen the issue since then? Any other updates?

-Scott
Hi Scott, thanks for the follow-up. This issue only seems to occur in the wild in our prod environment, never in dev, so I haven't applied the patch yet. That said, we are performing scheduled maintenance on the 20th, during which we are upgrading to 23.02, so if the patch functionality is in that release, it will be in effect soon. I can report back with results.
William,

Ok, let me know if you have any new information on this. How often are you seeing this issue come up in your cluster? Was it just the once, or does it reoccur often?

-Scott
William,

How did the upgrade go?

-Scott
A mixed bag, but largely not super positive. It took until yesterday for a repeat:

g003 idle gpu:a100:4 gpu:a100:4(IDX:0-3)

g003 slurmd.log:
[2023-04-04T14:20:45.891] [2281651.batch] error: *** JOB 2281651 ON g003 CANCELLED AT 2023-04-04T14:20:45 ***
[2023-04-04T14:20:45.891] [2281651.0] error: *** STEP 2281651.0 ON g003 CANCELLED AT 2023-04-04T14:20:45 ***
[2023-04-04T14:20:45.920] [2281651.extern] done with job
[2023-04-04T14:20:47.068] [2281651.batch] done with job
[2023-04-04T14:21:19.287] [2281651.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2023-04-04T14:21:19.293] [2281651.0] done with job

slurmctld.log:
[2023-04-04T14:20:45.870] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=2281651 uid 1311736

All of the above is from April 4. Upon further review of the g003 node, I see these entries, which I didn't expect to see since the upgrade back in March:

[root@slurm01 ~]# grep "dealloc" /var/log/slurm/slurmctld.log
<snipped old>
[2023-03-24T13:26:45.554] error: gres/gpu: job 2129875 dealloc of node g003 bad node_offset 0 count is 0
[2023-03-24T13:26:56.468] error: gres/gpu: job 2129874 dealloc of node g003 bad node_offset 0 count is 0
[2023-04-02T05:37:17.435] error: gres/gpu: job 2281080 dealloc of node g003 bad node_offset 0 count is 0

Here's the timeframe for the upgrade:

[root@slurm01 ~]# grep "slurmctld version" /var/log/slurm/slurmctld.log
<snipped older>
[2023-03-20T15:19:57.674] slurmctld version 22.05.7 started on cluster sol
[2023-03-21T17:16:47.222] slurmctld version 23.02.0 started on cluster sol

So within three days of the upgrade (all nodes in the cluster successfully upgraded to 23.02.0), the dealloc issue returned. `scontrol reconfig` successfully brought the node back up; no other corrections were necessary.
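To track which nodes have been hit over time, the node names can be pulled out of the "bad node_offset" dealloc errors with a small pipeline. This is just a sketch: the log path matches the one used in the greps above, and the sed expression assumes the error text keeps the exact "dealloc of node <name>" wording shown in these logs.

```shell
# Sketch: extract the unique node names from the gres/gpu dealloc errors,
# to see which nodes have been affected and how widespread the issue is.
grep 'dealloc of node' /var/log/slurm/slurmctld.log \
  | sed -E 's/.*dealloc of node ([^ ]+).*/\1/' \
  | sort -u
```

Piping through `sort -u` collapses repeat offenders (like g003 and g239 above) into one line each, which makes it easy to compare against the idle-but-allocated nodes in sinfo.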
William,

Thanks for the report. I guess there must be another root cause. I will look into other ways this could be happening.

-Scott
William,

Could you send me all the logs around this one, perhaps an hour before and after:

>[2023-04-02T05:37:17.435] error: gres/gpu: job 2281080 dealloc of node g003 bad node_offset 0 count is 0

-Scott
William,

Could I also get your cgroup.conf, as well as the full slurmctld log and the slurmd log for the failed node, for a day with an error like this:

>slurmctld.log:17272435:[2023-02-14T08:44:06.037] error: gres/gpu: job 1742877 dealloc of node g239 bad node_offset 0 count is 0

-Scott
[root@slurm01 slurm]# cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
Created attachment 30020 [details] g003 slurmd.log
Created attachment 30021 [details] slurmctld.log
William,

I found a way to reproduce the issue, and from that I think I found and fixed the bug. The fix is now on GitHub and should be part of the 23.02.2 release. See commit 5c3a7f6aaf.

-Scott
*** Ticket 16084 has been marked as a duplicate of this ticket. ***
I am going to close this issue as fixed. If you see it again after upgrading to 23.02.2, please reopen this ticket.

-Scott
*** Ticket 16224 has been marked as a duplicate of this ticket. ***
*** Ticket 17917 has been marked as a duplicate of this ticket. ***