Ticket 16121 - GPU showing allocated even when node is idle
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 22.05.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Scott Hilton
Duplicates: 16084 16224 17917
 
Reported: 2023-02-24 14:03 MST by William Dizon
Modified: 2023-10-24 09:16 MDT
CC List: 3 users

See Also:
Site: ASU
Linux Distro: Rocky Linux
Version Fixed: 23.02.2


Attachments
sinfo output (13.97 KB, text/plain) - 2023-02-24 14:04 MST, William Dizon
gres.conf (3.59 KB, text/plain) - 2023-02-24 14:04 MST, William Dizon
slurm.conf (4.98 KB, text/plain) - 2023-02-24 14:04 MST, William Dizon
job_example_1 (8.78 KB, text/plain) - 2023-02-24 14:04 MST, William Dizon
job_example_2 (10.87 KB, text/plain) - 2023-02-24 14:05 MST, William Dizon
job_example_3 (7.20 KB, text/plain) - 2023-02-24 14:05 MST, William Dizon
test patch (2.73 KB, patch) - 2023-02-27 16:27 MST, Scott Hilton
g003 slurmd.log (5.92 KB, text/plain) - 2023-04-26 10:36 MDT, William Dizon
slurmctld.log (1.27 MB, text/plain) - 2023-04-26 10:36 MDT, William Dizon

Description William Dizon 2023-02-24 14:03:47 MST
Our issue appears to be the same as #15891 and #13600, where GPUs are listed as allocated even though the node itself is idle. Subsequent attempts to use those nodes pend indefinitely with reason (Resources).

$ sinfo -p general -O nodehost,statelong,gres,gresused
<snipped, but provided in full as attachment>
g004                idle                gpu:a100:4          gpu:a100:4(IDX:0-3) 
g007                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g011                idle                gpu:a100:4          gpu:a100:4(IDX:0-3) 
g012                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g014                idle                gpu:a100:4          gpu:a100:4(IDX:0-3) 
g015                idle                gpu:a100:4          gpu:a100:4(IDX:0-3) 
g016                idle                gpu:a100:4          gpu:a100:4(IDX:0-3) 
g017                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g025                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g026                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g035                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g036                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g037                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g038                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g039                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g040                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g041                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g042                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g043                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g044                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g045                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g046                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g047                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g048                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g049                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g050                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g051                idle                gpu:a100:4          gpu:a100:0(IDX:N/A) 
g230                idle                gpu:a30:3           gpu:a30:0(IDX:N/A)  
g231                idle                gpu:a30:3           gpu:a30:0(IDX:N/A)  
g232                idle                gpu:a30:3           gpu:a30:0(IDX:N/A)  
g233                idle                gpu:a30:3           gpu:a30:0(IDX:N/A)  
g234                idle                gpu:a30:3           gpu:a30:0(IDX:N/A)  
g239                idle                gpu:a100:4          gpu:a100:4(IDX:0-3)
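
For triage, a rough one-liner along these lines (assuming the same sinfo column layout and single-GRES nodes as above) lists the idle nodes that still report allocated GPUs:

# idle nodes whose GresUsed is not "...:0(IDX:N/A)", i.e. GPUs still marked allocated
sinfo -p general -h -O nodehost,statelong,gres,gresused | \
  awk '$2 == "idle" && $4 !~ /:0\(IDX:N\/A\)/ {print $1, $4}'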



scontrol output for this node looks like that of the other nodes experiencing this issue:

$ scontrol show node g014
NodeName=g014 Arch=x86_64 CoresPerSocket=24 
   CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.00
   AvailableFeatures=public,debug,long,epyc
   ActiveFeatures=public,debug,long,epyc
   Gres=gpu:a100:4
   NodeAddr=g014 NodeHostName=g014 Version=22.05.7
   OS=Linux 4.18.0-348.el8.0.2.x86_64 #1 SMP Sun Nov 14 00:51:12 UTC 2021 
   RealMemory=515300 AllocMem=0 FreeMem=506165 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=4096 Weight=2500 Owner=N/A MCS_label=N/A
   Partitions=general,htc 
   BootTime=2023-02-13T10:57:24 SlurmdStartTime=2023-02-21T10:32:26
   LastBusyTime=2023-02-23T15:52:47
   CfgTRES=cpu=48,mem=515300M,billing=353,gres/gpu=4,gres/gpu:a100=4
   AllocTRES=gres/gpu=4,gres/gpu:a100=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


We have been using 22.05.7 since its release, but we have added 51 GPU nodes (g0[01-51]) since then. All slurmd versions are the same across the cluster.
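
One illustrative way to double-check that uniformity, using the Version= field visible in scontrol show node output:

# count distinct slurmd versions reported by the controller
scontrol show nodes | grep -o 'Version=[^ ]*' | sort | uniq -c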


Our slurmctld.log is enormous, so here are some snippets:

[root@slurm01 slurm]# grep -rn dealloc
slurmctld.log:28348:[2022-05-06T11:06:09.909] error: deallocate_nodes: JobId=661 allocated no nodes to be killed on
slurmctld.log:14474110:[2023-01-05T19:22:08.797] error: gres/gpu: job 1614821 dealloc of node g106 bad node_offset 0 count is 0
slurmctld.log:14523432:[2023-01-06T22:29:25.758] error: gres/gpu: job 1616654 dealloc of node g106 bad node_offset 0 count is 0
slurmctld.log:14566623:[2023-01-07T21:07:31.705] error: gres/gpu: job 1617927 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:16393192:[2023-02-02T11:14:04.003] error: gres/gpu: job 1696811 dealloc of node cg002 bad node_offset 0 count is 0
slurmctld.log:16413210:[2023-02-02T16:37:29.709] error: gres/gpu: job 1699088 dealloc of node cg004 bad node_offset 0 count is 0
slurmctld.log:16449401:[2023-02-03T02:31:29.049] error: gres/gpu: job 1702489 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:16469242:[2023-02-03T08:36:01.366] error: gres/gpu: job 1699011 dealloc of node cg002 bad node_offset 0 count is 0
slurmctld.log:16679082:[2023-02-06T12:39:41.415] error: gres/gpu: job 1721794 dealloc of node cg001 bad node_offset 0 count is 0
slurmctld.log:16679118:[2023-02-06T12:40:33.241] error: gres/gpu: job 1721595 dealloc of node cg001 bad node_offset 0 count is 0
slurmctld.log:16679121:[2023-02-06T12:40:33.242] error: gres/gpu: job 1721802 dealloc of node cg001 bad node_offset 0 count is 0
slurmctld.log:16840647:[2023-02-08T10:43:08.467] error: gres/gpu: job 1719682 dealloc of node g237 bad node_offset 0 count is 0
slurmctld.log:16994250:[2023-02-11T01:53:18.555] error: gres/gpu: job 1703873 dealloc of node cg004 bad node_offset 0 count is 0
slurmctld.log:17272435:[2023-02-14T08:44:06.037] error: gres/gpu: job 1742877 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:17638218:[2023-02-19T01:22:05.837] error: gres/gpu: job 1774139 dealloc of node g009 bad node_offset 0 count is 0
slurmctld.log:17640942:[2023-02-19T03:11:40.629] error: gres/gpu: job 1807480 dealloc of node g005 bad node_offset 0 count is 0
slurmctld.log:17721724:[2023-02-21T10:50:25.242] error: gres/gpu: job 1779081 dealloc of node g235 bad node_offset 0 count is 0
slurmctld.log:17755597:[2023-02-22T05:06:58.882] error: gres/gpu: job 1826767 dealloc of node g239 bad node_offset 0 count is 0
slurmctld.log:17763523:[2023-02-22T12:02:42.020] error: gres/gpu: job 1829514 dealloc of node g016 bad node_offset 0 count is 0
slurmctld.log:17786124:[2023-02-22T21:44:10.586] error: gres/gpu: job 1831266 dealloc of node g011 bad node_offset 0 count is 0
slurmctld.log:17791561:[2023-02-23T01:04:05.048] error: gres/gpu: job 1831638 dealloc of node g004 bad node_offset 0 count is 0
slurmctld.log:17793734:[2023-02-23T01:56:25.739] error: gres/gpu: job 1832132 dealloc of node g015 bad node_offset 0 count is 0
slurmctld.log:17816174:[2023-02-23T13:00:03.628] error: gres/gpu: job 1831900 dealloc of node g033 bad node_offset 0 count is 0
slurmctld.log:17832248:[2023-02-23T15:52:47.141] error: gres/gpu: job 1833842 dealloc of node g014 bad node_offset 0 count is 0


These GPU hosts were added over a week ago (around the 14th), and the issue first came up on a few nodes. scontrol reconfigure brought them back online, but the issue has since returned. No other nodes have been added, and there have been no gres.conf, slurm.conf, or hardware changes since that initial issue cropped up.
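
For reference, the workaround amounts to something like the following (g014 is only an illustrative node name):

scontrol reconfigure                                  # clears the stale GPU allocations cluster-wide
sinfo -n g014 -O nodehost,statelong,gres,gresused     # GresUsed should drop back to gpu:a100:0(IDX:N/A)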

Also attached are three files corresponding to three jobs that were using the GPUs at the time of the failure. The jobs ended and the accounting appears finalized, but the GPUs never returned to a usable state.

These files contain the relevant slurmctld logs, slurmd logs, and jobscript specifics.
Comment 1 William Dizon 2023-02-24 14:04:09 MST
Created attachment 29041 [details]
sinfo output
Comment 2 William Dizon 2023-02-24 14:04:20 MST
Created attachment 29042 [details]
gres.conf
Comment 3 William Dizon 2023-02-24 14:04:35 MST
Created attachment 29043 [details]
slurm.conf
Comment 4 William Dizon 2023-02-24 14:04:51 MST
Created attachment 29044 [details]
job_example_1
Comment 5 William Dizon 2023-02-24 14:05:03 MST
Created attachment 29045 [details]
job_example_2
Comment 6 William Dizon 2023-02-24 14:05:16 MST
Created attachment 29046 [details]
job_example_3
Comment 7 Jason Booth 2023-02-24 15:33:41 MST
This error message is from a bug that we fixed in 23.02 (which will be released this month):
>slurmctld.log:14474110:[2023-01-05T19:22:08.797] error: gres/gpu: job 1614821 dealloc of node g106 bad node_offset 0 count is 0

I don't know for sure whether your "GRES in use" issue is associated with that bug, though it does look related.

Has this issue occurred more than once? Do you have a way of reproducing the issue?

If you restart the slurmctld and slurmds, are you able to use those nodes again?
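
For reference, on a systemd-managed install that restart would look roughly like this (unit names can differ per site):

systemctl restart slurmctld    # on the controller
systemctl restart slurmd       # on each affected compute node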



commit 56bbd738f77fe7fbfc856d0247aa1d9429ec2b1c
Author: Scott Hilton <scott@schedmd.com>
Date:   Wed Dec 7 16:32:53 2022 -0700

    Remove unused parameted from job_res_rm_job()

    Bug 15145

commit d3053cf0d66732fb13ad047eb32f983ca6dc2f38
Author: Scott Hilton <scott@schedmd.com>
Date:   Wed Dec 7 15:44:02 2022 -0700

    Use gres_list_alloc for gres_ctld_job_dealloc()

    When testing for backfill or preempt.
    This is ok because job_gres_list is not altered in
    gres_ctld_job_dealloc() unless resize is true (which it isn't in
    select_g_job_test()). Furthermore, if there is gres to dealloc the
    gres_list_alloc must already exist.

    Bug 15145
Comment 8 William Dizon 2023-02-24 15:40:17 MST
Yes, the nodes in the sinfo output shown as idle with gpu:a100:4(IDX:0-3) are all impacted by this; that means within the last 24 hours or so, at least six nodes exhibited this behavior.

scontrol reconfig fixes the issue for all nodes, as it did the first time (about a week ago), but the issue reemerged. I have provided the sbatch scripts for three of the jobs that triggered it; all three requested GPUs and failed to some degree.
Comment 9 Scott Hilton 2023-02-27 16:27:50 MST
Created attachment 29074 [details]
test patch

I have so far been unable to reproduce this issue.

This patch may fix it, but I am unsure. The change was added to the 23.02 release and should be stable, but we didn't add it to 22.05 because we didn't see any immediate bugs it fixed.

If you choose to test it, let me know if it seems to stop the issue.
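
For anyone testing, a hypothetical workflow against a 22.05.7 source tree (paths and configure options are illustrative only):

cd slurm-22.05.7
patch -p1 < test_patch.diff                       # the attachment from this comment
./configure --prefix=/opt/slurm && make -j && make install
systemctl restart slurmctld                       # controller-side change, so restart slurmctld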

Otherwise, procedures on how to reliably reproduce this behavior would be very useful.

-Scott
Comment 10 Scott Hilton 2023-03-08 15:42:01 MST
William,

Did you try testing with the patch? Have you seen the issue since then? Any other updates?

-Scott
Comment 11 William Dizon 2023-03-09 08:39:38 MST
Hi Scott, thanks for the follow-up. This issue only seems to occur in the wild in our prod environment, never in our dev, so I haven't applied the patch yet. That said, we are performing scheduled maintenance on the 20th during which we are upgrading to 23.02, so if the patch functionality is included there, it will be in effect soon. I can report back with results.
Comment 12 Scott Hilton 2023-03-13 12:57:39 MDT
William,

Ok, let me know if you have any new information on this.

How often are you seeing this issue come up in your cluster? Was it just the once, or does it recur often?

-Scott
Comment 13 Scott Hilton 2023-03-24 13:20:03 MDT
William, 

How did the upgrade go?

-Scott
Comment 14 William Dizon 2023-04-05 10:55:22 MDT
A mixed bag, but largely not super positive.

It took until yesterday for a repeat:

g003                idle                gpu:a100:4          gpu:a100:4(IDX:0-3)

g003 slurmd.log:

[2023-04-04T14:20:45.891] [2281651.batch] error: *** JOB 2281651 ON g003 CANCELLED AT 2023-04-04T14:20:45 ***
[2023-04-04T14:20:45.891] [2281651.0] error: *** STEP 2281651.0 ON g003 CANCELLED AT 2023-04-04T14:20:45 ***
[2023-04-04T14:20:45.920] [2281651.extern] done with job
[2023-04-04T14:20:47.068] [2281651.batch] done with job
[2023-04-04T14:21:19.287] [2281651.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2023-04-04T14:21:19.293] [2281651.0] done with job


slurmctld.log

[2023-04-04T14:20:45.870] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=2281651 uid 1311736


This ^^ is all from April 4.

Upon further review of the g003 node, I see these, which I didn't expect to see since the upgrade back in March:

[root@slurm01 ~]# grep "dealloc" /var/log/slurm/slurmctld.log
<snipped old>
[2023-03-24T13:26:45.554] error: gres/gpu: job 2129875 dealloc of node g003 bad node_offset 0 count is 0
[2023-03-24T13:26:56.468] error: gres/gpu: job 2129874 dealloc of node g003 bad node_offset 0 count is 0
[2023-04-02T05:37:17.435] error: gres/gpu: job 2281080 dealloc of node g003 bad node_offset 0 count is 0


Here's the timeframe for the upgrade:

[root@slurm01 ~]# grep "slurmctld version" /var/log/slurm/slurmctld.log
<snipped older>
[2023-03-20T15:19:57.674] slurmctld version 22.05.7 started on cluster sol
[2023-03-21T17:16:47.222] slurmctld version 23.02.0 started on cluster sol

So within 3 days of the upgrade (all nodes in the cluster were successfully upgraded to 23.02.0), the dealloc issue returned. `scontrol reconfig` successfully brought the node back up; no other corrections were necessary.
Comment 15 Scott Hilton 2023-04-07 12:33:43 MDT
William,

Thanks for the report. I guess there must be another root cause. I will look into other ways this could be happening.

-Scott
Comment 16 Scott Hilton 2023-04-12 12:52:15 MDT
William, 

Could you send me all the logs around this one, perhaps an hour before and after?
>[2023-04-02T05:37:17.435] error: gres/gpu: job 2281080 dealloc of node g003 bad node_offset 0 count is 0
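
A hypothetical way to carve that window out of a large slurmctld.log, adjusting the timestamp prefix as needed:

# grab 04:00-06:59 on 2023-04-02, i.e. roughly an hour either side of the error
grep -E '^\[2023-04-02T0[4-6]:' /var/log/slurm/slurmctld.log > slurmctld_g003_window.log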

-Scott
Comment 17 Scott Hilton 2023-04-25 09:17:13 MDT
William,

Could I also get your cgroup.conf, as well as the full slurmctld log and slurmd log for the failed node on a day with an error like this?
>slurmctld.log:17272435:[2023-02-14T08:44:06.037] error: gres/gpu: job 1742877 dealloc of node g239 bad node_offset 0 count is 0 

-Scott
Comment 19 William Dizon 2023-04-26 10:35:54 MDT
[root@slurm01 slurm]# cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
Comment 20 William Dizon 2023-04-26 10:36:20 MDT
Created attachment 30020 [details]
g003 slurmd.log
Comment 21 William Dizon 2023-04-26 10:36:36 MDT
Created attachment 30021 [details]
slurmctld.log
Comment 28 Scott Hilton 2023-04-28 16:52:31 MDT
William,

I found a way to reproduce the issue. From that I think I found and fixed the bug.

The fix is now on github and should be a part of the 23.02.2 release. See commit 5c3a7f6aaf.
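
To check whether a given build already contains it (the commit id is the one above; sinfo --version reports the running release):

git branch --contains 5c3a7f6aaf    # in a Slurm git checkout
sinfo --version                     # after upgrading, should report 23.02.2 or later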

-Scott
Comment 29 Scott Hilton 2023-05-08 10:34:29 MDT
*** Ticket 16084 has been marked as a duplicate of this ticket. ***
Comment 30 Scott Hilton 2023-05-08 10:36:58 MDT
I am going to close this issue as fixed. If you see this issue again after upgrading to 23.02.2, please reopen this ticket.

-Scott
Comment 31 Scott Hilton 2023-05-10 12:58:00 MDT
*** Ticket 16224 has been marked as a duplicate of this ticket. ***
Comment 32 Scott Hilton 2023-10-16 13:46:17 MDT
*** Ticket 17917 has been marked as a duplicate of this ticket. ***
Comment 33 Scott Hilton 2023-10-24 09:16:21 MDT
*** Ticket 17917 has been marked as a duplicate of this ticket. ***