| Summary: | GPU showing allocated even when node is idle | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | William Dizon <wdizon> |
| Component: | GPU | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian.gilmer, cschwarz, jdamicis |
| Version: | 22.05.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | ASU | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | Rocky Linux | Machine Name: | |
| CLE Version: | | Version Fixed: | 23.02.2 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | sinfo output, gres.conf, slurm.conf, job_example_1, job_example_2, job_example_3, test patch, g003 slurmd.log, slurmctld.log | | |
Description
William Dizon
2023-02-24 14:03:47 MST
Created attachment 29041 [details]
sinfo output
Created attachment 29042 [details]
gres.conf
Created attachment 29043 [details]
slurm.conf
Created attachment 29044 [details]
job_example_1
Created attachment 29045 [details]
job_example_2
Created attachment 29046 [details]
job_example_3
This error message is from a bug that we fixed in 23.02 (which will be released this month).

> slurmctld.log:14474110:[2023-01-05T19:22:08.797] error: gres/gpu: job 1614821 dealloc of node g106 bad node_offset 0 count is 0

I don't know for sure whether your "GRES in use" issue is associated with that bug, though it does look related. Has this issue occurred more than once? Do you have a way of reproducing it? If you restart the slurmctld and slurmd's, are you able to use those nodes again?

```
commit 56bbd738f77fe7fbfc856d0247aa1d9429ec2b1c
Author: Scott Hilton <scott@schedmd.com>
Date:   Wed Dec 7 16:32:53 2022 -0700

    Remove unused parameted from job_res_rm_job()

    Bug 15145

commit d3053cf0d66732fb13ad047eb32f983ca6dc2f38
Author: Scott Hilton <scott@schedmd.com>
Date:   Wed Dec 7 15:44:02 2022 -0700

    Use gres_list_alloc for gres_ctld_job_dealloc()

    When testing for backfill or preempt. This is ok because job_gres_list
    is not altered in gres_ctld_job_dealloc() unless resize is true (which
    it isn't in select_g_job_test()). Furthermore, if there is gres to
    dealloc the gres_list_alloc must already exist.

    Bug 15145
```

Yes, all the nodes in the sinfo output showing IDLE with gpu:a100:4(IDX:0-3) are impacted by this; within the last 24 hours or so, at least six nodes exhibited this behavior. `scontrol reconfig` fixes the issue for all nodes, as it did the first time (about a week ago), but the issue reemerged. I have attached the sbatch scripts for three of the jobs that triggered this behavior: all three requested GPUs, failed to some degree, and then caused this issue.

Created attachment 29074 [details]
test patch
I have so far been unable to reproduce this issue.
This patch may fix it, but I am not certain. The change was added for the 23.02 release and should be stable; we didn't backport it to 22.05 because we didn't see any immediate bugs it fixed.
If you choose to test it, let me know if it seems to stop the issue.
Otherwise, procedures on how to reliably reproduce this behavior would be very useful.
-Scott
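While waiting on a reliable reproducer, the stuck state itself (a node reported idle while its GRES index list still shows GPUs in use) can be spotted by scanning sinfo output, as William did by eye. Below is a minimal sketch; the `<node> <state> <gres> <gres_used>` field layout and the `(IDX:N/A)` idle form are assumptions based on the lines quoted in this ticket, not an exact sinfo format specification:

```python
import re

def find_stuck_gpu_nodes(sinfo_lines):
    """Flag nodes reported idle while GresUsed still lists GPU indices.

    Expects lines of the form "<node> <state> <gres> <gres_used>",
    e.g. "g003 idle gpu:a100:4 gpu:a100:4(IDX:0-3)".
    """
    stuck = []
    for line in sinfo_lines:
        parts = line.split()
        if len(parts) < 4:
            continue
        node, state, _gres, gres_used = parts[:4]
        # A healthy idle node should report no allocated indices.
        m = re.search(r"\(IDX:([^)]*)\)", gres_used)
        if state.startswith("idle") and m and m.group(1) not in ("", "N/A"):
            stuck.append(node)
    return stuck

sample = [
    "g003 idle gpu:a100:4 gpu:a100:4(IDX:0-3)",   # stuck: idle but GPUs shown in use
    "g004 idle gpu:a100:4 gpu:a100:0(IDX:N/A)",   # healthy idle node
    "g005 alloc gpu:a100:4 gpu:a100:4(IDX:0-3)",  # legitimately allocated
]
print(find_stuck_gpu_nodes(sample))  # → ['g003']
```

A hit list like this could then be fed to `scontrol reconfig` monitoring, since that was the workaround that cleared the state each time.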
William,

Did you try testing with the patch? Have you seen the issue since then? Any other updates?

-Scott

Hi Scott, thanks for the follow-up. This issue only seems to occur in the wild in our prod environment, never in dev, so I haven't applied the patch yet. That said, we are performing scheduled maintenance on the 20th, during which we are upgrading to 23.02, so if the patch functionality is in that release, it will be in effect soon. I can report back with results.

William,

OK, let me know if you have any new information on this. How often are you seeing this issue come up in your cluster? Was it just the once, or are you seeing it reoccur often?

-Scott

William,

How did the upgrade go?

-Scott

A mixed bag, but largely not super positive. It took until yesterday for a repeat:

```
g003 idle gpu:a100:4 gpu:a100:4(IDX:0-3)
```

g003 slurmd.log:

```
[2023-04-04T14:20:45.891] [2281651.batch] error: *** JOB 2281651 ON g003 CANCELLED AT 2023-04-04T14:20:45 ***
[2023-04-04T14:20:45.891] [2281651.0] error: *** STEP 2281651.0 ON g003 CANCELLED AT 2023-04-04T14:20:45 ***
[2023-04-04T14:20:45.920] [2281651.extern] done with job
[2023-04-04T14:20:47.068] [2281651.batch] done with job
[2023-04-04T14:21:19.287] [2281651.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2023-04-04T14:21:19.293] [2281651.0] done with job
```

slurmctld.log:

```
[2023-04-04T14:20:45.870] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=2281651 uid 1311736
```

This ^^ is all from April 4.
Upon further review of the g003 node, I see these, which I didn't expect to see since the upgrade back in March:

```
[root@slurm01 ~]# grep "dealloc" /var/log/slurm/slurmctld.log
<snipped old>
[2023-03-24T13:26:45.554] error: gres/gpu: job 2129875 dealloc of node g003 bad node_offset 0 count is 0
[2023-03-24T13:26:56.468] error: gres/gpu: job 2129874 dealloc of node g003 bad node_offset 0 count is 0
[2023-04-02T05:37:17.435] error: gres/gpu: job 2281080 dealloc of node g003 bad node_offset 0 count is 0
```

Here's the timeframe for the upgrade:

```
[root@slurm01 ~]# grep "slurmctld version" /var/log/slurm/slurmctld.log
<snipped older>
[2023-03-20T15:19:57.674] slurmctld version 22.05.7 started on cluster sol
[2023-03-21T17:16:47.222] slurmctld version 23.02.0 started on cluster sol
```

So within three days of the upgrade (all nodes in the cluster successfully upgraded to 23.02.0), the dealloc issue returned. `scontrol reconfig` successfully brought it back up; no other corrections were necessary.

William,

Thanks for the report. I guess there must be another root cause. I will look into other ways this could be happening.

-Scott

William,
Could you send me all the logs around this one, perhaps an hour before and after?
>[2023-04-02T05:37:17.435] error: gres/gpu: job 2281080 dealloc of node g003 bad node_offset 0 count is 0
-Scott
William,
Could I also get your cgroup.conf, as well as the full slurmctld log and the slurmd log for the failed node, from a day with an error like this?
>slurmctld.log:17272435:[2023-02-14T08:44:06.037] error: gres/gpu: job 1742877 dealloc of node g239 bad node_offset 0 count is 0
-Scott
```
[root@slurm01 slurm]# cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```

Created attachment 30020 [details]
g003 slurmd.log
Created attachment 30021 [details]
slurmctld.log
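For tracking recurrence across a large slurmctld log like the one attached, the bad-dealloc error lines quoted in this ticket can be parsed into (timestamp, job, node) tuples rather than eyeballed from raw grep output. A minimal sketch, assuming the log line format shown above (the helper name is illustrative, not part of Slurm):

```python
import re

# Matches the error lines quoted in this ticket, e.g.:
# [2023-03-24T13:26:45.554] error: gres/gpu: job 2129875 dealloc of node g003 bad node_offset 0 count is 0
DEALLOC_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] error: gres/gpu: job (?P<job>\d+) "
    r"dealloc of node (?P<node>\S+) bad node_offset"
)

def dealloc_errors(log_lines):
    """Return (timestamp, job_id, node) for each bad-dealloc error line."""
    out = []
    for line in log_lines:
        m = DEALLOC_RE.search(line)
        if m:
            out.append((m.group("ts"), int(m.group("job")), m.group("node")))
    return out

log = [
    "[2023-03-24T13:26:45.554] error: gres/gpu: job 2129875 dealloc of node g003 bad node_offset 0 count is 0",
    "[2023-03-21T17:16:47.222] slurmctld version 23.02.0 started on cluster sol",
    "[2023-04-02T05:37:17.435] error: gres/gpu: job 2281080 dealloc of node g003 bad node_offset 0 count is 0",
]
for ts, job, node in dealloc_errors(log):
    print(ts, job, node)
```

Grouping the tuples by node makes it easy to see, for example, that g003 hit the error both before and after the 23.02.0 upgrade, which is what pointed to a second root cause.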
William,

I found a way to reproduce the issue. From that, I think I found and fixed the bug. The fix is now on GitHub and should be a part of the 23.02.2 release. See commit 5c3a7f6aaf.

-Scott

*** Ticket 16084 has been marked as a duplicate of this ticket. ***

I am going to close this issue as fixed. If you see this issue again after upgrading to 23.02.2, please reopen this ticket.

-Scott

*** Ticket 16224 has been marked as a duplicate of this ticket. ***

*** Ticket 17917 has been marked as a duplicate of this ticket. ***
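Since the fix ships in 23.02.2, a site hitting this symptom can first check whether its running version already contains the fix before reopening. A minimal sketch of that version gate, assuming the usual `major.minor.micro` Slurm version string (the helper is illustrative, not part of Slurm):

```python
def has_bad_dealloc_fix(version):
    """True if a Slurm version string is >= 23.02.2, where this bug is fixed."""
    parts = tuple(int(p) for p in version.split("."))
    return parts >= (23, 2, 2)

print(has_bad_dealloc_fix("22.05.7"))  # version from this ticket's report → False
print(has_bad_dealloc_fix("23.02.0"))  # post-upgrade version, still affected → False
print(has_bad_dealloc_fix("23.02.2"))  # release carrying commit 5c3a7f6aaf → True
```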