Skyler asked me to open this issue as a separate bug: it seems like the scheduler isn't tracking allocated resources correctly. For example, this message is being spammed:

Sep 2 08:21:42 scheduler2.grace.hpc.yale.internal slurmctld[203336]: error: gres/gpu: job 1426844 dealloc node c39n02 type v100 gres count underflow (0 1)

because these two jobs are running (edited for privacy):

[root@mgt2.grace ~]# scontrol show job 1427334
JobId=1427334 JobName=6xez_25
   JobState=RUNNING Reason=None Dependency=(null)
   Partition=gpu AllocNode:Sid=grace1:38298
   NodeList=c39n02
   BatchHost=c39n02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=1,mem=10G,node=1,billing=10,gres/gpu=1,gres/gpu:v100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=v100|a100 DelayBoot=00:00:00
   TresPerJob=gres:gpu:1

[root@mgt2.grace ~]# scontrol show job 1426774
JobId=1426774 JobName=25tre2
   JobState=RUNNING Reason=None Dependency=(null)
   Partition=gpu AllocNode:Sid=grace1:38298
   NodeList=c39n02
   BatchHost=c39n02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=1,mem=10G,node=1,billing=10,gres/gpu=1,gres/gpu:v100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=v100|a100 DelayBoot=00:00:00
   TresPerJob=gres:gpu:1

^ so each of those jobs requested 1 GPU.
But the scheduler sees:

[root@mgt2.grace ~]# scontrol show node c39n02
NodeName=c39n02 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=2 CPUEfctv=24 CPUTot=24 CPULoad=2.04
   AvailableFeatures=skylake,avx512,6136,doubleprecision,common,bigtmp,v100
   ActiveFeatures=skylake,avx512,6136,doubleprecision,common,bigtmp,v100
   Gres=gpu:v100:2(S:0-1)
   NodeAddr=c39n02 NodeHostName=c39n02 Version=22.05.2
   OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Wed Jun 15 08:55:08 UTC 2022
   RealMemory=92160 AllocMem=20480 FreeMem=14255 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=11203 Owner=N/A MCS_label=N/A
   Partitions=admintest,gpu,scavenge,scavenge_gpu
   BootTime=2022-08-03T15:22:56 SlurmdStartTime=2022-08-03T15:27:28
   LastBusyTime=2022-09-01T07:46:38
   CfgTRES=cpu=24,mem=90G,billing=24,gres/gpu=2,gres/gpu:v100=2
   AllocTRES=cpu=2,mem=20G,gres/gpu=2,gres/gpu:v100=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

^ only one "gres/gpu:v100" is shown as allocated, when it should be 2. Is this a bug?

Thanks,
Adam
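One way to spot this kind of mismatch before the underflow errors appear is to total the gres/gpu counts of running jobs per node and compare against each node's AllocTRES. The awk helper below is a sketch, not Slurm tooling: the function name `sum_gpu_by_node` is my own, and it assumes input lines of the form "<node> <tres-string>" (roughly what `squeue -t R -h -O NodeList,tres-alloc` produces, though squeue output formatting varies by version):

```shell
# Hypothetical helper: sum gres/gpu allocations per node from
# "<node> <tres-list>" lines on stdin. On a live system, feed it real
# squeue output and compare the totals with each node's AllocTRES.
sum_gpu_by_node() {
  awk '{
    node = $1
    n = split($2, tres, ",")
    for (i = 1; i <= n; i++)
      # match only the untyped gres/gpu=N entry, not gres/gpu:v100=N
      if (tres[i] ~ /^gres\/gpu=/) {
        split(tres[i], kv, "=")
        total[node] += kv[2]
      }
  } END { for (node in total) print node, total[node] }'
}
```

For the two jobs above, two "c39n02 ...,gres/gpu=1" lines would total to "c39n02 2", which disagrees with the node's AllocTRES showing gres/gpu:v100=1.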
Please see https://bugs.schedmd.com/show_bug.cgi?id=14813 for our configuration.
Thanks for logging the bug, Adam. Can you give me a reproducer? What are you doing to submit the jobs?
Hi Caden,

I'm sorry, but I cannot give you a reproducer at this time. I've attempted using the following as a reproducer (basically a copy of the sbatch script used by a researcher that resulted in the errors):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=ralloc_test
#SBATCH --gpus=1
#SBATCH -C "v100|a100"
#SBATCH --mem=10G
#SBATCH --time=24:00:00
#SBATCH --mail-type=END
#SBATCH --nodes=1

module purge
module load CUDA

echo "hello"; sleep 300;

The job did not result in a scheduler error or any apparent resource-tracking problem. The manifestation may depend on the running state of slurmctld, which will be far more difficult for us to reproduce.

Thanks,
Adam
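Since the original errors came from many identical one-GPU jobs landing on the same node at nearly the same time, a single submission may not be enough to trigger the race. A minimal burst-submission sketch, assuming the script above is saved as ralloc_test.sbatch (the `submit_burst` helper name is mine, and SUBMIT defaults to a dry-run echo so it can be sanity-checked off-cluster; set SUBMIT=sbatch to actually submit):

```shell
# Hypothetical helper: submit N identical copies of a batch script
# back-to-back, to mimic the researcher's burst of one-GPU jobs.
submit_burst() {
  # usage: submit_burst <count> <script>
  # SUBMIT defaults to a dry-run echo; set SUBMIT=sbatch on the real cluster
  : "${SUBMIT:=echo sbatch}"
  for i in $(seq 1 "$1"); do
    $SUBMIT "$2"
  done
}
```

For example, `SUBMIT=sbatch submit_burst 8 ralloc_test.sbatch` would queue eight copies at once, closer to the workload that produced the underflow messages.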
Adam, I am going to close this bug out, since you are dealing with controller issues with Skyler right now and this bug may be an unrelated side effect of that (and since you don't have a reproducer). If that issue gets resolved and you still see this problem, or you get a reproducer, please open this bug back up. Caden
Adam, I found some more info on this type of bug so I am opening it back up. Have you touched the PriorityCalcPeriod parameter at all? Is it less than 60 seconds? What are the durations of the jobs that are triggering the underflow message? Caden
Hi Caden,

Sorry for not getting back to you sooner. PriorityCalcPeriod is set to 60 seconds. I'm not sure of the average duration, but in the most recent cases the range was from 8.5 to 15.5 hours. Here's a list of them (all from the same user; note that not all jobs submitted by this user resulted in the error):

JobID      Elapsed    Submit               Start
2211717    15:23:22   2022-09-09T11:05:12  2022-09-11T14:43:10   -- 1 error occurred on Sep 12 06:06:32
2211718    15:20:05   2022-09-09T11:05:12  2022-09-11T14:43:40   -- 36 errors occurred within the same second on Sep 11 14:43:47
2211719    15:21:05   2022-09-09T11:05:12  2022-09-11T14:43:40   -- 36 errors occurred within the same second on Sep 11 14:43:47
2211720    15:21:52   2022-09-09T11:05:12  2022-09-11T14:43:40   -- 1 error occurred on Sep 12 06:05:32
2211727    10:44:24   2022-09-09T11:05:13  2022-09-11T22:49:41   -- 1 error occurred on Sep 12 09:34:05
2211730    09:05:54   2022-09-09T11:05:13  2022-09-12T00:28:16   -- 1 error occurred on Sep 12 09:34:10
2211737    08:38:58   2022-09-09T11:05:13  2022-09-12T00:55:22   -- 1 error occurred on Sep 12 09:34:20
2211738    08:32:49   2022-09-09T11:05:13  2022-09-12T01:01:13   -- 1 error occurred on Sep 12 09:34:02
2211740    07:49:32   2022-09-09T11:05:13  2022-09-12T01:44:44   -- 1 error occurred on Sep 12 09:34:16
2211744    03:31:50   2022-09-09T11:05:13  2022-09-12T06:02:49   -- 1 error occurred on Sep 12 09:34:39
2211745    03:30:33   2022-09-09T11:05:13  2022-09-12T06:03:47   -- 1 error occurred on Sep 12 09:34:20

Every error in the above cases was of the format:

<DATE> scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job <JOBID> dealloc node <NODE> type p100 gres count underflow (0 1)

where <DATE> and <JOBID> are as defined above, and <NODE> ranged over gpu[03,05-07,09-10]:

[ACTIVE@scheduler2:/var/log] cat /opt/slurm/current/etc/gres.conf | egrep "gpu\[0"
NodeName=gpu[02-10] Name=gpu Type=p100 File=/dev/nvidia[0-1] CPUs=[0-11]
NodeName=gpu[02-10] Name=gpu Type=p100 File=/dev/nvidia[2-3] CPUs=[12-23]
[ACTIVE@scheduler2:/var/log] cat /opt/slurm/current/etc/slurm.conf.d/nodes | egrep "gpu\[0"
NodeName=gpu[02-10] Weight=10403 Feature=skylake,avx512,6136,doubleprecision,pi,p100 RealMemory=185344 Gres=gpu:p100:4

As you might have guessed from the job IDs, these were not the only p100 jobs that user submitted around that time; for example, we saw no errors from:

[ACTIVE@scheduler2:/var/log] sacct -ojobid,Elapsed,Submit,Start,NodeList -j 2211716,2211721
JobID         Elapsed    Submit               Start                NodeList
2211716       15:21:33   2022-09-09T11:05:12  2022-09-11T14:43:10  gpu09
2211716.bat+  15:21:33   2022-09-11T14:43:10  2022-09-11T14:43:10  gpu09
2211716.ext+  15:21:33   2022-09-11T14:43:10  2022-09-11T14:43:10  gpu09
2211721       15:20:49   2022-09-09T11:05:12  2022-09-11T14:43:40  gpu10
2211721.bat+  15:20:49   2022-09-11T14:43:40  2022-09-11T14:43:40  gpu10
2211721.ext+  15:20:49   2022-09-11T14:43:40  2022-09-11T14:43:40  gpu10

[ACTIVE@scheduler2:/var/log] egrep "2211716|2211721" messages-20220918
Sep 11 14:43:10 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211716 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:40 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211721 NodeList=gpu10 #CPUs=1 Partition=pi_ohern
Sep 12 06:04:29 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211721 WEXITSTATUS 0
Sep 12 06:04:29 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211721 done
Sep 12 06:04:43 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211716 WEXITSTATUS 0
Sep 12 06:04:43 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211716 done

versus:

[ACTIVE@scheduler2:/var/log] egrep "2211717|2211718" messages-20220918
Sep 11 14:43:10 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211717 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:40 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211718 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:47 scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job 2211718 dealloc node gpu09 type p100 gres count underflow (0 1)
[the above underflow line repeated 36 times within the same second]
Sep 12 06:03:45 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211718 WEXITSTATUS 0
Sep 12 06:03:45 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211718 done
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211717 WEXITSTATUS 0
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job 2211717 dealloc node gpu09 type p100 gres count underflow (0 1)
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211717 done

I don't have the batch script, but I would assume all four of those jobs were being asked to do the same thing. They all requested the same resources; from sacct:

JobID|NodeList|AllocCPUS|ReqMem|ReqTRES|AllocTRES|
2211716|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211717|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211718|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211721|gpu10|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|

Thank you,
Adam
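To condense this kind of log spam into something easier to correlate with sacct output, a small awk pass can tally the underflow messages per (job, node) pair. A sketch (the `underflow_tally` name is mine; it assumes the exact "job <ID> dealloc node <NODE> ... gres count underflow" wording shown in the logs above):

```shell
# Hypothetical helper: count "gres count underflow" errors per (job, node)
# pair in a slurmctld log read from stdin, e.g.:
#   underflow_tally < /var/log/messages-20220918
underflow_tally() {
  awk '/gres count underflow/ {
    # scan the fields for the "job <ID>" and "node <NODE>" tokens
    for (i = 1; i <= NF; i++) {
      if ($i == "job")  job  = $(i + 1)
      if ($i == "node") node = $(i + 1)
    }
    count[job " " node]++
  } END { for (k in count) print k, count[k] }'
}
```

Run against the messages file above, this would report lines like "2211718 gpu09 36", which can then be matched against job start/end times from sacct.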
Correction: "8.5 to 15.5 hours" should have been "3.5 to 15.5 hours"
Hi,

Our cluster is reporting that nodes are overbooked, which is not actually the case (gres/gpu:v100 shows 7 allocated against a configured 4):

CfgTRES=cpu=40,mem=382000M,billing=40,gres/gpu:v100=4,gres/nvme=3600
AllocTRES=cpu=19,mem=360768M,gres/gpu:v100=7,gres/nvme=200

Should I open a separate bug, or is it fine to reuse this one?

BR,
Tommi
Hi Caden,

Just as an update: since we applied the patch (Sep 22 16:48) to 22.05.2, we are still seeing the "gres count underflow" errors, so these may not be related to the backfill deadlock problem, and we still have a problem.

Thank you,
Adam
Hey Adam and Tommi,

Adam, I am still attempting to reproduce your error messages. Do you have a gres.conf, and if so, could you attach it? Have you ever seen something like this along with your underflow messages?

[2022-09-29T17:51:33.287] error: gres/gpu: job 9897 dealloc of node n15 bad node_offset 0 count is 0

Tommi, go ahead and log another bug. When you do, attach a slurmctld.log for reference.

I will let you know as soon as I have an update.

Caden Ellis
Adam, I see the output from your gres.conf, no need to attach that.
Hi Caden, We have no cases of "dealloc of node" in any of our messages files. Thank you, Adam
Adam,

I haven't been able to reproduce the same errors you are getting. If you can consistently make those error messages appear with a certain workload or job script, could you set the debug level to "debug" (or "debug2", if your cluster isn't very busy at that moment) and also enable the "Gres" debug flag right before running those jobs? This can be done with scontrol. After the run is finished, be sure to revert the debug level and debug flags, as they add extra strain on a production system.

If you cannot reproduce with a specific script, but you can predict a time frame within a day or less in which the errors will happen, setting those flags for that window would also help. If your system is too busy for the higher debug level, the Gres flag alone could provide some hints.

If you can get that done, attach the slurmctld.log here.

Caden Ellis
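The scontrol steps described above can be sketched as below. The `gres_debug` wrapper and the RUN dry-run guard are mine, not Slurm's, and reverting to "info" assumes that is the cluster's normal SlurmctldDebug level (check slurm.conf before relying on that):

```shell
# Hypothetical wrapper around the scontrol commands described above.
# usage: gres_debug on|off
# Set RUN=echo for a dry run; leave RUN unset to actually apply the changes.
gres_debug() {
  case "$1" in
    on)
      $RUN scontrol setdebug debug        # or debug2 on a quiet cluster
      $RUN scontrol setdebugflags +Gres   # enable gres debug messages
      ;;
    off)
      $RUN scontrol setdebug info         # assumes SlurmctldDebug=info normally
      $RUN scontrol setdebugflags -Gres   # disable the gres debug flag
      ;;
  esac
}
```

Typical use would be `gres_debug on` just before the suspect jobs run, then `gres_debug off` once they finish, so the elevated logging is bounded to that window.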
Hello Adam, any update on this? Has it continued to be an issue? If so, please refer to my previous comment.
Hi Caden,

I'm concerned about generating too much log data if we raise the logging level in our larger environment as you suggested: since we don't have a way to reproduce the problem on demand, we can't bound the duration of the enhanced logging. It has also been almost a month since we last saw these errors in the log files. It's possible that user behavior has changed since then, or that the patched 22.05.2 addressed the problem (if not immediately). I think we can close or "hold" the matter for now; if the problem resurfaces, we can reconsider additional debug-level logging.

Thank you,
Adam
Adam, I'll go ahead and close this for now, then. You can reopen it if it becomes an issue again. Caden