Skyler asked me to open this issue as a separate bug: it seems like the scheduler isn't tracking allocated resources correctly. For example, this message is being spammed:

Sep 2 08:21:42 scheduler2.grace.hpc.yale.internal slurmctld[203336]: error: gres/gpu: job 1426844 dealloc node c39n02 type v100 gres count underflow (0 1)

because these two jobs are running (edited for privacy):

[root@mgt2.grace ~]# scontrol show job 1427334
JobId=1427334 JobName=6xez_25
   JobState=RUNNING Reason=None Dependency=(null)
   Partition=gpu AllocNode:Sid=grace1:38298
   NodeList=c39n02
   BatchHost=c39n02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=1,mem=10G,node=1,billing=10,gres/gpu=1,gres/gpu:v100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=v100|a100 DelayBoot=00:00:00
   TresPerJob=gres:gpu:1

[root@mgt2.grace ~]# scontrol show job 1426774
JobId=1426774 JobName=25tre2
   JobState=RUNNING Reason=None Dependency=(null)
   Partition=gpu AllocNode:Sid=grace1:38298
   NodeList=c39n02
   BatchHost=c39n02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=1,mem=10G,node=1,billing=10,gres/gpu=1,gres/gpu:v100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=v100|a100 DelayBoot=00:00:00
   TresPerJob=gres:gpu:1

^ so each of those jobs requested 1 GPU.
But the scheduler sees:

[root@mgt2.grace ~]# scontrol show node c39n02
NodeName=c39n02 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=2 CPUEfctv=24 CPUTot=24 CPULoad=2.04
   AvailableFeatures=skylake,avx512,6136,doubleprecision,common,bigtmp,v100
   ActiveFeatures=skylake,avx512,6136,doubleprecision,common,bigtmp,v100
   Gres=gpu:v100:2(S:0-1)
   NodeAddr=c39n02 NodeHostName=c39n02 Version=22.05.2
   OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Wed Jun 15 08:55:08 UTC 2022
   RealMemory=92160 AllocMem=20480 FreeMem=14255 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=11203 Owner=N/A MCS_label=N/A
   Partitions=admintest,gpu,scavenge,scavenge_gpu
   BootTime=2022-08-03T15:22:56 SlurmdStartTime=2022-08-03T15:27:28
   LastBusyTime=2022-09-01T07:46:38
   CfgTRES=cpu=24,mem=90G,billing=24,gres/gpu=2,gres/gpu:v100=2
   AllocTRES=cpu=2,mem=20G,gres/gpu=2,gres/gpu:v100=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

^ only one "gres/gpu:v100" is shown as allocated, when it should be 2. Is this a bug?

Thanks,
Adam
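One way to spot this kind of mismatch before the underflow errors appear is to total the gres/gpu counts of running jobs per node and compare against each node's AllocTRES. The awk helper below is a sketch, not Slurm tooling: the function name `sum_gpu_by_node` is my own, and it assumes input lines of the form "<node> <tres-string>" (roughly what `squeue -t R -h -O NodeList,tres-alloc` produces, though squeue output formatting varies by version):

```shell
# Hypothetical helper: sum gres/gpu allocations per node from
# "<node> <tres-list>" lines on stdin. On a live system, feed it real
# squeue output and compare the totals with each node's AllocTRES.
sum_gpu_by_node() {
  awk '{
    node = $1
    n = split($2, tres, ",")
    for (i = 1; i <= n; i++)
      # match only the untyped gres/gpu=N entry, not gres/gpu:v100=N
      if (tres[i] ~ /^gres\/gpu=/) {
        split(tres[i], kv, "=")
        total[node] += kv[2]
      }
  } END { for (node in total) print node, total[node] }'
}
```

For the two jobs above, two "c39n02 ...,gres/gpu=1" lines would total to "c39n02 2", which disagrees with the node's AllocTRES showing gres/gpu:v100=1.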
Please see https://bugs.schedmd.com/show_bug.cgi?id=14813 for our configuration.
Thanks for logging the bug, Adam. Can you give me a reproducer? What are you doing to submit the jobs?
Hi Caden,

I'm sorry, but I cannot give you a reproducer at this time. I've attempted using the following as a reproducer (basically a copy of the sbatch script used by a researcher that resulted in the errors):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=ralloc_test
#SBATCH --gpus=1
#SBATCH -C "v100|a100"
#SBATCH --mem=10G
#SBATCH --time=24:00:00
#SBATCH --mail-type=END
#SBATCH --nodes=1

module purge
module load CUDA

echo "hello"; sleep 300;

The job did not result in a scheduler error or any apparent resource-tracking problem. The manifestation may depend on the running state of slurmctld, which will be far more difficult for us to reproduce.

Thanks,
Adam
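Since the original errors came from many identical one-GPU jobs landing on the same node at nearly the same time, a single submission may not be enough to trigger the race. A minimal burst-submission sketch, assuming the script above is saved as ralloc_test.sbatch (the `submit_burst` helper name is mine, and SUBMIT defaults to a dry-run echo so it can be sanity-checked off-cluster; set SUBMIT=sbatch to actually submit):

```shell
# Hypothetical helper: submit N identical copies of a batch script
# back-to-back, to mimic the researcher's burst of one-GPU jobs.
submit_burst() {
  # usage: submit_burst <count> <script>
  # SUBMIT defaults to a dry-run echo; set SUBMIT=sbatch on the real cluster
  : "${SUBMIT:=echo sbatch}"
  for i in $(seq 1 "$1"); do
    $SUBMIT "$2"
  done
}
```

For example, `SUBMIT=sbatch submit_burst 8 ralloc_test.sbatch` would queue eight copies at once, closer to the workload that produced the underflow messages.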
Adam, I am going to close this bug out, since you are dealing with controller issues with Skyler right now and this bug may be an unrelated side effect of that (and since you don't have a reproducer). If that issue gets resolved and you still see this problem, or you get a reproducer, please open this bug back up. Caden
Adam, I found some more info on this type of bug so I am opening it back up. Have you touched the PriorityCalcPeriod parameter at all? Is it less than 60 seconds? What are the durations of the jobs that are triggering the underflow message? Caden
Hi Caden,

Sorry for not getting back to you sooner. PriorityCalcPeriod is set to 60 seconds. I'm not sure of the average duration, but in the most recent cases the range was from 8.5 to 15.5 hours. Here's a list of them (all from the same user; note that not all jobs submitted by this user resulted in the error):

JobID      Elapsed    Submit               Start
2211717    15:23:22   2022-09-09T11:05:12  2022-09-11T14:43:10   -- 1 error occurred on Sep 12 06:06:32
2211718    15:20:05   2022-09-09T11:05:12  2022-09-11T14:43:40   -- 36 errors occurred within the same second on Sep 11 14:43:47
2211719    15:21:05   2022-09-09T11:05:12  2022-09-11T14:43:40   -- 36 errors occurred within the same second on Sep 11 14:43:47
2211720    15:21:52   2022-09-09T11:05:12  2022-09-11T14:43:40   -- 1 error occurred on Sep 12 06:05:32
2211727    10:44:24   2022-09-09T11:05:13  2022-09-11T22:49:41   -- 1 error occurred on Sep 12 09:34:05
2211730    09:05:54   2022-09-09T11:05:13  2022-09-12T00:28:16   -- 1 error occurred on Sep 12 09:34:10
2211737    08:38:58   2022-09-09T11:05:13  2022-09-12T00:55:22   -- 1 error occurred on Sep 12 09:34:20
2211738    08:32:49   2022-09-09T11:05:13  2022-09-12T01:01:13   -- 1 error occurred on Sep 12 09:34:02
2211740    07:49:32   2022-09-09T11:05:13  2022-09-12T01:44:44   -- 1 error occurred on Sep 12 09:34:16
2211744    03:31:50   2022-09-09T11:05:13  2022-09-12T06:02:49   -- 1 error occurred on Sep 12 09:34:39
2211745    03:30:33   2022-09-09T11:05:13  2022-09-12T06:03:47   -- 1 error occurred on Sep 12 09:34:20

Every error in the above cases was of the format:

<DATE> scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job <JOBID> dealloc node <NODE> type p100 gres count underflow (0 1)

where <DATE> and <JOBID> are as defined above, and <NODE> ranged over gpu[03,05-07,09-10]:

[ACTIVE@scheduler2:/var/log] cat /opt/slurm/current/etc/gres.conf | egrep "gpu\[0"
NodeName=gpu[02-10] Name=gpu Type=p100 File=/dev/nvidia[0-1] CPUs=[0-11]
NodeName=gpu[02-10] Name=gpu Type=p100 File=/dev/nvidia[2-3] CPUs=[12-23]
[ACTIVE@scheduler2:/var/log] cat /opt/slurm/current/etc/slurm.conf.d/nodes | egrep "gpu\[0"
NodeName=gpu[02-10] Weight=10403 Feature=skylake,avx512,6136,doubleprecision,pi,p100 RealMemory=185344 Gres=gpu:p100:4

As you might have guessed from the job IDs, these were not the only p100 jobs that user submitted around that time; for example, we saw no errors from:

[ACTIVE@scheduler2:/var/log] sacct -ojobid,Elapsed,Submit,Start,NodeList -j 2211716,2211721
JobID         Elapsed    Submit               Start                NodeList
2211716       15:21:33   2022-09-09T11:05:12  2022-09-11T14:43:10  gpu09
2211716.bat+  15:21:33   2022-09-11T14:43:10  2022-09-11T14:43:10  gpu09
2211716.ext+  15:21:33   2022-09-11T14:43:10  2022-09-11T14:43:10  gpu09
2211721       15:20:49   2022-09-09T11:05:12  2022-09-11T14:43:40  gpu10
2211721.bat+  15:20:49   2022-09-11T14:43:40  2022-09-11T14:43:40  gpu10
2211721.ext+  15:20:49   2022-09-11T14:43:40  2022-09-11T14:43:40  gpu10

[ACTIVE@scheduler2:/var/log] egrep "2211716|2211721" messages-20220918
Sep 11 14:43:10 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211716 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:40 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211721 NodeList=gpu10 #CPUs=1 Partition=pi_ohern
Sep 12 06:04:29 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211721 WEXITSTATUS 0
Sep 12 06:04:29 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211721 done
Sep 12 06:04:43 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211716 WEXITSTATUS 0
Sep 12 06:04:43 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211716 done

versus:

[ACTIVE@scheduler2:/var/log] egrep "2211717|2211718" messages-20220918
Sep 11 14:43:10 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211717 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:40 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211718 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:47 scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job 2211718 dealloc node gpu09 type p100 gres count underflow (0 1)
[the above underflow line repeated 36 times within the same second]
Sep 12 06:03:45 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211718 WEXITSTATUS 0
Sep 12 06:03:45 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211718 done
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211717 WEXITSTATUS 0
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job 2211717 dealloc node gpu09 type p100 gres count underflow (0 1)
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211717 done

I don't have the batch script, but I would assume all four of those jobs were being asked to do the same thing. They all requested the same resources; from sacct:

JobID|NodeList|AllocCPUS|ReqMem|ReqTRES|AllocTRES|
2211716|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211717|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211718|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211721|gpu10|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|

Thank you,
Adam
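To condense this kind of log spam into something easier to correlate with sacct output, a small awk pass can tally the underflow messages per (job, node) pair. A sketch (the `underflow_tally` name is mine; it assumes the exact "job <ID> dealloc node <NODE> ... gres count underflow" wording shown in the logs above):

```shell
# Hypothetical helper: count "gres count underflow" errors per (job, node)
# pair in a slurmctld log read from stdin, e.g.:
#   underflow_tally < /var/log/messages-20220918
underflow_tally() {
  awk '/gres count underflow/ {
    # scan the fields for the "job <ID>" and "node <NODE>" tokens
    for (i = 1; i <= NF; i++) {
      if ($i == "job")  job  = $(i + 1)
      if ($i == "node") node = $(i + 1)
    }
    count[job " " node]++
  } END { for (k in count) print k, count[k] }'
}
```

Run against the messages file above, this would report lines like "2211718 gpu09 36", which can then be matched against job start/end times from sacct.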
Correction: "8.5 to 15.5 hours" should have been "3.5 to 15.5 hours"
Hi,

Our cluster is reporting that nodes are overbooked, which is not actually the case (gres/gpu:v100 shows 7 allocated against a configured 4):

CfgTRES=cpu=40,mem=382000M,billing=40,gres/gpu:v100=4,gres/nvme=3600
AllocTRES=cpu=19,mem=360768M,gres/gpu:v100=7,gres/nvme=200

Should I open a separate bug, or is it fine to reuse this one?

BR,
Tommi
Hi Caden,

Just as an update: since we applied the patch (Sep 22 16:48) to 22.05.2, we are still seeing the "gres count underflow" errors, so these may not be related to the backfill deadlock problem, and we still have a problem.

Thank you,
Adam
Hey Adam and Tommi,

Adam, I am still attempting to reproduce your error messages. Do you have a gres.conf, and if so, could you attach it? Have you ever seen something like this along with your underflow messages?

[2022-09-29T17:51:33.287] error: gres/gpu: job 9897 dealloc of node n15 bad node_offset 0 count is 0

Tommi, go ahead and log another bug. When you do, attach a slurmctld.log for reference.

I will let you know as soon as I have an update.

Caden Ellis
Adam, I see the output from your gres.conf, no need to attach that.
Hi Caden, We have no cases of "dealloc of node" in any of our messages files. Thank you, Adam
Adam,

I haven't been able to reproduce the same errors you are getting. If you can consistently make those error messages appear with a certain workload or job script, could you set the debug level to "debug" (or "debug2", if your cluster isn't very busy at that moment) and also enable the "Gres" debug flag right before running those jobs? This can be done with scontrol. After the run is finished, be sure to revert the debug level and debug flags, as they add extra strain on a production system.

If you cannot reproduce with a specific script, but you can predict a time frame within a day or less in which the errors will happen, setting those flags for that window would also help. If your system is too busy for the higher debug level, the Gres flag alone could provide some hints.

If you can get that done, attach the slurmctld.log here.

Caden Ellis
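The scontrol steps described above can be sketched as below. The `gres_debug` wrapper and the RUN dry-run guard are mine, not Slurm's, and reverting to "info" assumes that is the cluster's normal SlurmctldDebug level (check slurm.conf before relying on that):

```shell
# Hypothetical wrapper around the scontrol commands described above.
# usage: gres_debug on|off
# Set RUN=echo for a dry run; leave RUN unset to actually apply the changes.
gres_debug() {
  case "$1" in
    on)
      $RUN scontrol setdebug debug        # or debug2 on a quiet cluster
      $RUN scontrol setdebugflags +Gres   # enable gres debug messages
      ;;
    off)
      $RUN scontrol setdebug info         # assumes SlurmctldDebug=info normally
      $RUN scontrol setdebugflags -Gres   # disable the gres debug flag
      ;;
  esac
}
```

Typical use would be `gres_debug on` just before the suspect jobs run, then `gres_debug off` once they finish, so the elevated logging is bounded to that window.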
Hello Adam, any update on this? Has it continued to be an issue? If so, please refer to my previous comment.
Hi Caden,

I'm concerned about generating too much log data if we raise the logging level in our larger environment as you suggested: since we don't have a way to reproduce the problem on demand, we can't bound the duration of the enhanced logging. It has also been almost a month since we last saw these errors in the log files. It's possible that user behavior has changed since then, or that the patched 22.05.2 addressed the problem (if not immediately). I think we can close or "hold" the matter for now; if the problem resurfaces, we can reconsider additional debug-level logging.

Thank you,
Adam
Adam, I'll go ahead and close this for now, then. You can reopen it if it becomes an issue again. Caden