| Summary: | scheduler isn't tracking allocated resources | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Adam <adam.munro> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | csc-slurm-tickets, marshall |
| Version: | 22.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=15142 | | |
| Site: | Yale | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |

Description

Adam 2022-09-06 08:50:14 MDT

Please see https://bugs.schedmd.com/show_bug.cgi?id=14813 for our configuration.

---

Thanks for logging the bug, Adam. Can you give me a reproducer? What are you doing to submit the jobs?

---

Hi Caden,

I'm sorry, but I cannot give you a reproducer at this time. I've attempted using the following as a reproducer (essentially a copy of the sbatch script used by the researcher whose jobs triggered the errors):

```bash
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=ralloc_test
#SBATCH --gpus=1
#SBATCH -C "v100|a100"
#SBATCH --mem=10G
#SBATCH --time=24:00:00
#SBATCH --mail-type=END
#SBATCH --nodes=1
module purge
module load CUDA
echo "hello"; sleep 300;
```

The job did not result in a scheduler error or any apparent resource-tracking problem. The manifestation may depend on the running state of slurmctld, which will be far more difficult for us to reproduce.

Thanks,
Adam

---

Adam, I am going to close this bug out, since you are dealing with controller issues with Skyler right now and this bug may be an unrelated side effect of that (and since you don't have a reproducer). If that issue gets resolved and you still see this problem, or you get a reproducer, please reopen this bug.

Caden

---

Adam, I found some more info on this type of bug, so I am reopening it. Have you touched the PriorityCalcPeriod parameter at all? Is it less than 60 seconds? What are the durations of the jobs that are triggering the underflow message?

Caden

---

Hi Caden,

Sorry for not getting back to you sooner. PriorityCalcPeriod is set to 60 seconds. I'm not sure of the average duration, but in the most recent cases the range was 8.5 to 15.5 hours.
Here's a list of them (all from the same user; note that not all jobs submitted by this user resulted in the error):

```
JobID        Elapsed     Submit               Start
------------ ----------  -------------------  -------------------
2211717      15:23:22    2022-09-09T11:05:12  2022-09-11T14:43:10  -- 1 error occurred on Sep 12 06:06:32
2211718      15:20:05    2022-09-09T11:05:12  2022-09-11T14:43:40  -- 36 errors occurred within the same second on Sep 11 14:43:47
2211719      15:21:05    2022-09-09T11:05:12  2022-09-11T14:43:40  -- 36 errors occurred within the same second on Sep 11 14:43:47
2211720      15:21:52    2022-09-09T11:05:12  2022-09-11T14:43:40  -- 1 error occurred on Sep 12 06:05:32
2211727      10:44:24    2022-09-09T11:05:13  2022-09-11T22:49:41  -- 1 error occurred on Sep 12 09:34:05
2211730      09:05:54    2022-09-09T11:05:13  2022-09-12T00:28:16  -- 1 error occurred on Sep 12 09:34:10
2211737      08:38:58    2022-09-09T11:05:13  2022-09-12T00:55:22  -- 1 error occurred on Sep 12 09:34:20
2211738      08:32:49    2022-09-09T11:05:13  2022-09-12T01:01:13  -- 1 error occurred on Sep 12 09:34:02
2211740      07:49:32    2022-09-09T11:05:13  2022-09-12T01:44:44  -- 1 error occurred on Sep 12 09:34:16
2211744      03:31:50    2022-09-09T11:05:13  2022-09-12T06:02:49  -- 1 error occurred on Sep 12 09:34:39
2211745      03:30:33    2022-09-09T11:05:13  2022-09-12T06:03:47  -- 1 error occurred on Sep 12 09:34:20
```

Every single error in the above cases was of the format:

```
<DATE> scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job <JOBID> dealloc node <NODE> type p100 gres count underflow (0 1)
```

where <DATE> and <JOBID> are as defined above, and <NODE> ranged over gpu[03,05-07,09-10]:

```
[ACTIVE@scheduler2:/var/log] cat /opt/slurm/current/etc/gres.conf | egrep "gpu\[0"
NodeName=gpu[02-10] Name=gpu Type=p100 File=/dev/nvidia[0-1] CPUs=[0-11]
NodeName=gpu[02-10] Name=gpu Type=p100 File=/dev/nvidia[2-3] CPUs=[12-23]
[ACTIVE@scheduler2:/var/log] cat /opt/slurm/current/etc/slurm.conf.d/nodes | egrep "gpu\[0"
NodeName=gpu[02-10] Weight=10403 Feature=skylake,avx512,6136,doubleprecision,pi,p100 RealMemory=185344 Gres=gpu:p100:4
```

As you might have guessed from the job IDs, these were not the only p100 jobs that user submitted at around that time; for example, we saw no errors from:

```
[ACTIVE@scheduler2:/var/log] sacct -ojobid,Elapsed,Submit,Start,NodeList -j 2211716,2211721
JobID        Elapsed     Submit               Start                NodeList
------------ ----------  -------------------  -------------------  ---------------
2211716      15:21:33    2022-09-09T11:05:12  2022-09-11T14:43:10  gpu09
2211716.bat+ 15:21:33    2022-09-11T14:43:10  2022-09-11T14:43:10  gpu09
2211716.ext+ 15:21:33    2022-09-11T14:43:10  2022-09-11T14:43:10  gpu09
2211721      15:20:49    2022-09-09T11:05:12  2022-09-11T14:43:40  gpu10
2211721.bat+ 15:20:49    2022-09-11T14:43:40  2022-09-11T14:43:40  gpu10
2211721.ext+ 15:20:49    2022-09-11T14:43:40  2022-09-11T14:43:40  gpu10

[ACTIVE@scheduler2:/var/log] egrep "2211716|2211721" messages-20220918
Sep 11 14:43:10 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211716 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:40 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211721 NodeList=gpu10 #CPUs=1 Partition=pi_ohern
Sep 12 06:04:29 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211721 WEXITSTATUS 0
Sep 12 06:04:29 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211721 done
Sep 12 06:04:43 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211716 WEXITSTATUS 0
Sep 12 06:04:43 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211716 done
```

versus:

```
[ACTIVE@scheduler2:/var/log] egrep "2211717|2211718" messages-20220918
Sep 11 14:43:10 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211717 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:40 scheduler2.grace.hpc.yale.internal slurmctld[91791]: sched: Allocate JobId=2211718 NodeList=gpu09 #CPUs=1 Partition=pi_ohern
Sep 11 14:43:47 scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job 2211718 dealloc node gpu09 type p100 gres count underflow (0 1)
[the line above is repeated 36 times, all within the same second]
Sep 12 06:03:45 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211718 WEXITSTATUS 0
Sep 12 06:03:45 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211718 done
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211717 WEXITSTATUS 0
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: error: gres/gpu: job 2211717 dealloc node gpu09 type p100 gres count underflow (0 1)
Sep 12 06:06:32 scheduler2.grace.hpc.yale.internal slurmctld[91791]: _job_complete: JobId=2211717 done
```

I don't have the batch script, but I would assume all four of those jobs were being asked to do the same thing. They all requested the same resource capacity, from sacct:

```
JobID|NodeList|AllocCPUS|ReqMem|ReqTRES|AllocTRES|
2211716|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211717|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211718|gpu09|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
2211721|gpu10|1|5G|cpu=1,gres/gpu=1,mem=5G,node=1|cpu=1,gres/gpu:p100=1,gres/gpu=1,mem=5G,node=1|
```

Thank you,
Adam

---

Correction: "8.5 to 15.5 hours" should have been "3.5 to 15.5 hours".

---

Hi,

Our cluster is reporting that nodes are overbooked, which is not actually the case (only gpu:v100=4 is configured, yet gpu:v100=7 is shown as allocated):

```
CfgTRES=cpu=40,mem=382000M,billing=40,gres/gpu:v100=4,gres/nvme=3600
AllocTRES=cpu=19,mem=360768M,gres/gpu:v100=7,gres/nvme=200
```

Should I open a separate bug, or is it fine to reuse this one?

BR,
Tommi

---

Hi Caden,

Just as an update: since we applied the patch (Sep 22 16:48) to 22.05.2 we still see the "gres count underflow" errors occur, so these may not be related to the backfill deadlock problem, and we still have a problem.

Thank you,
Adam

---

Hey Adam and Tommi,

Adam, I am still attempting to reproduce your error messages. Do you have a gres.conf, and if so, could you attach it? Have you ever seen something like this along with your underflow messages?

```
[2022-09-29T17:51:33.287] error: gres/gpu: job 9897 dealloc of node n15 bad node_offset 0 count is 0
```

Tommi, go ahead and log another bug. When you log it, attach a slurmctld.log for reference.

I will let you know as soon as I have an update.

Caden Ellis

---

Adam, I see the output from your gres.conf above; no need to attach it.

---

Hi Caden,

We have no cases of "dealloc of node" in any of our messages files.
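Checking a messages file for both error signatures discussed in this ticket (the "gres count underflow" messages and the "bad node_offset" variant) is easy to script. A minimal sketch, assuming a log in the format quoted above; the sample lines, shortened hostname, and `/tmp` path are hypothetical:

```shell
#!/bin/sh
# Tally "gres count underflow" errors per JobID and count any "bad node_offset"
# messages. The sample below mimics the slurmctld log excerpts in this ticket.
cat > /tmp/messages_sample <<'EOF'
Sep 11 14:43:47 scheduler2 slurmctld[91791]: error: gres/gpu: job 2211718 dealloc node gpu09 type p100 gres count underflow (0 1)
Sep 11 14:43:47 scheduler2 slurmctld[91791]: error: gres/gpu: job 2211718 dealloc node gpu09 type p100 gres count underflow (0 1)
Sep 12 06:06:32 scheduler2 slurmctld[91791]: error: gres/gpu: job 2211717 dealloc node gpu09 type p100 gres count underflow (0 1)
EOF

echo "underflow errors per job:"
# Pull the JobID that follows the literal word "job" and count occurrences.
grep 'gres count underflow' /tmp/messages_sample |
  sed 's/.* job \([0-9]*\) .*/\1/' | sort | uniq -c

echo "bad node_offset occurrences:"
grep -c 'bad node_offset' /tmp/messages_sample || true
```

Pointing the same pipeline at a real rotated log (e.g. messages-20220918) gives a quick per-job error census without raising any debug levels.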
Thank you,
Adam

---

Adam,

I haven't been able to reproduce the errors you are getting. If you can consistently make those same error messages appear with a certain workload or job script, could you set the debug level to "debug" (or debug2, if your cluster isn't very busy at that moment) and also enable the "gres" debug flag right before running those jobs? This can be done with scontrol. After the run is finished, be sure to revert the debug level and debug flags, as they add extra strain on a production system. If you cannot reproduce with a specific script, but can predict a window of a day or less in which the errors will appear, setting those flags for that window would also help; if your system is too busy for the higher debug level, the gres flag alone could provide some hints. If you can get that done, attach the slurmctld.log here.

Caden Ellis

---

Hello Adam, any update on this? Has it continued to be an issue? If so, please refer to my previous comment.

---

Hi Caden,

I have some concerns about generating too much log data by raising the logging level in our larger environment, as you suggested: we have no easy way to reproduce the problem, so we can't bound the duration of the enhanced logging. It has been almost a month since we last saw these errors in the log files. It's possible that user behavior has changed since then, or that the patched 22.05.2 addressed the problem (if not immediately). I think we can close or "hold" the matter for now; if the problem resurfaces, we can reconsider additional debug-level logging.

Thank you,
Adam

---

Adam, I'll go ahead and close this for now, then. You can reopen it if it becomes an issue again.

Caden
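Caden's suggestion to temporarily raise the log level and enable the gres debug flag can be wrapped so the elevated settings are always reverted. A sketch under assumptions: the `DRY_RUN` guard and the revert targets (the `info` level, removing the flag with `-gres`) are illustrative choices, not part of the ticket; `scontrol setdebug` and `scontrol setdebugflags` are the knobs Caden refers to.

```shell
#!/bin/sh
# Raise slurmctld logging around a reproduction attempt, then revert.
# DRY_RUN=1 (the default here) only prints the commands, so the sequence can
# be rehearsed off-cluster; set DRY_RUN=0 to actually call scontrol.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run scontrol setdebug debug        # or debug2 if the cluster is quiet
run scontrol setdebugflags +gres   # extra GRES allocation/deallocation detail

# ... submit the suspect workload here and wait for it to finish ...

run scontrol setdebug info         # revert to the normal level
run scontrol setdebugflags -gres
```

Keeping the raise and the revert in one script addresses Adam's concern about bounding the duration of enhanced logging: the flags come back down as soon as the window closes.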