Not a breaking issue, but I wanted to log this in case you haven't seen it. I'm seeing the following error in our slurmctld.log:

Jan 3 13:30:52 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38139223 dealloc node holygpu7c1311 type nvidia_a100_1g.5gb gres count underflow (0 1)

It looks like what happened is that a job on a compute node with nvidia_a100_1g.5gb gres was requeued onto a node that didn't have that type available. The job is now throwing this error. This sort of requeue happens a lot for us, as we have a gpu_requeue partition that contains all our GPU hardware regardless of type. We've only recently started being specific instead of just using gres/gpu, for the new MIG feature. Anyway, it's not causing any problems in the scheduler itself; it's just spewing the error, so I wanted to let you know.
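For what it's worth, the two numbers in the message appear to be the node's currently allocated count for that GRES type and the count the job is trying to release. Here is a minimal Python sketch of that accounting; this is not Slurm's actual code (which lives in the C gres plugin), and the function name and message format are illustrative only:

```python
def dealloc_typed_gres(node_alloc: dict, gres_type: str, count: int) -> list:
    """Decrement a node's per-type GRES allocation counter.

    Simplified sketch of the accounting that produces the
    'gres count underflow' error: if a job tries to release more
    of a typed GRES than the node currently has marked allocated
    (e.g. after a requeue onto a node without that type), the
    counter is clamped at zero and an error is recorded.
    """
    errors = []
    have = node_alloc.get(gres_type, 0)
    if count > have:
        errors.append(
            f"error: gres/gpu: dealloc type {gres_type} "
            f"gres count underflow ({have} {count})"
        )
        node_alloc[gres_type] = 0  # clamp instead of going negative
    else:
        node_alloc[gres_type] = have - count
    return errors


# A node that never had nvidia_a100_1g.5gb allocated for this job:
alloc = {"nvidia_a100_3g.39gb": 2}
msgs = dealloc_typed_gres(alloc, "nvidia_a100_1g.5gb", 1)
print(msgs[0])  # mirrors the '(0 1)' in the slurmctld log
```

If this reading is right, it would explain why the scheduler keeps working: the counter is clamped rather than corrupted, and only the error message is emitted.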
What is your slurm.conf and gres.conf?
Created attachment 28330 [details] slurm.conf
Created attachment 28332 [details] topology.conf
Created attachment 28333 [details] gres.conf
I've uploaded them.
Paul,

What was the batch request? Specifically, what was the GPU request exactly?

Are there any more details that could help me reproduce the issue?

-Scott
Here is an example:

[root@holy7c22501 ~]# scontrol show job 38200645
JobId=38200645 JobName=aerosynth_batch.sbatch
   UserId=rcloete(62479) GroupId=loeb_lab(34746) MCS_label=N/A
   Priority=2186350 Nice=0 Account=loeb_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:08:25 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2023-01-04T13:13:09 EligibleTime=2023-01-04T13:13:09
   AccrueTime=2023-01-04T13:13:09
   StartTime=2023-01-04T13:13:30 EndTime=2023-01-04T13:43:30 Deadline=N/A
   PreemptEligibleTime=2023-01-04T13:13:30 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-01-04T13:13:30 Scheduler=Backfill
   Partition=gpu_requeue AllocNode:Sid=seasdgx104:7006
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=holygpu8a29104
   BatchHost=holygpu8a29104
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=16000M,node=1,billing=110,gres/gpu=1,gres/gpu:nvidia_a100_1g.10gb=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=16000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/aerosynth_batch.sbatch
   WorkDir=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170
   StdErr=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/stderr.txt
   StdIn=/dev/null
   StdOut=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/stdout.txt
   Power=
   MemPerTres=gpu:100
   TresPerNode=gres:gpu:1

[root@holy7c22501 ~]# sacct -B -j 38200645
Batch Script for 38200645
--------------------------------------------------------------------------------
#!/bin/bash

declare -a types=("light_airplane" "standard_drone")
#declare -a types=("standard_airplane" "helicopter" "military_drone" "balloon" "blimp" "bird" "hotair_balloon")
#declare -a types=("light_airplane")
#declare -a types=("standard_airplane")
#declare -a types=("standard_drone")
#declare -a types=("military_drone")
#declare -a types=("helicopter")
#declare -a types=("balloon")
#declare -a types=("blimp")
#declare -a types=("hotair_balloon")
#declare -a types=("bird")
#declare -a types=("blank")

declare param_file="/n/holylabs/LABS/loeb_lab/Users/rcloete/dev/AeroSynth/src/2.0/configs/param_vis_near.yml"
declare output_dir="/n/holylabs/LABS/loeb_lab/Users/rcloete/data/raw/synthentic/aerosynth/wide_field/vis/near/100/"

mkdir -p $output_dir
cp $param_file $output_dir

#echo "Processing: ${SLURM_ARRAY_TASK_ID}"

for i in {1..10000}
do
    for model_type in "${types[@]}"
    do
        #if [ "$(ls -1q $output_dir$model_type/*.png | wc -l)" -lt 2000 ]; then
        /n/holylabs/LABS/loeb_lab/Users/rcloete/apps/blender-3.3.1-linux-x64/blender --background --python /n/holylabs/LABS/loeb_lab/Users/rcloete/dev/AeroSynth/src/2.0/capture_sky_rich.py -- $param_file $model_type $output_dir
        #fi
    done
done

[root@holy-slurm02 log]# grep 38200645 messages
Jan 4 13:13:09 holy-slurm02 slurmctld[148001]: _slurm_rpc_submit_batch_job: JobId=38200645 InitPrio=2186350 usec=2901
Jan 4 13:13:30 holy-slurm02 slurmctld[148001]: sched/backfill: _start_job: Started JobId=38200645 in gpu_requeue on holygpu8a29104
Jan 4 13:13:30 holy-slurm02 slurmctld: slurmctld: sched/backfill: _start_job: Started JobId=38200645 in gpu_requeue on holygpu8a29104
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)

[root@holy-slurm02 log]# scontrol show node holygpu8a29104
NodeName=holygpu8a29104 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=45 CPUEfctv=64 CPUTot=64 CPULoad=15.20
   AvailableFeatures=intel,holyhdr,icelake,avx,avx2,avx512,gpu,a100-mig,cc8.0
   ActiveFeatures=intel,holyhdr,icelake,avx,avx2,avx512,gpu,a100-mig,cc8.0
   Gres=gpu:nvidia_a100_3g.39gb:4(S:0-1),gpu:nvidia_a100_1g.10gb:16(S:0-1)
   NodeAddr=holygpu8a29104 NodeHostName=holygpu8a29104 Version=22.05.6
   OS=Linux 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021
   RealMemory=515458 AllocMem=361904 FreeMem=273154 Sockets=2 Boards=1
   MemSpecLimit=4096
   State=MIXED ThreadsPerCore=1 TmpDisk=405861 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=arguelles_delgado_gpu,gpu_requeue,serial_requeue
   BootTime=2022-12-08T09:19:14 SlurmdStartTime=2022-12-19T10:21:12
   LastBusyTime=2023-01-03T11:49:24
   CfgTRES=cpu=64,mem=515458M,billing=2186,gres/gpu=20,gres/gpu:nvidia_a100_1g.10gb=16,gres/gpu:nvidia_a100_3g.39gb=4
   AllocTRES=cpu=45,mem=361904M,gres/gpu=20,gres/gpu:nvidia_a100_1g.10gb=8,gres/gpu:nvidia_a100_3g.39gb=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

-Paul Edmon-
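In case it helps gauge how widespread this is, here is a quick shell sketch for tallying the distinct job IDs that hit the underflow error. To be self-contained it writes a sample of the log lines above to /tmp; on a real controller you would run the grep/sed pipeline against the actual messages or slurmctld.log file instead (the sample path is purely illustrative):

```shell
#!/bin/sh
# Write a small sample of slurmctld log lines (taken from this report)
# so the pipeline below has something to chew on.
cat > /tmp/sample_slurmctld.log <<'EOF'
Jan 3 13:30:52 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38139223 dealloc node holygpu7c1311 type nvidia_a100_1g.5gb gres count underflow (0 1)
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)
EOF

# Keep only underflow lines, extract the job ID, and de-duplicate.
grep 'gres count underflow' /tmp/sample_slurmctld.log |
  sed -n 's/.*job \([0-9]*\) dealloc.*/\1/p' |
  sort -u
```

On the sample above this prints the two affected job IDs (38139223 and 38200645), one per line.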
Paul,

Sorry for the delayed update. I was unable to reproduce the issue. We did change some things in 23.02 that may fix it; if you upgrade and see this error again, please let us know.

Let me know if you have any questions.

-Scott
Paul,

Are you still seeing this issue? Have you upgraded to 23.02, and if so, which point release?

We fixed another similar issue in 23.02.2 which may be related to this issue. See bug 16121.

-Scott
We are still on 22.05.7. We won't be upgrading to 23.02 until September, as we are currently changing our operating system to Rocky 8 and want to keep the same version of Slurm through the transition.

-Paul Edmon-
Paul, Thanks for letting us know. If you still see this issue after upgrading, please let us know. For now I will close this bug. -Scott