Ticket 15726 - Deallocate GRES
Summary: Deallocate GRES
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 22.05.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-01-03 11:34 MST by Paul Edmon
Modified: 2023-05-16 10:32 MDT
CC: 2 users

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (63.92 KB, text/x-matlab)
2023-01-04 07:34 MST, Paul Edmon
Details
topology.conf (4.33 KB, text/x-matlab)
2023-01-04 07:34 MST, Paul Edmon
Details
gres.conf (16 bytes, text/plain)
2023-01-04 07:34 MST, Paul Edmon
Details

Description Paul Edmon 2023-01-03 11:34:15 MST
Not a breaking issue but I wanted to log this in case you haven't seen it. I'm seeing the following error in our slurmctld.log:

Jan  3 13:30:52 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38139223 dealloc node holygpu7c1311 type nvidia_a100_1g.5gb gres count underflow (0 1)

It looks like what happened was that a job on a compute node with nvidia_a100_1g.5gb gres was requeued onto a node that didn't have that type available. The job is now throwing this error. This sort of requeue happens a lot for us, as we have a gpu_requeue partition that contains all our gpu hardware regardless of type. We've only recently started requesting specific types instead of just gres/gpu, for the new MIG feature.

Anyway, it's not causing any problems in the scheduler itself; it's just spewing the error, so I wanted to let you know.
Comment 1 Scott Hilton 2023-01-03 13:56:40 MST
What is your slurm.conf and gres.conf?
Comment 3 Paul Edmon 2023-01-04 07:34:27 MST
Created attachment 28330 [details]
slurm.conf
Comment 4 Paul Edmon 2023-01-04 07:34:41 MST
Created attachment 28332 [details]
topology.conf
Comment 5 Paul Edmon 2023-01-04 07:34:54 MST
Created attachment 28333 [details]
gres.conf
Comment 6 Paul Edmon 2023-01-04 07:35:19 MST
I've uploaded them.
Comment 7 Scott Hilton 2023-01-04 11:20:07 MST
Paul, 

What was the batch request? Specifically, what was the gpu request exactly?

Are there any more details that could help me reproduce the issue?

-Scott
Comment 8 Paul Edmon 2023-01-04 11:24:47 MST
Here is an example:

[root@holy7c22501 ~]# scontrol show job 38200645
JobId=38200645 JobName=aerosynth_batch.sbatch
    UserId=rcloete(62479) GroupId=loeb_lab(34746) MCS_label=N/A
    Priority=2186350 Nice=0 Account=loeb_lab QOS=normal
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    RunTime=00:08:25 TimeLimit=00:30:00 TimeMin=N/A
    SubmitTime=2023-01-04T13:13:09 EligibleTime=2023-01-04T13:13:09
    AccrueTime=2023-01-04T13:13:09
    StartTime=2023-01-04T13:13:30 EndTime=2023-01-04T13:43:30 Deadline=N/A
    PreemptEligibleTime=2023-01-04T13:13:30 PreemptTime=None
    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-01-04T13:13:30 Scheduler=Backfill
    Partition=gpu_requeue AllocNode:Sid=seasdgx104:7006
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=holygpu8a29104
    BatchHost=holygpu8a29104
    NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
    TRES=cpu=8,mem=16000M,node=1,billing=110,gres/gpu=1,gres/gpu:nvidia_a100_1g.10gb=1
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
    MinCPUsNode=8 MinMemoryNode=16000M MinTmpDiskNode=0
    Features=(null) DelayBoot=00:00:00
    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
    Command=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/aerosynth_batch.sbatch
    WorkDir=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170
    StdErr=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/stderr.txt
    StdIn=/dev/null
    StdOut=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/stdout.txt
    Power=
    MemPerTres=gpu:100
    TresPerNode=gres:gpu:1

[root@holy7c22501 ~]# sacct -B -j 38200645
Batch Script for 38200645
--------------------------------------------------------------------------------
#!/bin/bash


declare -a types=("light_airplane" "standard_drone")

#declare -a types=("standard_airplane" "helicopter" "military_drone" "balloon" "blimp" "bird" "hotair_balloon")

#declare -a types=("light_airplane")
#declare -a types=("standard_airplane")
#declare -a types=("standard_drone")
#declare -a types=("military_drone")
#declare -a types=("helicopter")
#declare -a types=("balloon")
#declare -a types=("blimp")
#declare -a types=("hotair_balloon")
#declare -a types=("bird")
#declare -a types=("blank")


declare param_file="/n/holylabs/LABS/loeb_lab/Users/rcloete/dev/AeroSynth/src/2.0/configs/param_vis_near.yml"
declare output_dir="/n/holylabs/LABS/loeb_lab/Users/rcloete/data/raw/synthentic/aerosynth/wide_field/vis/near/100/"

mkdir -p $output_dir
cp $param_file $output_dir
#echo "Processing: ${SLURM_ARRAY_TASK_ID}"


for i in {1..10000}
do
   for model_type in "${types[@]}"
   do
     #if [ "$(ls -1q $output_dir$model_type/*.png | wc -l)" -lt 2000 ]; then
     /n/holylabs/LABS/loeb_lab/Users/rcloete/apps/blender-3.3.1-linux-x64/blender --background --python /n/holylabs/LABS/loeb_lab/Users/rcloete/dev/AeroSynth/src/2.0/capture_sky_rich.py -- $param_file $model_type $output_dir
     #fi
   done
done

[root@holy-slurm02 log]# grep 38200645 messages
Jan  4 13:13:09 holy-slurm02 slurmctld[148001]: _slurm_rpc_submit_batch_job: JobId=38200645 InitPrio=2186350 usec=2901
Jan  4 13:13:30 holy-slurm02 slurmctld[148001]: sched/backfill: _start_job: Started JobId=38200645 in gpu_requeue on holygpu8a29104
Jan  4 13:13:30 holy-slurm02 slurmctld: slurmctld: sched/backfill: _start_job: Started JobId=38200645 in gpu_requeue on holygpu8a29104
Jan  4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)
Jan  4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)
Jan  4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job 38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count underflow (0 1)

[root@holy-slurm02 log]# scontrol show node holygpu8a29104
NodeName=holygpu8a29104 Arch=x86_64 CoresPerSocket=32
    CPUAlloc=45 CPUEfctv=64 CPUTot=64 CPULoad=15.20
    AvailableFeatures=intel,holyhdr,icelake,avx,avx2,avx512,gpu,a100-mig,cc8.0
    ActiveFeatures=intel,holyhdr,icelake,avx,avx2,avx512,gpu,a100-mig,cc8.0
    Gres=gpu:nvidia_a100_3g.39gb:4(S:0-1),gpu:nvidia_a100_1g.10gb:16(S:0-1)
    NodeAddr=holygpu8a29104 NodeHostName=holygpu8a29104 Version=22.05.6
    OS=Linux 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021
    RealMemory=515458 AllocMem=361904 FreeMem=273154 Sockets=2 Boards=1
    MemSpecLimit=4096
    State=MIXED ThreadsPerCore=1 TmpDisk=405861 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=arguelles_delgado_gpu,gpu_requeue,serial_requeue
    BootTime=2022-12-08T09:19:14 SlurmdStartTime=2022-12-19T10:21:12
    LastBusyTime=2023-01-03T11:49:24
    CfgTRES=cpu=64,mem=515458M,billing=2186,gres/gpu=20,gres/gpu:nvidia_a100_1g.10gb=16,gres/gpu:nvidia_a100_3g.39gb=4
    AllocTRES=cpu=45,mem=361904M,gres/gpu=20,gres/gpu:nvidia_a100_1g.10gb=8,gres/gpu:nvidia_a100_3g.39gb=2
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

-Paul Edmon-

Comment 9 Scott Hilton 2023-03-23 15:46:08 MDT
Paul,

Sorry for the delayed update; I was unable to reproduce the issue.

We did change some things in 23.02 that may fix it. If you upgrade and see this error again, please let us know.

Let me know if you have any questions.

-Scott
Comment 10 Scott Hilton 2023-05-15 11:30:10 MDT
Paul,

Are you still seeing this issue? Have you upgraded to 23.02, and if so, which point release?

We fixed another similar issue in 23.02.2 which may be related to this issue. See bug 16121.

-Scott
Comment 11 Paul Edmon 2023-05-15 11:47:43 MDT
We are still on 22.05.7. We won't be upgrading to 23.02 until September, as we are currently changing our operating system to Rocky 8 and want to keep the same version of Slurm through the transition.

-Paul Edmon-

Comment 12 Scott Hilton 2023-05-16 10:32:45 MDT
Paul,

Thanks for the update. If you still see this issue after upgrading, please let us know.

For now I will close this ticket.

-Scott