| Summary: | Deallocate GRES | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | slurmctld | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cinek, scott |
| Version: | 22.05.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=15142 | | |
| Site: | Harvard University | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm.conf, topology.conf, gres.conf | | |
Description
Paul Edmon 2023-01-03 11:34:15 MST

Scott Hilton:
What is your slurm.conf and gres.conf?

Paul Edmon:
Created attachment 28330 [details] slurm.conf
Created attachment 28332 [details] topology.conf
Created attachment 28333 [details] gres.conf
I've uploaded them.

Scott Hilton:
Paul,
What was the batch request? Specifically, what was the GPU request exactly? Are there any more details that could help me reproduce the issue?
-Scott

Paul Edmon:
Here is an example:
[root@holy7c22501 ~]# scontrol show job 38200645
JobId=38200645 JobName=aerosynth_batch.sbatch
UserId=rcloete(62479) GroupId=loeb_lab(34746) MCS_label=N/A
Priority=2186350 Nice=0 Account=loeb_lab QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:08:25 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2023-01-04T13:13:09 EligibleTime=2023-01-04T13:13:09
AccrueTime=2023-01-04T13:13:09
StartTime=2023-01-04T13:13:30 EndTime=2023-01-04T13:43:30 Deadline=N/A
PreemptEligibleTime=2023-01-04T13:13:30 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-01-04T13:13:30
Scheduler=Backfill
Partition=gpu_requeue AllocNode:Sid=seasdgx104:7006
ReqNodeList=(null) ExcNodeList=(null)
NodeList=holygpu8a29104
BatchHost=holygpu8a29104
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=16000M,node=1,billing=110,gres/gpu=1,gres/gpu:nvidia_a100_1g.10gb=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryNode=16000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/aerosynth_batch.sbatch
WorkDir=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170
StdErr=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/stderr.txt
StdIn=/dev/null
StdOut=/n/home08/rcloete/fasrc/data/sys/myjobs/projects/default/170/stdout.txt
Power=
MemPerTres=gpu:100
TresPerNode=gres:gpu:1
[root@holy7c22501 ~]# sacct -B -j 38200645
Batch Script for 38200645
--------------------------------------------------------------------------------
#!/bin/bash
declare -a types=("light_airplane" "standard_drone")
#declare -a types=("standard_airplane" "helicopter" "military_drone" "balloon" "blimp" "bird" "hotair_balloon")
#declare -a types=("light_airplane")
#declare -a types=("standard_airplane")
#declare -a types=("standard_drone")
#declare -a types=("military_drone")
#declare -a types=("helicopter")
#declare -a types=("balloon")
#declare -a types=("blimp")
#declare -a types=("hotair_balloon")
#declare -a types=("bird")
#declare -a types=("blank")
declare param_file="/n/holylabs/LABS/loeb_lab/Users/rcloete/dev/AeroSynth/src/2.0/configs/param_vis_near.yml"
declare output_dir="/n/holylabs/LABS/loeb_lab/Users/rcloete/data/raw/synthentic/aerosynth/wide_field/vis/near/100/"
mkdir -p $output_dir
cp $param_file $output_dir
#echo "Processing: ${SLURM_ARRAY_TASK_ID}"
for i in {1..10000}
do
for model_type in "${types[@]}"
do
#if [ "$(ls -1q $output_dir$model_type/*.png | wc -l)" -lt 2000 ]; then
/n/holylabs/LABS/loeb_lab/Users/rcloete/apps/blender-3.3.1-linux-x64/blender \
    --background --python \
    /n/holylabs/LABS/loeb_lab/Users/rcloete/dev/AeroSynth/src/2.0/capture_sky_rich.py \
    -- $param_file $model_type $output_dir
#fi
done
done
[root@holy-slurm02 log]# grep 38200645 messages
Jan 4 13:13:09 holy-slurm02 slurmctld[148001]:
_slurm_rpc_submit_batch_job: JobId=38200645 InitPrio=2186350 usec=2901
Jan 4 13:13:30 holy-slurm02 slurmctld[148001]: sched/backfill:
_start_job: Started JobId=38200645 in gpu_requeue on holygpu8a29104
Jan 4 13:13:30 holy-slurm02 slurmctld: slurmctld: sched/backfill:
_start_job: Started JobId=38200645 in gpu_requeue on holygpu8a29104
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job
38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count
underflow (0 1)
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job
38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count
underflow (0 1)
Jan 4 13:14:01 holy-slurm02 slurmctld: slurmctld: error: gres/gpu: job
38200645 dealloc node holygpu8a29104 type nvidia_a100_1g.10gb gres count
underflow (0 1)
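To make the log message concrete: a minimal sketch (not Slurm source; the function name `dealloc_gres` is hypothetical) of the accounting check behind the "gres count underflow (0 1)" error. The error fires when slurmctld tries to release more of a typed GRES than its bookkeeping says is currently allocated on the node, and the "(0 1)" means 0 allocated but 1 requested for release.

```python
# Hypothetical sketch of typed-GRES deallocation accounting; Slurm's real
# implementation lives in its gres plugin and differs in detail.
def dealloc_gres(alloc_counts, gres_type, count):
    """Subtract `count` of `gres_type`; flag underflow instead of going negative."""
    have = alloc_counts.get(gres_type, 0)
    if count > have:
        # Analogous to slurmctld's "type <t> gres count underflow (<have> <count>)"
        print(f"error: type {gres_type} gres count underflow ({have} {count})")
        alloc_counts[gres_type] = 0
    else:
        alloc_counts[gres_type] = have - count
    return alloc_counts[gres_type]

counts = {"nvidia_a100_1g.10gb": 0}
dealloc_gres(counts, "nvidia_a100_1g.10gb", 1)  # takes the underflow path, as in the log
```

The counter is clamped at zero rather than allowed to go negative, which is why the error can repeat on later deallocations without the count drifting further.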
[root@holy-slurm02 log]# scontrol show node holygpu8a29104
NodeName=holygpu8a29104 Arch=x86_64 CoresPerSocket=32
CPUAlloc=45 CPUEfctv=64 CPUTot=64 CPULoad=15.20
AvailableFeatures=intel,holyhdr,icelake,avx,avx2,avx512,gpu,a100-mig,cc8.0
ActiveFeatures=intel,holyhdr,icelake,avx,avx2,avx512,gpu,a100-mig,cc8.0
Gres=gpu:nvidia_a100_3g.39gb:4(S:0-1),gpu:nvidia_a100_1g.10gb:16(S:0-1)
NodeAddr=holygpu8a29104 NodeHostName=holygpu8a29104 Version=22.05.6
OS=Linux 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021
RealMemory=515458 AllocMem=361904 FreeMem=273154 Sockets=2 Boards=1
MemSpecLimit=4096
State=MIXED ThreadsPerCore=1 TmpDisk=405861 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=arguelles_delgado_gpu,gpu_requeue,serial_requeue
BootTime=2022-12-08T09:19:14 SlurmdStartTime=2022-12-19T10:21:12
LastBusyTime=2023-01-03T11:49:24
CfgTRES=cpu=64,mem=515458M,billing=2186,gres/gpu=20,gres/gpu:nvidia_a100_1g.10gb=16,gres/gpu:nvidia_a100_3g.39gb=4
AllocTRES=cpu=45,mem=361904M,gres/gpu=20,gres/gpu:nvidia_a100_1g.10gb=8,gres/gpu:nvidia_a100_3g.39gb=2
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
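One way to inspect the node output above is to parse its TRES strings and compare the generic `gres/gpu` total against the sum of the typed `gres/gpu:<type>` counts; in the AllocTRES line above, the generic count (20) exceeds the typed sum (8 + 2 = 10), which is the kind of bookkeeping mismatch the underflow errors suggest. The helper below is a hedged illustration, not a Slurm tool.

```python
# Illustrative parser for the comma-separated TRES strings printed by
# `scontrol show node` (e.g. "cpu=45,mem=361904M,gres/gpu=20,...").
def parse_tres(tres):
    out = {}
    for item in tres.split(","):
        key, _, val = item.partition("=")
        try:
            out[key] = int(val.rstrip("M"))  # drop the "M" suffix on memory values
        except ValueError:
            pass  # skip entries without a plain numeric value
    return out

alloc = parse_tres("cpu=45,mem=361904M,gres/gpu=20,"
                   "gres/gpu:nvidia_a100_1g.10gb=8,gres/gpu:nvidia_a100_3g.39gb=2")
typed_total = sum(v for k, v in alloc.items() if k.startswith("gres/gpu:"))
print(alloc["gres/gpu"], typed_total)  # generic GPU count vs. sum of typed counts
```

Run against the AllocTRES string from this report, the two numbers disagree, consistent with the dealloc underflow messages in the controller log.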
-Paul Edmon-
Scott Hilton:
Paul,
Sorry for the delayed update; I was unable to reproduce the issue. We did change some things in 23.02 that may fix it. If you upgrade and see this error again, please let us know. Let me know if you have any questions.
-Scott

Scott Hilton:
Paul,
Are you still seeing this issue? Have you upgraded to 23.02, and if so, which point release? We fixed another similar issue in 23.02.2 which may be related to this one. See bug 16121.
-Scott

Paul Edmon:
We are still on 22.05.7. We won't be upgrading to 23.02 until September, as we are currently changing our operating system to Rocky 8 and want to keep the same version of Slurm through the transition.
-Paul Edmon-

Scott Hilton:
Paul,
Thanks for letting us know. If you still see this issue after upgrading, please let us know. For now I will close this bug.
-Scott