Ticket 12943

Summary: A no_consume GRES added in job_submit.lua has non-zero value
Product: Slurm Reporter: Trey Dockendorf <tdockendorf>
Component: Scheduling Assignee: Chad Vizino <chad>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: chad, troy
Version: 21.08.4   
Hardware: Linux   
OS: Linux   
Site: Ohio State OSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: gres.conf
slurm.conf
job_submit_lib.lua
osc_common.lua
job_submit.lua
cli_filter.lua
localize lua scripts patch

Description Trey Dockendorf 2021-12-01 15:28:02 MST
Created attachment 22478 [details]
gres.conf

We use job_submit.lua to add a GPFS GRES based on various attributes of a job, such as whether the filesystem path appears in the job script or whether the submit directory is on GPFS.  When I submit a job that meets this condition, the GRES added is not 0; notice how TRES= below has gres/gpfs:ess=2.
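For context, the kind of logic involved looks roughly like this (a minimal sketch only, not our actual job_submit.lua, which is attached; field names such as job_desc.work_dir, job_desc.script, and job_desc.tres_per_node follow the job_submit/lua API, but the real script's behavior differs):

```lua
-- Sketch: append a no_consume gpfs GRES when the job touches a GPFS path.
function slurm_job_submit(job_desc, part_list, submit_uid)
   -- plain (non-pattern) substring search for a GPFS mount point
   local function uses_gpfs(s)
      return s ~= nil and string.find(s, "/fs/ess", 1, true) ~= nil
   end
   if uses_gpfs(job_desc.work_dir) or uses_gpfs(job_desc.script) then
      local gres = "gpfs:ess:1"
      if job_desc.tres_per_node == nil or job_desc.tres_per_node == "" then
         job_desc.tres_per_node = gres
      else
         job_desc.tres_per_node = job_desc.tres_per_node .. "," .. gres
      end
   end
   return slurm.SUCCESS
end
```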

$ sbatch -A PZS0708 --reservation=test --nodes=2 --time=00:05:00 --wrap 'ls -la /fs/ess/scratch/PZS0708 ; scontrol show job=$SLURM_JOB_ID'
Submitted batch job 8052278

$ scontrol show job=8052278
JobId=8052278 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200010740 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2021-12-01T17:21:18 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-01T17:21:18 Scheduler=Main
   Partition=parallel-40core,parallel-48core,condo-olivucci-backfill-parallel,gpubackfill-parallel-40core,condo-osumed-gpu-48core-backfill-parallel,gpubackfill-parallel-48core,condo-datta-backfill-parallel,condo-honscheid-backfill-parallel,gpubackfill-parallel-quad,condo-ccapp-backfill-parallel,condo-osumed-gpu-quad-backfill-parallel,condo-osumed-gpu-40core-backfill-parallel AllocNode:Sid=pitzer-rw01:108063
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2-2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=9112M,node=2,billing=2,gres/gpfs:ess=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4556M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=test
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/git/slurm-exporter
   Comment=stdout=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052278.out 
   StdErr=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052278.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052278.out
   Power=
   TresPerNode=gres:gpfs:ess:1

If I try to replicate that behavior with just sbatch, with the Lua code commented out, the TRES line is different:

$ sbatch -A PZS0708 --reservation=test --nodes=2 --gres gpfs:ess --time=00:05:00 --wrap 'ls -la /fs/ess/scratch/PZS0708 ; scontrol show job=$SLURM_JOB_ID'
Submitted batch job 8052315

$ scontrol show job=8052315
JobId=8052315 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=100041830 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2021-12-01T17:25:27 EligibleTime=2021-12-01T17:25:27
   AccrueTime=2021-12-01T17:25:27
   StartTime=2021-12-01T17:25:27 EndTime=2021-12-01T17:25:29 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-01T17:25:27 Scheduler=Main
   Partition=parallel-48core AllocNode:Sid=pitzer-rw01:108063
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p[0615,0621]
   BatchHost=p0615
   NumNodes=2 NumCPUs=96 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=96,mem=364512M,node=2,billing=96,gres/gpfs:ess=0,gres/gpfs:project=0,gres/gpfs:scratch=0,gres/ime=0,gres/pfsdir=0,gres/pfsdir:ess=0,gres/pfsdir:scratch=0
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=test
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/git/slurm-exporter
   Comment=stdout=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052315.out 
   StdErr=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052315.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052315.out
   Power=
   TresPerNode=gres:gpfs:ess

The correct behavior happens when the job is submitted with sbatch directly. Our parallel partitions are forced to be whole-node, so it makes sense that each GRES is present and set to 0; the behavior through Lua, however, is wrong.

I am marking medium impact because we have commercial customers who ran into this issue and it was preventing them from getting their jobs to run.
Comment 1 Trey Dockendorf 2021-12-01 15:28:23 MST
Created attachment 22479 [details]
slurm.conf
Comment 2 Trey Dockendorf 2021-12-01 15:29:09 MST
Created attachment 22480 [details]
job_submit_lib.lua
Comment 3 Trey Dockendorf 2021-12-01 15:29:23 MST
Created attachment 22481 [details]
osc_common.lua
Comment 4 Trey Dockendorf 2021-12-01 15:29:44 MST
Created attachment 22482 [details]
job_submit.lua
Comment 5 Trey Dockendorf 2021-12-01 15:34:05 MST
I attached the job_submit.lua and the two library files we use with it.  Line 230 of job_submit.lua is currently commented out to work around the issue, but if you restore that line and remove the comments, the issue manifests itself.

This relates to bug 12642, where similar issues were resolved.  Also, the reservation I was using is this:

scontrol create reservation=test starttime=17:25:00 duration=00:10:00 nodecnt=2 features='c6420&48core' partitionname=parallel-48core accounts=PZS0708 flags=DAILY,PURGE_COMP=00:02:00,REPLACE_DOWN

I think this issue might actually be limited to jobs using reservations, because as I test this further I'm unable to reproduce the behavior outside a reservation.

Here's a job not using a reservation, with the Lua code activated:

$ sbatch -A PZS0708 --nodes=2 --time=00:05:00 --wrap 'ls -la /fs/ess/scratch/PZS0708 ; scontrol show job=$SLURM_JOB_ID'
Submitted batch job 8052376

$ scontrol show job=8052376
JobId=8052376 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200007006 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2021-12-01T17:31:38 EligibleTime=2021-12-01T17:31:38
   AccrueTime=2021-12-01T17:31:38
   StartTime=2021-12-01T17:31:43 EndTime=2021-12-01T17:31:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-01T17:31:43 Scheduler=Backfill
   Partition=parallel-48core AllocNode:Sid=pitzer-rw01:108063
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p[0692-0693]
   BatchHost=p0692
   NumNodes=2 NumCPUs=96 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=96,mem=364512M,node=2,billing=96,gres/gpfs:ess=0,gres/gpfs:project=0,gres/gpfs:scratch=0,gres/ime=0,gres/pfsdir=0,gres/pfsdir:ess=0,gres/pfsdir:scratch=0
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/git/slurm-exporter
   Comment=stdout=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052376.out 
   StdErr=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052376.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052376.out
   Power=
   TresPerNode=gres:gpfs:ess:1

So I think this might actually be an issue with a no_consume GRES added by Lua in combination with reservations.
Comment 6 Chad Vizino 2021-12-01 16:02:24 MST
I'll take a look at this.
Comment 7 Chad Vizino 2021-12-03 16:30:55 MST
Still looking at this--will focus work on it early next week.
Comment 8 Chad Vizino 2021-12-07 17:18:27 MST
I'm working on reproducing the results using your lua script and a reservation. My results show no gres:

>$ sbatch --reservation=new --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>Submitted batch job 161579
>$ scontrol show job 161579
>JobId=161579 JobName=wrap
>...
>   TRES=cpu=4,mem=5288M,node=2,billing=333344
>...
I have this set:

>$ sacct -o jobid,node -j 161579
>JobID               NodeList
>------------ ---------------
>161579        mackinac-[1-2]
>161579.batch      mackinac-1
>161579.exte+  mackinac-[1-2]
>$ scontrol show node mackinac-[1-2]|egrep '(^Node)|TRES'
>NodeName=mackinac-1 Arch=x86_64 CoresPerSocket=6
>   CfgTRES=cpu=12,mem=15865M,billing=1000065,gres/gpfs:ess=1,gres/gpfs:project=1
>   AllocTRES=
>NodeName=mackinac-2 Arch=x86_64 CoresPerSocket=6
>   CfgTRES=cpu=12,mem=15865M,billing=1000065,gres/gpfs:ess=1,gres/gpfs:project=1
>   AllocTRES=
I also don't have the "job_submit_data" module required by job_submit.lua--not sure if that's needed (I have the require commented out for now). There may be other things that differ between my config and yours, so I'll have to poke around with it some more.

Do you have a cli_filter.lua script you are using, too?
Comment 9 Trey Dockendorf 2021-12-07 17:20:44 MST
Created attachment 22577 [details]
cli_filter.lua
Comment 10 Trey Dockendorf 2021-12-07 17:23:12 MST
The code I uploaded may have had the call to "filesystem_gres" commented out; that's on line 230 of job_submit.lua.  If you uncomment it so the function is called, it should see "/fs/ess" and add the GRES. I also attached our cli_filter.lua.  The job_submit_data.lua isn't used for this particular issue; it's only used for condo assignments.
Comment 11 Chad Vizino 2021-12-09 12:24:06 MST
Created attachment 22608 [details]
localize lua scripts patch

Here are the modifications I made to your supplied lua scripts to localize them for me so they'd work on my test system.
Comment 12 Chad Vizino 2021-12-09 12:35:14 MST
Here's what I am seeing from several tests, all with the job submit and cli filter scripts in place (see comment 11). Note that filesystem_gres is being called (I uncommented it).

>$ scontrol show config|grep lua
>CliFilterPlugins        = syslog,user_defaults,lua
>JobSubmitPlugins        = lua
No reservation, lua script sets gres:

>$ scontrol show res
>No reservations in the system
>$ sbatch -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161597
>$ scontrol show job 161597|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess:1
> 
>[2021-12-09T11:47:50.877] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T11:47:50.877] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T11:47:50.877] lua: action=filesystem_gres return=gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T11:47:50.877] lua: slurm_job_submit: action=gres_defaults gres=gres:gpfs:ess:1 new_gres=gpfs:ess:1 user=chad
>[2021-12-09T11:47:50.877] debug:  lua: action=is_gpu_job return=false gres=gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess:1 tres_per_socket= tres_per_task=
Reservation, lua script sets gres:

># scontrol create reservation reservation=new nodes=mackinac-[1,2] starttime=now duration=1:00:00 users=chad
>Reservation created: new
> 
>$ sbatch --reservation=new -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161598
>$ scontrol show job 161598|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess:1
> 
>[2021-12-09T11:52:20.014] debug:  lua: action=set_whole_node gres= return=65534.0
>[2021-12-09T11:52:20.014] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T11:52:20.014] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T11:52:20.014] lua: action=filesystem_gres return=gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T11:52:20.014] lua: slurm_job_submit: action=gres_defaults gres=gres:gpfs:ess:1 new_gres=gpfs:ess:1 user=chad
>[2021-12-09T11:52:20.014] debug:  lua: action=is_gpu_job return=false gres=gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess:1 tres_per_socket= tres_per_task=
No reservation, sbatch sets gres:

>$ scontrol show res
>No reservations in the system
>$ sbatch --gres=gpfs:ess -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161599
>$ scontrol show job 161599|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess,gres:gpfs:ess:1
> 
>[2021-12-09T11:57:38.142] debug:  lua: action=set_whole_node gres=gres:gpfs:ess return=65534.0
>[2021-12-09T11:57:38.142] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T11:57:38.142] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T11:57:38.142] lua: action=filesystem_gres return=gres:gpfs:ess,gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T11:57:38.142] lua: slurm_job_submit: action=gres_defaults gres=gres:gres:gpfs:ess,gres:gpfs:ess:1 new_gres=gpfs:ess,gpfs:ess:1 user=chad
>[2021-12-09T11:57:38.142] debug:  lua: action=is_gpu_job return=false gres=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_socket= tres_per_task=
Reservation, sbatch sets gres:

># scontrol create reservation reservation=new nodes=mackinac-[1,2] starttime=now duration=1:00:00 users=chad
>Reservation created: new
> 
>$ sbatch --reservation=new --gres=gpfs:ess -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161600
>$ scontrol show job 161600|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess,gres:gpfs:ess:1
> 
>[2021-12-09T12:00:09.057] debug:  lua: action=set_whole_node gres=gres:gpfs:ess return=65534.0
>[2021-12-09T12:00:09.057] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T12:00:09.057] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T12:00:09.057] lua: action=filesystem_gres return=gres:gpfs:ess,gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T12:00:09.057] lua: slurm_job_submit: action=gres_defaults gres=gres:gres:gpfs:ess,gres:gpfs:ess:1 new_gres=gpfs:ess,gpfs:ess:1 user=chad
>[2021-12-09T12:00:09.057] debug:  lua: action=is_gpu_job return=false gres=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_socket= tres_per_task=
What do your slurmctld.log lua lines look like compared to what I got?
Comment 13 Trey Dockendorf 2021-12-09 13:56:42 MST
I think the TRES for the GRES being non-zero while pending but turning to 0 once running or completed may have thrown me off.  I tried to replicate the issue but am not able to.  The issue came to us from a customer who was having trouble getting 92-node reservation jobs to run, but that might have been caused by bug 12954, where one of our staff modified their walltime, and I just didn't recognize that issue as the cause.

When I had a job pending, this is what I saw just now during replication:

$ scontrol show job=2006362 | grep TRES
   TRES=cpu=2,mem=9112M,node=2,billing=2,gres/gpfs:ess=2


Once the job ran I saw this:

$ scontrol show job=2006362 | grep TRES
   TRES=cpu=80,mem=364480M,node=2,billing=80,gres/gpfs:ess=0,gres/gpfs:project=0,gres/gpfs:scratch=0,gres/ime=0,gres/pfsdir=0,gres/pfsdir:ess=0,gres/pfsdir:scratch=0

I will have to see if we can find a time to re-enable the filesystem GRES on our systems and try to replicate this in a larger environment.
Comment 15 Chad Vizino 2021-12-16 15:59:06 MST
Side issue: We are going to add a note to the scontrol documentation about the count being zero when generic resources are set to no_consume (similar to the note in the sacct documentation).
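
For reference, a no_consume GRES of this sort is typically declared along these lines (a sketch only; the actual slurm.conf and gres.conf for this site are attached, and the node range and exact flag syntax here are illustrative assumptions that may vary by Slurm version):

```
# slurm.conf (sketch): make the gpfs GRES type known to slurmctld
GresTypes=gpfs

# gres.conf (sketch): a shareable resource that jobs request but never deplete,
# which is why scontrol/sacct report its allocated count as 0
NodeName=p[0601-0700] Name=gpfs Type=ess Count=1 Flags=no_consume
```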

Closing for now. Feel free to reopen or submit a new ticket if the comment 0 issue turns out to be a problem later.