| Summary: | A no_consume GRES added in job_submit.lua has a non-zero value | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Trey Dockendorf <tdockendorf> |
| Component: | Scheduling | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | chad, troy |
| Version: | 21.08.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ohio State OSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | gres.conf, slurm.conf, job_submit_lib.lua, osc_common.lua, job_submit.lua, cli_filter.lua, localize lua scripts patch | | |
Created attachment 22479 [details]
slurm.conf
Created attachment 22480 [details]
job_submit_lib.lua
Created attachment 22481 [details]
osc_common.lua
Created attachment 22482 [details]
job_submit.lua
I attached the job_submit.lua and the two library files we use with it. Line 230 of job_submit.lua is currently commented out to work around the issue, but if you put that line back and remove the comments, the issue manifests itself. This relates to bug 12642, where similar issues were resolved.

The reservation I was using is this:

scontrol create reservation=test starttime=17:25:00 duration=00:10:00 nodecnt=2 features='c6420&48core' partitionname=parallel-48core accounts=PZS0708 flags=DAILY,PURGE_COMP=00:02:00,REPLACE_DOWN

I think this issue might actually be limited to jobs using reservations, because as I test this further I'm unable to reproduce the behavior outside a reservation. Here's a job not using a reservation, with the Lua code activated:

$ sbatch -A PZS0708 --nodes=2 --time=00:05:00 --wrap 'ls -la /fs/ess/scratch/PZS0708 ; scontrol show job=$SLURM_JOB_ID'
Submitted batch job 8052376
$ scontrol show job=8052376
JobId=8052376 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200007006 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2021-12-01T17:31:38 EligibleTime=2021-12-01T17:31:38
   AccrueTime=2021-12-01T17:31:38
   StartTime=2021-12-01T17:31:43 EndTime=2021-12-01T17:31:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-01T17:31:43 Scheduler=Backfill
   Partition=parallel-48core AllocNode:Sid=pitzer-rw01:108063
   ReqNodeList=(null) ExcNodeList=(null) NodeList=p[0692-0693]
   BatchHost=p0692
   NumNodes=2 NumCPUs=96 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=96,mem=364512M,node=2,billing=96,gres/gpfs:ess=0,gres/gpfs:project=0,gres/gpfs:scratch=0,gres/ime=0,gres/pfsdir=0,gres/pfsdir:ess=0,gres/pfsdir:scratch=0
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/git/slurm-exporter
   Comment=stdout=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052376.out
   StdErr=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052376.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052376.out
   Power=
   TresPerNode=gres:gpfs:ess:1

So I think this actually might be an issue with no_consume GRES added by Lua in combination with reservations.

I'll take a look at this.

Still looking at this--will focus work on it early next week.

I'm working on reproducing the results using your lua script and a reservation. My results show no gres:

>$ sbatch --reservation=new --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>Submitted batch job 161579
>$ scontrol show job 161579
>JobId=161579 JobName=wrap
>...
> TRES=cpu=4,mem=5288M,node=2,billing=333344
>...

I have this set:

>$ sacct -o jobid,node -j 161579
>JobID        NodeList
>------------ ---------------
>161579       mackinac-[1-2]
>161579.batch mackinac-1
>161579.exte+ mackinac-[1-2]
>$ scontrol show node mackinac-[1-2]|egrep '(^Node)|TRES'
>NodeName=mackinac-1 Arch=x86_64 CoresPerSocket=6
>   CfgTRES=cpu=12,mem=15865M,billing=1000065,gres/gpfs:ess=1,gres/gpfs:project=1
>   AllocTRES=
>NodeName=mackinac-2 Arch=x86_64 CoresPerSocket=6
>   CfgTRES=cpu=12,mem=15865M,billing=1000065,gres/gpfs:ess=1,gres/gpfs:project=1
>   AllocTRES=

I also don't have "job_submit_data" required by job_submit.lua--not sure if that's needed (I have the require commented out for now). There may be some other things that don't line up between my config and yours, so I'll have to poke around with it some more. Do you have a cli_filter.lua script you are using, too?

Created attachment 22577 [details]
cli_filter.lua
The code I uploaded may have had the call to "filesystem_gres" commented out; that's on line 230 of job_submit.lua. If you remove the comment so that function gets called, it should see "/fs/ess" and add the GRES. I also attached our cli_filter.lua. The job_submit_data.lua isn't used for this particular issue; it's only used for condo assignments.

Created attachment 22608 [details]
localize lua scripts patch
Here are the modifications I made to your supplied Lua scripts to localize them so they'd work on my test system.
Here's what I am seeing from several tests, all with the job submit and cli filter scripts in place (see comment 11). Note that filesystem_gres is being called (I uncommented it).

>$ scontrol show config|grep lua
>CliFilterPlugins = syslog,user_defaults,lua
>JobSubmitPlugins = lua

No reservation, lua script sets the gres:

>$ scontrol show res
>No reservations in the system
>$ sbatch -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161597
>$ scontrol show job 161597|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess:1
>
>[2021-12-09T11:47:50.877] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T11:47:50.877] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T11:47:50.877] lua: action=filesystem_gres return=gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T11:47:50.877] lua: slurm_job_submit: action=gres_defaults gres=gres:gpfs:ess:1 new_gres=gpfs:ess:1 user=chad
>[2021-12-09T11:47:50.877] debug: lua: action=is_gpu_job return=false gres=gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess:1 tres_per_socket= tres_per_task=

Reservation, lua script sets the gres:

># scontrol create reservation reservation=new nodes=mackinac-[1,2] starttime=now duration=1:00:00 users=chad
>Reservation created: new
>
>$ sbatch --reservation=new -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161598
>$ scontrol show job 161598|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess:1
>
>[2021-12-09T11:52:20.014] debug: lua: action=set_whole_node gres= return=65534.0
>[2021-12-09T11:52:20.014] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T11:52:20.014] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T11:52:20.014] lua: action=filesystem_gres return=gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T11:52:20.014] lua: slurm_job_submit: action=gres_defaults gres=gres:gpfs:ess:1 new_gres=gpfs:ess:1 user=chad
>[2021-12-09T11:52:20.014] debug: lua: action=is_gpu_job return=false gres=gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess:1 tres_per_socket= tres_per_task=

No reservation, sbatch sets the gres:

>$ scontrol show res
>No reservations in the system
>$ sbatch --gres=gpfs:ess -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161599
>$ scontrol show job 161599|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess,gres:gpfs:ess:1
>
>[2021-12-09T11:57:38.142] debug: lua: action=set_whole_node gres=gres:gpfs:ess return=65534.0
>[2021-12-09T11:57:38.142] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T11:57:38.142] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T11:57:38.142] lua: action=filesystem_gres return=gres:gpfs:ess,gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T11:57:38.142] lua: slurm_job_submit: action=gres_defaults gres=gres:gres:gpfs:ess,gres:gpfs:ess:1 new_gres=gpfs:ess,gpfs:ess:1 user=chad
>[2021-12-09T11:57:38.142] debug: lua: action=is_gpu_job return=false gres=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_socket= tres_per_task=

Reservation, sbatch sets the gres:

># scontrol create reservation reservation=new nodes=mackinac-[1,2] starttime=now duration=1:00:00 users=chad
>Reservation created: new
>
>$ sbatch --reservation=new --gres=gpfs:ess -A test --nodes=2 --wrap="ls /fs/ess/scratch/PZS0708 2>/dev/null"
>sbatch: job_submit.lua called
>Submitted batch job 161600
>$ scontrol show job 161600|grep -i tres
>   TRES=cpu=4,mem=5288M,node=2,billing=333344,gres/gpfs:ess=0
>   TresPerNode=gres:gpfs:ess,gres:gpfs:ess:1
>
>[2021-12-09T12:00:09.057] debug: lua: action=set_whole_node gres=gres:gpfs:ess return=65534.0
>[2021-12-09T12:00:09.057] lua: slurm_job_submit: action=set_whole_node shared=65534.0 user=chad
>[2021-12-09T12:00:09.057] lua: action=set_environments user=chad SLURM_TIME_LIMIT=3600 PBS_WALLTIME=3600
>[2021-12-09T12:00:09.057] lua: action=filesystem_gres return=gres:gpfs:ess,gpfs:ess:1 assign_work_dir=false assign_script=ess fs_gres="{ [1] = gpfs:ess:1,} " user=chad
>[2021-12-09T12:00:09.057] lua: slurm_job_submit: action=gres_defaults gres=gres:gres:gpfs:ess,gres:gpfs:ess:1 new_gres=gpfs:ess,gpfs:ess:1 user=chad
>[2021-12-09T12:00:09.057] debug: lua: action=is_gpu_job return=false gres=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_job= tres_per_node=gres:gpfs:ess,gres:gpfs:ess:1 tres_per_socket= tres_per_task=

What do your slurmctld.log lua lines look like compared to what I got?

I think that maybe the TRES for the GRES being non-zero while pending but turning to 0 once running or completed is what threw me off. I tried to replicate the issue but am not able to. The issue came to us from a customer who was having trouble getting 92-node reservation jobs to run, but that might have been an issue from bug 12954, where one of our staff modified their walltime and I just didn't see that issue as the cause.

When I had a job pending, this is what I saw just now during replication:

$ scontrol show job=2006362 | grep TRES
   TRES=cpu=2,mem=9112M,node=2,billing=2,gres/gpfs:ess=2

Once the job ran I saw this:

$ scontrol show job=2006362 | grep TRES
   TRES=cpu=80,mem=364480M,node=2,billing=80,gres/gpfs:ess=0,gres/gpfs:project=0,gres/gpfs:scratch=0,gres/ime=0,gres/pfsdir=0,gres/pfsdir:ess=0,gres/pfsdir:scratch=0

I will have to see if we can find a time to re-enable the filesystem GRES on our systems and see if we can replicate in a larger environment.
Side issue: We are going to add a note to the scontrol doc about the count being zero when generic resources are set to no_consume (similar to the one in the sacct doc).

Closing for now. Feel free to reopen or submit a new ticket if the comment 0 issue turns out to be a problem later.
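For readers unfamiliar with no_consume, this kind of GRES is declared on the node line in slurm.conf. Our actual configs are in the attachments; the node name, CPU/memory values, and GRES list below are only an illustrative sketch of the form such a definition takes:

```
# slurm.conf (illustrative sketch, not the attached config)
GresTypes=gpfs
NodeName=p0692 CPUs=48 RealMemory=187000 Gres=gpfs:ess:no_consume:1,gpfs:project:no_consume:1,gpfs:scratch:no_consume:1
```

With no_consume, the GRES marks a node capability without being deducted from the node's available count, which is why a running job reports the TRES count as 0.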
Created attachment 22478 [details]
gres.conf

We use job_submit.lua to add a GPFS GRES based on various properties of a job, such as whether the filesystem is referenced in the job script or the submit directory is on GPFS. It seems that when I submit a job meeting this condition, the GRES added is not 0; notice how the TRES= line has gres/gpfs:ess=2.

$ sbatch -A PZS0708 --reservation=test --nodes=2 --time=00:05:00 --wrap 'ls -la /fs/ess/scratch/PZS0708 ; scontrol show job=$SLURM_JOB_ID'
Submitted batch job 8052278
$ scontrol show job=8052278
JobId=8052278 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200010740 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2021-12-01T17:21:18 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-01T17:21:18 Scheduler=Main
   Partition=parallel-40core,parallel-48core,condo-olivucci-backfill-parallel,gpubackfill-parallel-40core,condo-osumed-gpu-48core-backfill-parallel,gpubackfill-parallel-48core,condo-datta-backfill-parallel,condo-honscheid-backfill-parallel,gpubackfill-parallel-quad,condo-ccapp-backfill-parallel,condo-osumed-gpu-quad-backfill-parallel,condo-osumed-gpu-40core-backfill-parallel
   AllocNode:Sid=pitzer-rw01:108063
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=2-2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=9112M,node=2,billing=2,gres/gpfs:ess=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4556M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=test
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/git/slurm-exporter
   Comment=stdout=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052278.out
   StdErr=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052278.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052278.out
   Power=
   TresPerNode=gres:gpfs:ess:1

If I try to replicate that behavior with just sbatch, with the Lua code commented out, the TRES line is different:

$ sbatch -A PZS0708 --reservation=test --nodes=2 --gres gpfs:ess --time=00:05:00 --wrap 'ls -la /fs/ess/scratch/PZS0708 ; scontrol show job=$SLURM_JOB_ID'
Submitted batch job 8052315
$ scontrol show job=8052315
JobId=8052315 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=100041830 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2021-12-01T17:25:27 EligibleTime=2021-12-01T17:25:27
   AccrueTime=2021-12-01T17:25:27
   StartTime=2021-12-01T17:25:27 EndTime=2021-12-01T17:25:29 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-01T17:25:27 Scheduler=Main
   Partition=parallel-48core AllocNode:Sid=pitzer-rw01:108063
   ReqNodeList=(null) ExcNodeList=(null) NodeList=p[0615,0621]
   BatchHost=p0615
   NumNodes=2 NumCPUs=96 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=96,mem=364512M,node=2,billing=96,gres/gpfs:ess=0,gres/gpfs:project=0,gres/gpfs:scratch=0,gres/ime=0,gres/pfsdir=0,gres/pfsdir:ess=0,gres/pfsdir:scratch=0
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=test
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/git/slurm-exporter
   Comment=stdout=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052315.out
   StdErr=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052315.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/git/slurm-exporter/slurm-8052315.out
   Power=
   TresPerNode=gres:gpfs:ess

The correct behavior is happening when the job is submitted with sbatch. Our parallel partitions are forced to be whole-node, so it makes sense that each GRES is there and set to 0; the behavior through Lua, however, is wrong. I am marking this medium impact because we have commercial customers who ran into this issue and it was preventing them from getting their jobs to run.
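For context, the general shape of what our job_submit.lua does is roughly the following. This is a hypothetical sketch, not the code from the attachments: the filesystem_gres helper, the "/fs/ess" pattern check, and the tres_per_node handling are illustrative only (the real logic lives in the attached job_submit.lua and osc_common.lua).

```lua
-- Hypothetical sketch of the filesystem_gres idea; names and patterns are
-- illustrative, the real implementation is in the attached Lua scripts.
local function filesystem_gres(job_desc)
   -- Tag the job if the submit directory is on the ESS GPFS filesystem
   -- or the batch script references it.
   if (job_desc.work_dir and job_desc.work_dir:match("^/fs/ess")) or
      (job_desc.script and job_desc.script:find("/fs/ess", 1, true)) then
      return "gpfs:ess:1"
   end
   return nil
end

function slurm_job_submit(job_desc, part_list, submit_uid)
   local fs_gres = filesystem_gres(job_desc)
   if fs_gres then
      -- Append to any user-requested per-node GRES rather than replacing it.
      if job_desc.tres_per_node and job_desc.tres_per_node ~= "" then
         job_desc.tres_per_node = job_desc.tres_per_node .. "," .. fs_gres
      else
         job_desc.tres_per_node = fs_gres
      end
   end
   return slurm.SUCCESS
end
```

The symptom in comment 0 is that a GRES added this way shows up as gres/gpfs:ess=2 in the pending job's TRES, whereas the same no_consume GRES requested via sbatch --gres shows up as 0.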