We're still on 19.05.6 and planning to make the jump to 20.11 in our next maintenance window in April. A user reported an issue with the --switches parameter that I've been able to reproduce. It works as expected most of the time, but in cases where the user is requesting entire nodes, it seems to just pull nodes from anywhere. Our topology file looks like:

SwitchName=r1l1 Nodes=r1u0[3-9]n[1-2],r1u1[0-8]n[1-2],r1u2[5-9]n[1-2],r1u3[0-9]n[1-2],r1u40n[1-2]
SwitchName=r2l1 Nodes=r2u0[3-9]n[1-2],r2u1[0-8]n[1-2],r2u2[5-9]n[1-2],r2u3[0-8]n[1-2]
SwitchName=r3l1 Nodes=r3u0[3-9]n[1-2],r3u1[0-8]n[1-2],r3u2[5-9]n[1-2],r3u3[0-6]n[1-2]
SwitchName=r4l1 Nodes=r4u0[3-9]n[1-2],r4u1[0-8]n[1-2],r4u2[5-9]n[1-2],r4u3[0-6]n[1-2]
SwitchName=r5l1 Nodes=r5u19n1,r5u24n1,r5u31n1,r5u29n1,r5u27n1,r5u25n1,r5u17n1,r5u15n1,r5u13n1,r5u11n1,r5u09n1
SwitchName=spine1 Switches=r1l1,r2l1,r3l1,r4l1
SwitchName=spine2 Switches=r1l1,r2l1,r3l1,r4l1
SwitchName=spine3 Switches=r1l1,r2l1,r3l1,r4l1
SwitchName=spine4 Switches=r1l1,r2l1,r3l1,r4l1

Using:

#SBATCH --nodes=6
#SBATCH --ntasks=564
#SBATCH --ntasks-per-node=94
#SBATCH --mem=1GB
#SBATCH --time=00:10:00
#SBATCH --job-name=slurm-switch-test
#SBATCH --account=hpcteam
#SBATCH --qos=user_qos_bjoyce3
#SBATCH --partition=standard
#SBATCH --output=slurm-standard-test.out
#SBATCH --switches=1

got me:

SLURM_JOB_NODELIST=r2u03n1,r2u11n2,r2u16n2,r4u03n1,r4u09n1,r4u10n2

but:

#SBATCH --nodes=6
#SBATCH --ntasks=120
#SBATCH --ntasks-per-node=20
#SBATCH --mem=1GB
#SBATCH --time=00:10:00
#SBATCH --job-name=slurm-switch-test
#SBATCH --account=hpcteam
#SBATCH --qos=user_qos_bjoyce3
#SBATCH --partition=standard
#SBATCH --output=slurm-standard-test.out
#SBATCH --switches=1

got me something more like what I had expected:

SLURM_JOB_NODELIST=r1u07n2,r1u16n2,r1u30n2,r1u32n[1-2],r1u37n2
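For reference, sbatch also accepts an optional maximum wait time alongside the switch count; the directive below is only an illustrative sketch of that syntax (the 12:00:00 value is a made-up example), not part of the original reproduction:

```
# From sbatch(1): --switches=<count>[@<max-time>]
# e.g. ask for an allocation spanning at most 1 leaf switch,
# waiting up to 12 hours for it (illustrative value):
#SBATCH --switches=1@12:00:00
```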
Hi,

Could you send me slurm.conf? Without logs, or at least output from "scontrol show job <affected job_id>", I can only speculate. But this could be related to "max_switch_wait", which is 300 seconds by default.

man slurm.conf:
...
max_switch_wait=#
    Maximum number of seconds that a job can delay execution waiting for the specified desired switch count. The default value is 300 seconds.
...

Dominik
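A minimal sketch of raising that limit, assuming a 12-hour cap is wanted: max_switch_wait takes a value in seconds inside SchedulerParameters (the key names come from the slurm.conf man page excerpt above; the 12-hour figure is just an example):

```shell
# max_switch_wait is expressed in seconds, so a 12-hour cap works out to:
echo $((12 * 60 * 60))    # 43200
# The corresponding slurm.conf entry would then read something like:
#   SchedulerParameters=max_switch_wait=43200
# followed by "scontrol reconfigure" to pick up the change.
```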
Thanks, Dominik. I'm sure that's it. I was only looking at the man page for sbatch; it mentions the max switch wait, but it seemed to imply that there was no limit if the default was unset. I'll bump that setting up. You can close this out.

Thanks,
Todd
Hi,

I am marking the bug as resolved. Feel free to reopen it if you have a follow-up question.

Dominik
Hi Dominik,

I've bumped the limit up to 1 hour and then to 12 hours, and that does indeed seem to be what was causing it to drop the switches requirement. It's brought up a new, tangentially related question: these jobs are being submitted with a QOS that should preempt jobs in our windfall partition, but that does not seem to happen until the switch requirement has been removed.

root@ericidle:~ # scontrol show job 590692
JobId=590692 JobName=SCCM_Diag
   UserId=mmazloff(25123) GroupId=staff(340) MCS_label=N/A
   Priority=24 Nice=0 Account=jrussell QOS=user_qos_jrussell
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=06:24:51 TimeLimit=5-03:00:00 TimeMin=N/A
   SubmitTime=2021-02-18T13:35:22 EligibleTime=2021-02-18T13:35:22
   AccrueTime=2021-02-18T13:35:22
   StartTime=2021-02-19T01:35:38 EndTime=2021-02-24T04:35:38 Deadline=N/A
   PreemptEligibleTime=2021-02-19T01:35:38 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-19T01:35:38
   Partition=standard AllocNode:Sid=wentletrap:13272
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r1u07n1,r1u12n1,r1u15n2,r1u16n1,r2u09n2,r3u03n1
   BatchHost=r1u07n1
   NumNodes=6 NumCPUs=576 NumTasks=564 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=576,mem=2592G,node=6,billing=576
   Socks/Node=* NtasksPerN:B:S:C=94:0:*:* CoreSpec=*
   MinCPUsNode=94 MinMemoryNode=432G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng/run_so6_puma_wind.sh
   WorkDir=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng
   StdErr=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng/run_so6_puma_wind.sh.e590692
   StdIn=/dev/null
   StdOut=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng/run_so6_puma_wind.sh.o590692
   Switches=1@12:00:00
   Power=

user_qos_jrussell was created with:

sacctmgr -i add qos user_qos_jrussell Priority=5 Preempt=part_qos_windfall Flags=OverPartQOS GrpTRES=cpu=4442,gres/gpu:volta=0 GrpTRESMins=cpu=50457600 GrpJobs=2000 GrpSubmit=2000
Though I can't see where the OverPartQOS setting is represented in the scontrol output:

root@ericidle:~ # scontrol show assoc_mgr flags=qos qos=user_qos_jrussell
Current Association Manager state

QOS Records

QOS=user_qos_jrussell(101) UsageRaw=1278054214.000000
   GrpJobs=2000(2) GrpJobsAccrue=N(0) GrpSubmitJobs=2000(2) GrpWall=N(172888.48)
   GrpTRES=cpu=4442(1140),mem=N(5308416),energy=N(0),node=N(12),billing=N(1140),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=0(0)
   GrpTRESMins=cpu=50457600(21300903),mem=N(98429041180),energy=N(0),node=N(223610),billing=N(21300903),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
   GrpTRESRunMins=cpu=N(4522354),mem=N(20887687987),energy=N(0),node=N(47217),billing=N(4522354),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
   MaxWallPJ= MaxTRESPJ=mem=52428800 MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh= MinTRESPJ=
   PreemptMode=OFF Priority=5
   Account Limits
   jrussell
      MaxJobsPA=N(2) MaxJobsAccruePA=N(0) MaxSubmitJobsPA=N(2)
      MaxTRESPA=cpu=N(1140),mem=N(5308416),energy=N(0),node=N(12),billing=N(1140),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
   User Limits
   25123
      MaxJobsPU=N(1) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(1)
      MaxTRESPU=cpu=N(576),mem=N(2654208),energy=N(0),node=N(6),billing=N(576),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
   48946
      MaxJobsPU=N(1) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(1)
      MaxTRESPU=cpu=N(564),mem=N(2654208),energy=N(0),node=N(6),billing=N(564),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)

My questions are: is this the expected behavior, and if so, is there a way to get preemption to trigger? I'll attach our slurm config as well. Thanks!
Created attachment 18020 [details] slurm.conf
Hi,

Could you run your test job again, but this time with extra logging active on the controller? e.g.:

> scontrol setdebug debug3
> scontrol setdebugflags +selecttype

To revert the extra logging:

> scontrol setdebug info
> scontrol setdebugflags -selecttype

Dominik
Created attachment 18063 [details] slurmctld.log
Thanks, Dominik. I've uploaded a slurmctld log with those settings. Job ID 609130 is an example of a high-QOS job that sits waiting on switches until the timeout.
Hi,

Unfortunately, prior to 20.02.6, cons_tres had a completely broken "--switches" option. This was fixed by:
https://github.com/SchedMD/slurm/commit/f2eef3cd6ab

Sorry that I didn't catch this before.

Dominik
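As a quick post-upgrade sanity check, a sort -V comparison can confirm the running release is at or past the fix. The hard-coded version string below is a placeholder; on a live system it would come from `sinfo --version`:

```shell
# Is the installed Slurm release >= 20.02.6 (the first release with the
# cons_tres --switches fix, per this thread)?
have="20.11.3"   # placeholder; e.g. have=$(sinfo --version | awk '{print $2}')
need="20.02.6"
# sort -V orders version strings; if the smallest of the pair is $need,
# then $have is at least $need.
if [ "$(printf '%s\n' "$need" "$have" | sort -V | head -n1)" = "$need" ]; then
    echo "fixed"
else
    echo "affected"
fi
```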
Thanks! We're planning to move to 20.11 shortly, so it looks like this will be resolved by that update. You can go ahead and close this out.

Thanks,
Todd