Ticket 10878

Summary: issue with --switches not working in some cases
Product: Slurm    Reporter: Todd Merritt <tmerritt>
Component: Scheduling    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: --- CC: chrisreidy
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: U of AZ
Attachments: slurm.conf
slurmctld.log

Description Todd Merritt 2021-02-16 13:47:08 MST
We're still on 19.05.6 and planning to make the jump to 20.11 in our next maintenance window in April. A user reported an issue with the --switches parameter, though, which I've been able to reproduce: it works as expected most of the time, but when the user requests entire nodes it seems to just pull nodes from anywhere.

Our topology file looks like

SwitchName=r1l1 Nodes=r1u0[3-9]n[1-2],r1u1[0-8]n[1-2],r1u2[5-9]n[1-2],r1u3[0-9]n[1-2],r1u40n[1-2]
SwitchName=r2l1 Nodes=r2u0[3-9]n[1-2],r2u1[0-8]n[1-2],r2u2[5-9]n[1-2],r2u3[0-8]n[1-2]
SwitchName=r3l1 Nodes=r3u0[3-9]n[1-2],r3u1[0-8]n[1-2],r3u2[5-9]n[1-2],r3u3[0-6]n[1-2]
SwitchName=r4l1 Nodes=r4u0[3-9]n[1-2],r4u1[0-8]n[1-2],r4u2[5-9]n[1-2],r4u3[0-6]n[1-2]
SwitchName=r5l1 Nodes=r5u19n1,r5u24n1,r5u31n1,r5u29n1,r5u27n1,r5u25n1,r5u17n1,r5u15n1,r5u13n1,r5u11n1,r5u09n1
SwitchName=spine1 Switches=r1l1,r2l1,r3l1,r4l1
SwitchName=spine2 Switches=r1l1,r2l1,r3l1,r4l1
SwitchName=spine3 Switches=r1l1,r2l1,r3l1,r4l1
SwitchName=spine4 Switches=r1l1,r2l1,r3l1,r4l1
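As a quick sanity check on the topology above, the bracketed host ranges can be expanded and counted per leaf switch to confirm that a single leaf really can hold a 6-node request. A rough sketch of such an expansion follows (a hypothetical helper, not part of Slurm; `scontrol show topology` is the native way to inspect this):

```python
import re
from itertools import product

def expand(expr):
    """Expand one Slurm-style bracketed hostname expression,
    e.g. 'r1u0[3-9]n[1-2]' -> ['r1u03n1', 'r1u03n2', ..., 'r1u09n2']."""
    parts = re.split(r'(\[[^\]]*\])', expr)
    choices = []
    for part in parts:
        if part.startswith('['):
            items = []
            for piece in part[1:-1].split(','):
                if '-' in piece:
                    lo, hi = piece.split('-')
                    # preserve zero padding (e.g. '03' stays two digits)
                    items += [str(i).zfill(len(lo))
                              for i in range(int(lo), int(hi) + 1)]
                else:
                    items.append(piece)
            choices.append(items)
        elif part:
            choices.append([part])
    return [''.join(c) for c in product(*choices)]

# Nodes= value of leaf switch r1l1 from the topology file above
r1l1 = ("r1u0[3-9]n[1-2],r1u1[0-8]n[1-2],r1u2[5-9]n[1-2],"
        "r1u3[0-9]n[1-2],r1u40n[1-2]")
# split only on commas that are not inside brackets
exprs = re.split(r',(?![^\[]*\])', r1l1)
nodes = [n for e in exprs for n in expand(e)]
print(len(nodes))  # 64 nodes under r1l1, so a 6-node job fits on one leaf
```

With 64 nodes under each of the r1-r4 leaves, both test jobs below should be satisfiable with --switches=1.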

Using 
#SBATCH --nodes=6
#SBATCH --ntasks=564
#SBATCH --ntasks-per-node=94
#SBATCH --mem=1GB
#SBATCH --time=00:10:00
#SBATCH --job-name=slurm-switch-test
#SBATCH --account=hpcteam
#SBATCH --qos=user_qos_bjoyce3
#SBATCH --partition=standard
#SBATCH --output=slurm-standard-test.out
#SBATCH --switches=1

got me

SLURM_JOB_NODELIST=r2u03n1,r2u11n2,r2u16n2,r4u03n1,r4u09n1,r4u10n2

but 
#SBATCH --nodes=6
#SBATCH --ntasks=120
#SBATCH --ntasks-per-node=20
#SBATCH --mem=1GB
#SBATCH --time=00:10:00
#SBATCH --job-name=slurm-switch-test
#SBATCH --account=hpcteam
#SBATCH --qos=user_qos_bjoyce3
#SBATCH --partition=standard
#SBATCH --output=slurm-standard-test.out
#SBATCH --switches=1

got me something more like what I had expected

SLURM_JOB_NODELIST=r1u07n2,r1u16n2,r1u30n2,r1u32n[1-2],r1u37n2
Comment 1 Dominik Bartkiewicz 2021-02-18 03:01:14 MST
Hi

Could you send me your slurm.conf?
Without a log, or at least the output of "scontrol show job <affected job_id>", I can only speculate.
This may be related to "max_switch_wait", which defaults to 300 seconds.


man slurm.conf:
...
max_switch_wait=#
        Maximum  number  of  seconds  that a job can delay execution waiting for the
        specified desired switch count. The default value is 300 seconds.
...

Dominik
Comment 2 Todd Merritt 2021-02-18 05:37:00 MST
Thanks, Dominik. I'm sure that's it. I was only looking at the sbatch man page; it mentions the max switch wait, but it seemed to imply that there was no limit if the default was unset. I'll bump that setting up. You can close this out.

Thanks,
Todd
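For reference, raising the wait limit is done via SchedulerParameters in slurm.conf; a sketch (value illustrative; the option joins any existing comma-separated SchedulerParameters):

```
# slurm.conf: allow jobs to wait up to 12 hours for the requested switch count
SchedulerParameters=max_switch_wait=43200
```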

Comment 3 Dominik Bartkiewicz 2021-02-18 06:43:33 MST
Hi

I am marking the bug as resolved.
Feel free to reopen it if you have a follow-up question.

Dominik
Comment 4 Todd Merritt 2021-02-19 08:13:52 MST
Hi Dominik,

I've bumped the limit up to 1 hour and then to 12 hours, and the timeout does indeed seem to be what was causing the switches requirement to be dropped. It's brought up a new, tangentially related question: these jobs are being submitted with a QOS that should preempt jobs in our windfall partition, but that does not seem to happen until the switch requirement has been removed.

root@ericidle:~ # scontrol show job 590692
JobId=590692 JobName=SCCM_Diag
   UserId=mmazloff(25123) GroupId=staff(340) MCS_label=N/A
   Priority=24 Nice=0 Account=jrussell QOS=user_qos_jrussell
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=06:24:51 TimeLimit=5-03:00:00 TimeMin=N/A
   SubmitTime=2021-02-18T13:35:22 EligibleTime=2021-02-18T13:35:22
   AccrueTime=2021-02-18T13:35:22
   StartTime=2021-02-19T01:35:38 EndTime=2021-02-24T04:35:38 Deadline=N/A
   PreemptEligibleTime=2021-02-19T01:35:38 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-19T01:35:38
   Partition=standard AllocNode:Sid=wentletrap:13272
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r1u07n1,r1u12n1,r1u15n2,r1u16n1,r2u09n2,r3u03n1
   BatchHost=r1u07n1
   NumNodes=6 NumCPUs=576 NumTasks=564 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=576,mem=2592G,node=6,billing=576
   Socks/Node=* NtasksPerN:B:S:C=94:0:*:* CoreSpec=*
   MinCPUsNode=94 MinMemoryNode=432G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng/run_so6_puma_wind.sh
   WorkDir=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng
   StdErr=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng/run_so6_puma_wind.sh.e590692
   StdIn=/dev/null
   StdOut=/xdisk/jrussell/SOSE/SO6/SO6_DiagBlng/run_so6_puma_wind.sh.o590692
   Switches=1@12:00:00
   Power=

user_qos_jrussell was created with

sacctmgr -i add qos user_qos_jrussell Priority=5 Preempt=part_qos_windfall Flags=OverPartQOS GrpTRES=cpu=4442,gres/gpu:volta=0 GrpTRESMins=cpu=50457600 GrpJobs=2000 GrpSubmit=2000

I can't see where the OverPartQOS flag is represented in the scontrol output, though.

root@ericidle:~ # scontrol show assoc_mgr flags=qos qos=user_qos_jrussell
Current Association Manager state

QOS Records

QOS=user_qos_jrussell(101)
    UsageRaw=1278054214.000000
    GrpJobs=2000(2) GrpJobsAccrue=N(0) GrpSubmitJobs=2000(2) GrpWall=N(172888.48)
    GrpTRES=cpu=4442(1140),mem=N(5308416),energy=N(0),node=N(12),billing=N(1140),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=0(0)
    GrpTRESMins=cpu=50457600(21300903),mem=N(98429041180),energy=N(0),node=N(223610),billing=N(21300903),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    GrpTRESRunMins=cpu=N(4522354),mem=N(20887687987),energy=N(0),node=N(47217),billing=N(4522354),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    MaxWallPJ=
    MaxTRESPJ=mem=52428800
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh= 
    MinTRESPJ=
    PreemptMode=OFF
    Priority=5
    Account Limits
      jrussell
        MaxJobsPA=N(2) MaxJobsAccruePA=N(0) MaxSubmitJobsPA=N(2)
        MaxTRESPA=cpu=N(1140),mem=N(5308416),energy=N(0),node=N(12),billing=N(1140),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    User Limits
      25123
        MaxJobsPU=N(1) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(1)
        MaxTRESPU=cpu=N(576),mem=N(2654208),energy=N(0),node=N(6),billing=N(576),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
      48946
        MaxJobsPU=N(1) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(1)
        MaxTRESPU=cpu=N(564),mem=N(2654208),energy=N(0),node=N(6),billing=N(564),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)

My questions are: is this the expected behavior, and if so, is there a way to get preemption to trigger? I'll attach our slurm.conf as well. Thanks!
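For completeness, the QOS preemption settings can also be inspected directly with sacctmgr (a command sketch; exact output layout depends on the Slurm version):

```
sacctmgr show qos user_qos_jrussell format=Name,Priority,Preempt,Flags
```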
Comment 5 Todd Merritt 2021-02-19 08:14:36 MST
Created attachment 18020 [details]
slurm.conf
Comment 7 Dominik Bartkiewicz 2021-02-22 08:21:41 MST
Hi

Could you run your test job again, this time with extra logging active on the controller?
e.g.:
> scontrol setdebug debug3
> scontrol setdebugflags +selecttype

to revert the extra logging:
> scontrol setdebug info
> scontrol setdebugflags -selecttype

Dominik
Comment 8 Todd Merritt 2021-02-23 09:08:22 MST
Created attachment 18063 [details]
slurmctld.log
Comment 9 Todd Merritt 2021-02-23 09:09:35 MST
Thanks, Dominik. I've uploaded a slurmctld log with those settings. Job 609130 is an example of a high-QOS job that sits waiting on switches until the timeout.
Comment 10 Dominik Bartkiewicz 2021-02-24 08:49:20 MST
Hi

Unfortunately, the cons_tres plugin's "--switches" option was completely broken until 20.02.6.
This was fixed by:
https://github.com/SchedMD/slurm/commit/f2eef3cd6ab
Sorry that I didn't catch this before.

Dominik
Comment 11 Todd Merritt 2021-02-24 09:01:15 MST
Thanks! We're planning to go to 20.11 shortly, so it looks like this will be resolved in that update. You can go ahead and close this out.

Thanks,
Todd


