Ticket 14557

Summary: Nodes idle, jobs pending
Product: Slurm Reporter: Jurij Pečar <jurij.pecar>
Component: Scheduling Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact: Brian Christiansen <brian>
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=15720
https://bugs.schedmd.com/show_bug.cgi?id=18270
https://bugs.schedmd.com/show_bug.cgi?id=19146
Site: EMBL Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 23.02pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: sdiag output
current slurm.conf
slurmctld.log and scheduler.log
all jobs
debug logs (over 5GB)
debug logs with Power
current node & jobs state
debugging patch
slurmctld log with debug patch
10 minutes around job 54265158

Description Jurij Pečar 2022-07-18 02:03:38 MDT
We have nodes of this kind:

NodeName=sb03-[05-20] CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515590 MemSpecLimit=1024 TmpDisk=377000 Feature=HT,cpu2.5GHz,avx2,net25G,gpu=A100,rome,znver2,ssd Gres=gpu:A100:1,tmp:377G

They are part of this partition:

PartitionName=gpu-el8 Nodes=sb03-[05-20],gpu[25-28] Default=NO MaxTime=7-0 State=Up

sinfo -Nl shows them as:
sb03-05         1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-06         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-07         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-08         1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-09         1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-10         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-11         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-12         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-13         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-14         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-15         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-16         1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-17         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-18         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-19         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-20         1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none

You can see that only four are up and running jobs; the others are powered off.

However, we have a job array with jobs like the following sitting in the queue, only getting scheduled to the four active nodes and not bringing the others up:

JobId=44704138 ArrayJobId=44690533 ArrayTaskId=4020 JobName=array
   UserId=kosinski(21549) GroupId=kosinski(721) MCS_label=N/A
   Priority=3613 Nice=0 Account=kosinski QOS=low
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=06:00:00 TimeMin=N/A
   SubmitTime=2022-07-17T00:06:30 EligibleTime=2022-07-17T00:08:31
   AccrueTime=2022-07-17T00:08:31
   StartTime=2022-07-24T18:48:00 EndTime=2022-07-25T00:48:00 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-18T09:44:33 Scheduler=Main
   Partition=gpu-el8 AllocNode:Sid=login01:2728919
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=sb03-08
   BatchHost=sb03-08
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=125G,node=1,billing=16,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=125G MinTmpDiskNode=0
   Features=gpu=A100 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/g/kosinski/kosinski/Giardia/AlphaFold/run_AF_monomer.sh
   WorkDir=/g/kosinski/kosinski/Giardia/AlphaFold
   StdErr=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44690533_4020_err.txt
   StdIn=/dev/null
   StdOut=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44690533_4020_out.txt
   Power=
   TresPerNode=gres:gpu:1

(The other higher-priority job pending with reason Resources at the time of writing has Features=gpu=A40, which matches other GPU nodes in this partition, so it should not affect the above job array.)

If I check one of the active nodes, I see:
# scontrol show node sb03-05
NodeName=sb03-05 Arch=x86_64 CoresPerSocket=32 
   CPUAlloc=32 CPUTot=64 CPULoad=0.80
   AvailableFeatures=HT,cpu2.5GHz,avx2,net25G,gpu=A100,rome,znver2,ssd
   ActiveFeatures=HT,cpu2.5GHz,avx2,net25G,gpu=A100,rome,znver2,ssd
   Gres=gpu:A100:1,tmp:377G
   NodeAddr=sb03-05 NodeHostName=sb03-05 Version=21.08.8
   OS=Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Tue Nov 2 10:32:37 UTC 2021 
   RealMemory=515590 AllocMem=512000 FreeMem=482181 Sockets=1 Boards=1
   MemSpecLimit=1024
   State=MIXED ThreadsPerCore=2 TmpDisk=377000 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu-el8 
   BootTime=2022-07-16T21:19:00 SlurmdStartTime=2022-07-16T21:19:20
   LastBusyTime=2022-07-16T21:24:24
   CfgTRES=cpu=64,mem=515590M,billing=64,gres/gpu=1
   AllocTRES=cpu=32,mem=500G,gres/gpu=1
   CapWatts=n/a
   CurrentWatts=n/s AveWatts=n/s
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

And one inactive node shows:
# scontrol show node sb03-06
NodeName=sb03-06 Arch=x86_64 CoresPerSocket=32 
   CPUAlloc=0 CPUTot=64 CPULoad=0.00
   AvailableFeatures=HT,cpu2.5GHz,avx2,net25G,gpu=A100,rome,znver2,ssd
   ActiveFeatures=HT,cpu2.5GHz,avx2,net25G,gpu=A100,rome,znver2,ssd
   Gres=gpu:A100:1,tmp:377G
   NodeAddr=sb03-06 NodeHostName=sb03-06 Version=21.08.8
   OS=Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Tue Nov 2 10:32:37 UTC 2021 
   RealMemory=515590 AllocMem=0 FreeMem=511024 Sockets=1 Boards=1
   MemSpecLimit=1024
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=377000 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu-el8 
   BootTime=2022-07-17T01:02:45 SlurmdStartTime=2022-07-17T01:03:05
   LastBusyTime=Unknown
   CfgTRES=cpu=64,mem=515590M,billing=64,gres/gpu=1
   AllocTRES=gres/gpu=1
   CapWatts=n/a
   CurrentWatts=n/s AveWatts=n/s
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I would expect pending jobs that match these specific A100-equipped nodes to cause Slurm to boot them up and schedule jobs on them. However, they stay idle even if I manually bring them up with "scontrol update nodename=sb03-06 state=power_up". It feels like Slurm is not matching these nodes to the pending jobs.

The only line that sticks out to me as wrong in the "scontrol show node" output above is AllocTRES=gres/gpu=1 on a powered-down node. Does this mean that Slurm thinks the GPU is occupied even when the node is powered down? Could this be the reason why it doesn't see these nodes as fit to run these jobs? How can I clear this?

It is always the same four nodes of this kind that get jobs scheduled, which leads me to think that something at the node level on the other 12 nodes makes Slurm see them as unfit for these jobs.

I've been seeing this kind of behavior since we upgraded to 21.08.8 (from 20.11.7). My initial impression was that a restart of slurmctld got things back to the expected state (nodes got powered up, pending jobs got scheduled on them), but lately this is no longer sufficient.

Let me know what further info you need to help me figure this out, and how I can collect it.

Thanks for the help.
Comment 1 Marcin Stolarek 2022-07-19 01:10:59 MDT
Jurij,

At a glance, the job is pending because of "Priority", which strictly means that there are higher-priority jobs waiting for resources. A simple experiment to check that would be to update the job priority to a very high value, like:

>scontrol update job=44704138 priority=99999999

It may be that some configuration tuning is required to make the backfill scheduler reach that job.

Please check the command above and let us know how that works. Attach slurm.conf and the sdiag command results and we'll continue debugging the case.

cheers,
Marcin
Comment 2 Jurij Pečar 2022-07-19 02:08:39 MDT
Created attachment 25908 [details]
sdiag output
Comment 3 Jurij Pečar 2022-07-19 02:09:42 MDT
Created attachment 25909 [details]
current slurm.conf
Comment 4 Jurij Pečar 2022-07-19 02:11:56 MDT
About 30 minutes ago I bumped the priority on another array job of this kind to bring it to the front of the queue:

# sq -t pd -p gpu-el8
JOBID                     NAME               ST  QOS    PARTITION PRIORITY   NODES TIME_LIMIT  NODELIST(REASON) 
44951894_[3066-3499]      array              PD  normal gpu-el8    4100       1     6:00:00    (Resources) 
44690533_[4020-4021,4024- array              PD  low    gpu-el8    4058       1     6:00:00    (Priority) 
44843535                  aa_SNX33_HUMAN.job PD  normal gpu-el8    3659       1     7-00:00:00 (Priority) 
44963488                  220625.job         PD  normal gpu-el8    3399       1     7-00:00:00 (Priority) 
45016847                  array              PD  normal gpu-el8    3362       1     2-00:00:00 (Priority)

And jobs in this array look like this:

# scontrol show job 44951894_3066
JobId=44951894 ArrayJobId=44951894 ArrayTaskId=3066 JobName=array
   UserId=kosinski(21549) GroupId=kosinski(721) MCS_label=N/A
   Priority=4100 Nice=0 Account=kosinski QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=06:00:00 TimeMin=N/A
   SubmitTime=2022-07-18T17:03:02 EligibleTime=2022-07-18T17:03:05
   AccrueTime=2022-07-18T17:03:05
   StartTime=2022-07-21T09:20:00 EndTime=2022-07-21T15:20:00 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-19T09:59:48 Scheduler=Main
   Partition=gpu-el8 AllocNode:Sid=login01:2082155
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=sb03-16
   NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=125G,node=1,billing=16,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=125G MinTmpDiskNode=0
   Features=gpu=A100 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/g/kosinski/kosinski/Giardia/AlphaFold/run_AF_monomer.sh
   WorkDir=/g/kosinski/kosinski/Giardia/AlphaFold
   StdErr=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3066_err.txt
   StdIn=/dev/null
   StdOut=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3066_out.txt
   Power=
   TresPerNode=gres:gpu:1

However our A100 nodes remain powered off:

# sinfo -Nlp gpu-el8
Tue Jul 19 10:04:30 2022
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
gpu25          1   gpu-el8   allocated 224    2:56:2 103163   126185      1 HT,cpu2. none                
gpu26          1   gpu-el8   draining@ 224    2:56:2 103163   126185      1 HT,cpu2. Reboot ASAP         
gpu27          1   gpu-el8   draining@ 224    2:56:2 103163   126185      1 HT,cpu2. Reboot ASAP         
gpu28          1   gpu-el8   draining@ 224    2:56:2 103163   126185      1 HT,cpu2. Reboot ASAP         
sb03-05        1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-06        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-07        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-08        1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-09        1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-10        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-11        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-12        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-13        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-14        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-15        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-16        1   gpu-el8       mixed 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-17        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-18        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-19        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none                
sb03-20        1   gpu-el8       idle~ 64     1:32:2 515590   377000      1 HT,cpu2. none

Sdiag output and slurm.conf attached.
Comment 5 Marcin Stolarek 2022-07-19 02:39:34 MDT
Thanks for the scontrol show job output; it shows that the job was recently checked by the main scheduler (so it's definitely not a priority/backfill issue).

Could you please share your slurmctld.log and scheduler.log?

cheers,
Marcin
Comment 6 Jurij Pečar 2022-07-19 02:43:02 MDT
Created attachment 25910 [details]
slurmctld.log and scheduler.log
Comment 7 Jurij Pečar 2022-07-19 02:43:41 MDT
Attached. I assume you'll want to see increased verbosity?
Comment 9 Marcin Stolarek 2022-07-19 03:54:46 MDT
Could you please share:
scontrol show job -d --oneline ?

cheers,
Marcin
Comment 10 Jurij Pečar 2022-07-19 03:58:09 MDT
# scontrol show job -d --oneline 44951894_3078
JobId=44951894 ArrayJobId=44951894 ArrayTaskId=3078 JobName=array UserId=kosinski(21549) GroupId=kosinski(721) MCS_label=N/A Priority=4100 Nice=0 Account=kosinski QOS=normal JobState=PENDING Reason=Resources Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0 RunTime=00:00:00 TimeLimit=06:00:00 TimeMin=N/A SubmitTime=2022-07-18T17:03:02 EligibleTime=2022-07-18T17:03:05 AccrueTime=2022-07-18T17:03:05 StartTime=2022-07-21T11:28:00 EndTime=2022-07-21T17:28:00 Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-19T11:57:06 Scheduler=Main Partition=gpu-el8 AllocNode:Sid=login01:2082155 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) SchedNodeList=sb03-05 NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=16,mem=125G,node=1,billing=16,gres/gpu=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=125G MinTmpDiskNode=0 Features=gpu=A100 DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/g/kosinski/kosinski/Giardia/AlphaFold/run_AF_monomer.sh WorkDir=/g/kosinski/kosinski/Giardia/AlphaFold StdErr=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3078_err.txt StdIn=/dev/null StdOut=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3078_out.txt Power= TresPerNode=gres:gpu:1
Comment 11 Jurij Pečar 2022-07-19 04:00:13 MDT
Created attachment 25911 [details]
all jobs
Comment 12 Marcin Stolarek 2022-07-19 04:05:50 MDT
Can you share the info for all jobs, i.e. the previous scontrol command without a JobId? I'd like to check whether those nodes are reserved by backfill for a higher-priority job in the htc-el8 partition. Am I correct that those nodes are shared between those partitions? At the same time, the gpu-el8 partition doesn't set PriorityJobFactor=3000, which leaves it at the default of 1, so its jobs very likely rank lower compared to that partition.

>I assume you'll want to see increased verbosity?

If you can handle higher verbosity for some time (based on the logs, ~30 minutes should be enough), please switch to:
SlurmctldDebug=debug2
DebugFlags=NO_CONF_HASH,SelectType,TraceJobs

You can enable/disable both without a slurmctld restart using scontrol.

cheers,
Marcin
Comment 13 Jurij Pečar 2022-07-19 05:27:19 MDT
Created attachment 25912 [details]
debug logs (over 5GB)
Comment 14 Jurij Pečar 2022-07-19 05:32:03 MDT
See "all jobs" attachment above. 
No, these nodes are dedicated to the gpu-el8 partition and are not shared anywhere. Am I then correct in thinking that PriorityJobFactor is not needed on this partition?

I've now gathered ~30min of debug logs and uploaded them. Let me know if you need anything else to figure this out.
Comment 15 Marcin Stolarek 2022-07-19 05:46:24 MDT
>See "all jobs" attachment above. 

Thanks for that.

>No, these nodes are dedicated to gpu-el8 partition and are not shared anywhere. Am I then correct in thinking that PriorityJobFactor on this partition is not needed?
Yes, you are correct. I mixed up the two ranges in the sb node names. PriorityJobFactor may still have some impact, but it shouldn't be major; for instance, a lower-priority job may not be checked by backfill when a "global" limit like bf_max_job_test is reached.

>Let me know if you need anything else to figure this out.
Could you please share topology.conf and `scontrol show res`? I remember you noticed this started after the upgrade, but I still want to check the basics just to be sure.

cheers,
Marcin
Comment 16 Jurij Pečar 2022-07-19 05:53:52 MDT
I have two reservations in place for August and September to finish redeploying nodes to el8. However, the nodes in question (sb03-[05-20]) are already on el8.

# scontrol show res
ReservationName=el8move_step2 StartTime=2022-08-08T12:00:00 EndTime=2022-08-12T18:00:00 Duration=4-06:00:00
   Nodes=gpu[8-14,21-22],sm-epyc-[01-03],smer01-[1-4],smer02-[1-4],smer03-[1-4],smer04-[1-4],smer05-[1-4],smer06-[1-4],smer07-[1-4],smer08-[1-4],smer09-[1-4],smer10-[1-4],smer11-[1-4],smer12-[1-4],smer13-[1-4],smer14-[1-4],smer15-[1-4] NodeCnt=72 CoreCnt=4400 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES
   TRES=cpu=8800
   Users=pecar Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

ReservationName=el8move_step3 StartTime=2022-09-05T12:00:00 EndTime=2022-09-09T18:00:00 Duration=4-06:00:00
   Nodes=gpu[15-20,23-24],sm-epyc-[04-05],smer16-[1-4],smer17-[1-4],smer18-[1-4],smer19-[1-4],smer20-[1-4],smer21-[1-4],smer22-[1-4],smer23-[1-4],smer24-[1-4],smer25-[1-4],smer26-[1-4],smer27-[1-4],smer28-[1-4],smer29-[1-4],smer30-[1-4] NodeCnt=70 CoreCnt=4288 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES
   TRES=cpu=8576
   Users=pecar Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

And our topology.conf is reasonably simple:
SwitchName=spine Switches=leaf[1-5],sb[03-05],other
SwitchName=leaf1 Nodes=smer[01-20]-[1-4],login01
SwitchName=leaf2 Nodes=smer[21-30]-[1-4],login02
SwitchName=leaf3 Nodes=sm-epyc-0[1-5],gpu[27-28]
SwitchName=leaf4 Nodes=gpu[21-24],smem[01-10]-[1-4]
SwitchName=leaf5 Nodes=gpu[8-20,25-26]
SwitchName=sb03 Nodes=sb03-[01-20]
SwitchName=sb04 Nodes=sb04-[01-20]
SwitchName=sb05 Nodes=sb05-[01-20]
SwitchName=other Nodes=bn[01-04]
Comment 18 Marcin Stolarek 2022-07-19 06:31:03 MDT
Are you running a federated setup?

cheers,
Marcin
Comment 19 Jurij Pečar 2022-07-19 06:48:34 MDT
No, not that I'm aware of.
Comment 20 Jurij Pečar 2022-07-19 06:53:31 MDT
News: I see that all our A100 nodes are now up and chewing on jobs.

Checking their uptime, it looks like this happened when I switched the debugging level and debugflags, which matches my initial observation that a slurmctld restart fixed things in the past.

Hopefully the debug logs capture why the jobs were being held back initially...
Comment 21 Marcin Stolarek 2022-07-20 03:47:04 MDT
Jurij,

Can you try to reproduce the issue/collect logs with "Power" DebugFlags (and at least verbose log level) enabled?

I'm still trying to reproduce it while looking at the logs and code.

cheers,
Marcin
Comment 22 Jurij Pečar 2022-07-20 04:00:31 MDT
OK, I had a gut feeling that this is somehow related to powering idle nodes off, but I didn't manage to test that hypothesis.
As you can see, I'll need to wait for the current set of jobs to finish and then wait for the issue to reoccur. It could be days or weeks, but I'll definitely come back to you with additional debug logs.
Comment 23 Jurij Pečar 2022-07-21 10:46:50 MDT
Created attachment 25965 [details]
debug logs with Power
Comment 24 Jurij Pečar 2022-07-21 10:56:03 MDT
Here's the next set of logs (and a list of jobs), hopefully catching the problem, at debug level (over 11GB).

I made three attempts at recreating it. I set the partition down and up to allow power saving to shut down the nodes in between.
The first attempt was around noon, where I think I got the debugflags command wrong, but the nodes came up immediately.

The second one was around 14:20, where the nodes also came up as expected. Here I set the partition state to up and then set the debugflags.

The third one began at about 16:53, when I set the debugflags first and only then (at about 16:55) set the partition state to up. Nothing much happened on the powered-down nodes for the next 15 minutes. There was one other job pending with Resources (45338245) waiting for an A40 node, so I added Backfill and later BackfillMap to the debugflags as well. Then, shortly before 17:30, that job got scheduled, and then the A100 nodes were powered up and started chewing on jobs (45220650).

Hopefully that shines more light on why this is happening.
Comment 25 Marcin Stolarek 2022-07-25 06:03:09 MDT
Jurij,

I'm still looking at the logs and code to fully understand the flow; however, there are two things I wanted to point out:

1) Around 14:25 there are multiple 'POWER: resume rate reached' messages, which means that automatic power-on of nodes will be delayed because of the configured ResumeRate. You have it set to only 20, which is quite low. I think you may want to recheck the reason for that.

2) There are quite long time spans when you have a lot of 'Note large processing time...' messages, which is a symptom of overload. This may be a reason for jobs not being evaluated on time by the scheduler, or for other periodic activities (like node power-up) being delayed.

Those are most often:
>7214 _slurm_rpc_dump_job_user:
>5922 _slurm_rpc_dump_job_single:
>4912 _slurm_rpc_complete_batch_script:
>3949 _slurm_rpc_dump_jobs:
>3872 _slurm_rpc_dump_nodes:
>3342 _slurm_rpc_complete_prolog:

dump job/node RPCs are the result of scontrol show/squeue queries. In this case I'd recommend checking which users are the top query makers (in sdiag) and educating them to avoid running commands like squeue in a loop, since that can overload slurmctld.
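As a back-of-the-envelope illustration of point 1, the ResumeRate throttle staggers a large wake-up across several minutes. This is a sketch of the arithmetic only; the per-minute batching model is an assumption for illustration, and only the nodes-per-minute unit comes from the ResumeRate description:

```python
import math

def resume_delay_minutes(nodes_to_boot: int, resume_rate: int) -> int:
    """Minutes until the power-up of the last node is issued, assuming
    resume_rate nodes may be resumed per minute (0 means unlimited)."""
    if resume_rate == 0 or nodes_to_boot == 0:
        return 0
    return math.ceil(nodes_to_boot / resume_rate) - 1

# All 16 A100 nodes fit in the first window at ResumeRate=20,
# but a hypothetical cluster-wide wake-up of 100 nodes is staggered
# over roughly 4 extra minutes.
```

So ResumeRate=20 alone doesn't explain the 16 sb03 nodes staying down, but it would visibly delay larger wake-ups.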

cheers,
Marcin
Comment 26 Jurij Pečar 2022-07-26 01:21:31 MDT
1) I set this to a ballpark figure of 10% of the cluster's nodes, the idea being to prevent huge spikes on the electrical side and instead ensure a more staggered load increase. I can increase it for a test.

2) We have a bunch of scripts that generate a login greeting with a fresh overview of the cluster state, as well as feed Ganglia with per-group (=Account) CPU, memory, and GPU allocations. Both run every minute. So far I haven't seen any negative consequences from them. Do you think what we're seeing here could be a consequence of these?

In this case, what could be done to increase slurmctld performance? 

I can try running these scripts every other minute, say, and see if that changes anything.
Comment 27 Marcin Stolarek 2022-07-26 03:44:09 MDT
I'm still not 100% sure about the behavior you described in the initial comments, but slurmctld may definitely be underperforming from time to time due to overload.

For instance, log entries like the following (305 occurrences in the shared log):
>[2022-07-21T14:19:04.772] sched/backfill: _yield_locks: continuing to yield locks, 40 RPCs pending
mean that backfill doesn't resume job scheduling after yielding locks, since there is a high number of waiting RPCs.

There are also time spans, like [2022-07-21T16:53:55.062] to [2022-07-21T17:22:07.055], when the "[...]very large processing time[...]" message appears really often, which very likely results in poor performance.

Based on the sdiag result you shared before, there are users making quite a lot of RPCs. Are those related to your monitoring?
>Remote Procedure Call statistics by user
>	root            (       0) count:7580184 ave_time:84217  total_time:638387759486
>	marotta         (   24809) count:393859 ave_time:33055   total_time:13019222339
>	pekel           (   25255) count:95579  ave_time:311218  total_time:29745976681
>       tweber          (   25221) count:92908  ave_time:6326    total_time:587771022
>	galaxy          (   20239) count:52673  ave_time:8432    total_time:444161784
>	zzhao           (   24718) count:52372  ave_time:9596    total_time:502613792
>	jendrusc        (   24710) count:13712  ave_time:14607   total_time:200295040
>	fabreges        (   22758) count:12835  ave_time:2485    total_time:31900771

Looking at the total_time for the top users, and keeping in mind that they are very likely generating this load in a short time, I'd say you should check their activity. You can use `sdiag -r` to reset the counters and then run sdiag again to get just the recent info.
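To spot the top query makers quickly, the per-user block quoted above can be ranked with a short script. This is a hypothetical helper, not part of Slurm; it only assumes the `total_time:` field layout shown in the sdiag excerpt above:

```python
def top_rpc_users(sdiag_text, n=3):
    """Return the n users with the highest total_time as (name, total) pairs,
    parsed from the "statistics by user" section of sdiag text output."""
    users = []
    in_block = False
    for line in sdiag_text.splitlines():
        if "statistics by user" in line:
            in_block = True
            continue
        if not in_block:
            continue
        parts = line.split()
        if parts and parts[-1].startswith("total_time:"):
            # expected shape: name (uid) count:N ave_time:N total_time:N
            users.append((parts[0], int(parts[-1].split(":", 1)[1])))
        elif not parts:
            break  # blank line ends the section
    return sorted(users, key=lambda u: u[1], reverse=True)[:n]
```

Feeding it sdiag output captured shortly after an `sdiag -r` reset would rank only the recent load.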

cheers,
Marcin
Comment 28 Jurij Pečar 2022-07-26 05:37:03 MDT
The majority of those RPCs coming from users are generated by Snakemake, Nextflow, and other tools of that sort, so I'm not sure how much I can decrease those numbers, as they're required for the tools' operation to some degree.
However, I can spend some time looking at our monitoring scripts and figure out whether there's something I can do there to polish things up. That "scontrol show job -d --oneline" from above will certainly come in handy...
Comment 29 Jurij Pečar 2022-07-26 11:28:45 MDT
OK, I've managed to decrease the number of RPCs from root by about 60%.

At which logging level do these "large processing time" log messages appear? I'll keep an eye on them ...
Comment 30 Marcin Stolarek 2022-07-27 01:58:49 MDT
I'm continuing to look into the logs with a focus on the original issue. However, those tools are outside of SchedMD's expertise; my professional experience is that Snakemake doesn't encourage users to use scontrol (it prefers sacct where possible), and its main repo even suggests not configuring a monitoring script if jobs don't fail often. Using sacct is probably still the better idea since it doesn't query slurmctld. It should even be quite easy to modify the scripts responsible to cache their results for a specified number of seconds (for instance in a temp file) if this happens to be a major issue.
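The temp-file caching idea can be sketched as follows. This is an illustrative helper (names and file layout are my own, not from any existing monitoring script); the expensive query, e.g. an sacct or squeue call returning JSON-serializable data, would be passed in as `fetch`:

```python
import json
import os
import tempfile
import time

def cached(fetch, key, ttl=60):
    """Return fetch()'s result, reusing a temp-file copy younger than ttl seconds."""
    path = os.path.join(tempfile.gettempdir(), f"slurm-cache-{key}.json")
    try:
        if time.time() - os.path.getmtime(path) < ttl:
            with open(path) as f:
                return json.load(f)  # fresh enough: skip the expensive call
    except OSError:
        pass  # no cache file yet, or it vanished; fall through to fetch
    data = fetch()  # the expensive query
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

With a ttl of ~60 s, per-minute greeting and Ganglia scripts would collapse to at most one slurmctld query per interval regardless of how many logins run them.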


>At which logging level do these "large processing time" log messages appear? I'll keep an eye on them ...
Those are displayed at debug level. I don't think you need to focus on them now. We also have a higher threshold with the message 'very large processing time' displayed at verbose level.

In general my advice would be to look at sdiag, or even keep a differential plot of sdiag results per user/RPC type in a monitoring system, so you can match spikes there with other issues. Just the fact that a specific RPC takes a long time doesn't mean it's the cause of an issue; it may be that it is delayed by other queries. For instance, thousands of job-complete RPCs coming in at once are likely to slow the response to `squeue`.
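The differential plot boils down to subtracting successive cumulative counter snapshots. A minimal sketch, using hypothetical `{user: count}` dicts as might be parsed from two consecutive sdiag runs:

```python
def rpc_delta(prev, curr):
    """Per-interval RPC counts from two cumulative snapshots.

    prev and curr are {user: cumulative_count} dicts; users absent
    from prev (first seen this interval) count from zero."""
    return {user: count - prev.get(user, 0) for user, count in curr.items()}

# Plotting rpc_delta(...) per interval highlights bursts that a
# monotonically growing cumulative counter would hide.
```

Note this simple subtraction assumes the counters were not reset with `sdiag -r` between the two snapshots.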

>Ok, I've managed to decrease number of RPCs from root by about 60%. 
Sounds good.

cheers,
Marcin
Comment 31 Marcin Stolarek 2022-07-28 05:48:19 MDT
Jurij,

Another thing you may want to validate in your slurm.conf is:
>CompleteWait=300 

Per man slurm.conf:
>CompleteWait
>The time to wait, in seconds, when any job is in the COMPLETING state before any additional jobs are scheduled. This is to
>attempt to keep jobs on nodes that were recently in use, with the goal of preventing fragmentation.  If set to zero, pend‐
>ing jobs will be started as soon as possible. Since a COMPLETING job's resources are released for use by other jobs as
>soon as the Epilog completes on each individual node, this can result in very fragmented resource allocations.  To provide
>jobs with the minimum response time, a value of zero is recommended (no waiting).  To minimize fragmentation of resources,
>a value equal to KillWait plus two is recommended.  In that case, setting KillWait to a small value may be beneficial.
>The default value of CompleteWait is zero seconds.  The value may not exceed 65533.

So if you have any job in the CG (completing) state, Slurm waits 300 s before scheduling additional jobs.

cheers,
Marcin
Comment 32 Jurij Pečar 2022-07-28 08:30:19 MDT
That's intentional.
Ideally I would prefer that setting to be per-node, as we have fat nodes with 2T of memory and 9.4T of swap, and sometimes it takes minutes for Slurm to clean up after big jobs. Still, with this set to 5 minutes we get "Kill task failed" often enough that I've set up an hourly cron job to resume nodes in this state.

Our regular nodes on el7 had TMPDIR pointing to node-local /tmp, implemented as tmpfs backed by SSD. Now on el8 I'm just pointing TMPDIR to a per-job folder on scratch, so I'll revisit this setting once we're done moving to el8 completely.
Comment 33 Marcin Stolarek 2022-07-29 07:18:48 MDT
>Ideally I would prefer that setting to be per-node[..]
Could you please elaborate on that a little? As long as a node is in the completing state (which means that at least one job on the node is in the completing state), no new jobs are scheduled to that node. Setting CompleteWait means that no jobs are scheduled on any node for up to CompleteWait seconds while at least one job remains in CG.

cheers,
Marcin
Comment 34 Jurij Pečar 2022-08-08 00:37:56 MDT
We hit this situation again over the weekend.

Again there was a higher-priority job submitted to gpu,gpu-el8 asking for a gres of any two GPUs (our A100 nodes have only one), and an array of jobs each asking for one A100 GPU. I modified the higher-priority job to sit only in partition gpu so it would not interfere with the job array destined for the A100s, and waited... nothing happened for about an hour. Then I restarted slurmctld, and within minutes all A100 nodes were up and chewing on jobs.
That happened in the evening. Funnily enough, in the morning the array was not yet finished, but I found only two (of 16) A100 nodes up and busy while the others were idle~. Nothing much had changed in the queue; this array job remained at the top through the night. It would be interesting to understand why slurmctld decided not to schedule pending jobs to idle nodes for so long that they went into the powered-down state...
I've also checked for jobs in the CG state and haven't seen any.
Comment 35 Marcin Stolarek 2022-08-08 01:44:49 MDT
Jurij,

Am I correct that the waiting jobs you've noticed are always job arrays? I didn't focus on that before, but maybe it's the key to reproducing/understanding the issue.

cheers,
Marcin
Comment 36 Jurij Pečar 2022-08-08 03:25:21 MDT
Good observation. 

Since we still have a couple of jobs from that array pending and just two A100 nodes up to chew on them, I quickly submitted one interactive job asking for an A100, got resources allocated within seconds, and got a shell once a node came up.

Now I'm looking at it sitting idle; Slurm is not giving it jobs from that array. So it might very well be related to array handling.
Comment 37 Marcin Stolarek 2022-08-08 05:50:34 MDT
>Now I'm looking at it sitting idle, slurm is not giving it jobs from that array. So it might very well be related to array handling.

Could you please share `scontrol show node` and `scontrol show job` for the current state?

I can't find any evidence of things working incorrectly in the debug log. It looks like the jobs are checked by the backfill scheduler and their start time is estimated; however, that doesn't match what you see, since there are no other higher priority jobs waiting for those nodes and they just stay idle.

Did you notice those jobs being in CF/configuring state?
Comment 38 Jurij Pečar 2022-08-08 05:57:03 MDT
Created attachment 26198 [details]
current node & jobs state
Comment 39 Jurij Pečar 2022-08-08 06:00:01 MDT
Attached. Array 46665271_[86-90] is the one in question.

No, these array jobs stay pending, they don't enter configuring state.

Hm, at least not that I'm aware of; I have "-t pd" almost hardwired into my muscle memory for squeue. Will pay attention to that.
Comment 40 Marcin Stolarek 2022-08-08 06:11:22 MDT
I see one very interesting thing there, for instance on sb03-06:

>NodeName=sb03-06 Arch=x86_64 CoresPerSocket=32  [...] State=IDLE+POWERED_DOWN [...]CfgTRES=cpu=64,mem=515590M,billing=64,gres/gpu=1 AllocTRES=gres/gpu=1 [..]
>                                                                                                                                    ^^^^^^^^^^^^^^^^^^^^^

which is obviously wrong - a powered down node shouldn't have resources in the allocated state. Could you please check if calling scontrol reconfigure (or restarting slurmctld) clears that counter?
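A quick way to spot this inconsistency across the whole cluster would be to scan the one-record-per-line output of `scontrol show node -o`. A sketch (the field names match the scontrol output quoted above; the helper name is mine):

```python
import re

def find_stuck_nodes(scontrol_output: str) -> list[str]:
    """Return names of nodes that are POWERED_DOWN yet still hold AllocTRES.

    Expects the one-record-per-line output of `scontrol show node -o`.
    """
    stuck = []
    for line in scontrol_output.splitlines():
        name = re.search(r"NodeName=(\S+)", line)
        state = re.search(r"State=(\S+)", line)
        alloc = re.search(r"AllocTRES=(\S*)", line)
        if not (name and state):
            continue
        powered_down = "POWERED_DOWN" in state.group(1)
        has_alloc = bool(alloc and alloc.group(1))
        if powered_down and has_alloc:
            stuck.append(name.group(1))
    return stuck
```

Feeding it the output of `scontrol show node -o` from cron would flag nodes in this broken state so the reconfigure workaround can be applied before jobs pile up.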

cheers,
Marcin
Comment 41 Jurij Pečar 2022-08-08 06:16:35 MDT
That's one thing I noticed in my initial post. Any idea how this can happen?

And yes, scontrol reconfigure clears it.
Comment 42 Marcin Stolarek 2022-08-08 06:29:10 MDT
>That's one thing I noticed in my initial post.
My bad, I lost that piece looking at the code and logs. Could you please share your gres.conf?
Comment 43 Jurij Pečar 2022-08-08 06:33:37 MDT
NodeName=gpu8 Name=gpu Type=2080Ti File=/dev/nvidia[0-3] COREs=[0-15]
NodeName=gpu8 Name=gpu Type=2080Ti File=/dev/nvidia[4-7] COREs=[16-31]
NodeName=gpu9 Name=gpu Type=2080Ti File=/dev/nvidia[0-3] COREs=[0-15]
NodeName=gpu9 Name=gpu Type=2080Ti File=/dev/nvidia[4-7] COREs=[16-31]
NodeName=gpu10 Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=gpu[11-20] Name=gpu Type=2080Ti File=/dev/nvidia[0-3]
NodeName=sb03-[05-20] Name=gpu Type=A100 File=/dev/nvidia0 
NodeName=gpu[21-24] Name=gpu Type=3090 File=/dev/nvidia[0-3] COREs=[0-31]
NodeName=gpu[21-24] Name=gpu Type=3090 File=/dev/nvidia[4-7] COREs=[32-63]
NodeName=gpu[25-28] Name=gpu Type=A40 File=/dev/nvidia[0-3] COREs=[0-55,112-167]
NodeName=gpu[25-28] Name=gpu Type=A40 File=/dev/nvidia[4-7] COREs=[56-111,168-223]

#TMPDIR as gres
NodeName=gpu[8-9] Name=tmp Count=3100G
NodeName=gpu10 Name=tmp Count=456G
NodeName=gpu[11-20] Name=tmp Count=467G
NodeName=gpu[21-24] Name=tmp Count=435G
NodeName=gpu[25-28] Name=tmp Count=124G
NodeName=sb03-[01-20] Name=tmp Count=467G
NodeName=sb[01-02]-[01-20] Name=tmp Count=844G
NodeName=sm-epyc-01 Name=tmp Count=7571G
NodeName=sm-epyc-0[2-5] Name=tmp Count=9400282M
NodeName=smer[01-20]-[1-4] Name=tmp Count=203G
Comment 44 Marcin Stolarek 2022-08-08 07:09:59 MDT
Do you update nodes in your Resume/Suspend programs?

cheers,
Marcin
Comment 45 Jurij Pečar 2022-08-08 07:17:51 MDT
No, those are just wrappers for ipmitool.

The closest thing to what you're asking is that we set state=resume in rc.local on el7 nodes, so that when they come up for whatever reason they're presented as ready. I remember we had some issues around that some years ago; that's why I added it to rc.local.
I'm moving away from this on el8, on the assumption that the motivation for that line is long gone.
Comment 46 Marcin Stolarek 2022-08-09 02:56:18 MDT
Jurij,

I'm playing with a very similar config, trying to trigger a failure in the node GRES counter deallocation, so far without success. Could you please check the last job that ran on one of those nodes and see if there is anything unusual about how it ended?
The way it ends should give us a hint about the potential code path that leads to the GRES counter not being decremented.

I'm looking at the slurmctld log you shared, trying to find those cases, but if you're able to find something interactively that may be helpful too.

cheers,
Marcin
Comment 47 Jurij Pečar 2022-08-09 04:21:00 MDT
Ok, will pay attention to that next time this situation occurs again.
Any suggestions which debug flags might be helpful?
Comment 48 Marcin Stolarek 2022-08-09 04:52:02 MDT
>Any suggestions which debug flags might be helpful?
Unfortunately there are no debug messages in that part of the code. Are any of the options below possible:
a) users use the job shrinking feature?
b) there are still jobs in the queue that were submitted prior to the upgrade?
c) gres.conf was changed - specifically, File=... was added (recently or some time ago)?

Is the issue only visible on A100 nodes?

cheers,
Marcin
Comment 50 Marcin Stolarek 2022-08-09 09:14:22 MDT
Jurij,

I found a way to reproduce the issue, but the scenario requires an inconsistent, unstable cluster configuration across the nodes.

I've noticed you have:
>DebugFlags=NO_CONF_HASH

Can you disable the flag and check if all nodes (slurmd & slurmctld) run the same config? Please take a look at the questions from comment 48 too.

cheers,
Marcin
Comment 51 Jurij Pečar 2022-08-09 09:32:26 MDT
Ah, another historic artifact. We manage our nodes (including the slurm config) with puppet and we recently moved to configless, so yes, all our nodes should have the same config.
The only exception is our fat nodes, see bug #13503.

As for questions in #48:
a) highly unlikely, will ask
b) don't really understand the question, can you rephrase it?
c) yes, values in gres.conf change every now and then - I'm in the middle of migrating from el7 to el8. File= for gpus usually only changes when I define new nodes.

Great that you managed to reproduce it; hope that leads to a patch soon :)
Otherwise, wouldn't it be good enough to just clear any AllocTRES when power_down is triggered?
Comment 52 Jurij Pečar 2022-08-09 10:17:50 MDT
As for whether this is limited to A100 nodes only, I can't really say. It could very well be that it's just most obvious on this small subset of nodes, especially now that our users are super eager about anything alphafold. If it manifested elsewhere before, it wasn't as obvious and users were more patient waiting ...
Comment 53 Marcin Stolarek 2022-08-10 04:53:48 MDT
>b) don't really understand the question, can you rephrase it?
You've mentioned that this started after the upgrade. Was that an online upgrade with jobs in the queue? Is it possible that jobs submitted before the upgrade are only starting now?

>Otherwise, wouldn't it be good enough to just clear any AllocTRES when power_down is triggered?
Not really; it looks like the node isn't correctly deallocated in this scenario, so even when there are jobs that should just start there, the node stays idle and is powered down, when really it shouldn't be powered down at all.

The issue with my reproducer is that I have to manipulate gres.conf on the node while jobs are in the CF/R state, which is generally an unexpected situation. Are you 100% sure that there are no leftover gres.conf files - a puppet agent putting an old gres.conf on a node from time to time, etc.?
Are the nodes started with the --conf-server option, or do they use an SRV record to point at the configuration server?

cheers,
Marcin
Comment 54 Jurij Pečar 2022-08-15 00:14:43 MDT
b) Not really, as the upgrade happened on May 4th, which is much further back in time than our max job duration.

For configless we use SRV dns records.

And I'm reasonably confident that gres.conf is never dropped onto the A100 nodes by puppet.
Our puppet only places slurm.conf, topology.conf, cgroup.conf and gres.conf onto the slurm controllers, fat nodes and login nodes. The last change to any of these files was on Aug 9 16:36, yet this morning I again found only three A100 nodes up and a couple of jobs pending, waiting for an A100 to become available.

Btw, users confirmed that this also happens for regular, non-array jobs.
Comment 55 Marcin Stolarek 2022-08-17 01:47:58 MDT
Jurij,

I'm thinking about sharing a debugging patch with you - it will add some additional messages that should help us narrow down the code paths. Are you able to apply such a patch on your system?

cheers,
Marcin
Comment 56 Jurij Pečar 2022-08-17 04:07:25 MDT
Sure, we build our own rpms so this should not be an issue. 
Where would you like this patch to take effect? I assume on both slurmctlds as well as on the affected nodes?
Comment 57 Marcin Stolarek 2022-08-25 07:56:38 MDT
Created attachment 26467 [details]
debugging patch

Jurij,

I'm sorry it took so much time. Please find the debugging patch attached - it's required on slurmctld only.
All log messages added by it are gated by DebugFlags=Gres (so at least the verbose logging level is required too). Please keep an eye on the log volume while running with it enabled.

The patch should apply on top of Slurm 21.08.

cheers,
Marcin
Comment 58 Jurij Pečar 2022-09-09 05:40:27 MDT
I waited until I finished migrating the last batch of nodes to el8. Now we're completely on el8: no more el7 nodes and no more frequent config changes, at least until the end of this year when we expect some more hardware to arrive.

Patched slurmctld is now running.

Just to confirm, I need to set SlurmctldDebug=verbose and DebugFlags=Gres in slurm.conf to capture any of these new messages, right?
Comment 59 Jurij Pečar 2022-09-12 09:46:30 MDT
Ok, I already see sb03-11 in IDLE+POWERED_DOWN with AllocTRES=gres/gpu=1.
My ipmi wrapper log tells me that it was shut down at 2022-09-11 13:56:48.

Attached is the log from yesterday, up to the point sb03-11 was shut down. Hope it helps you understand what's going on here. (Warning: 14GB)
Comment 60 Jurij Pečar 2022-09-12 09:47:56 MDT
Created attachment 26724 [details]
slurmctld log with debug patch
Comment 61 Marcin Stolarek 2022-09-20 22:08:59 MDT
Jurij,

Could you please check:
>sacct -D -j49921463  -o AllocTRES,ReqTRES,Start,End,NodeLi -p
and
>sacctmgr show event Nodes=sb03-11 Start=<TimeBeforeJobStart> End=<TimeAfterJobEnd>

cheers,
Marcin
Comment 62 Marcin Stolarek 2022-09-23 04:53:31 MDT
Can you also attach earlier logs - starting when job 49921463 was allocated resources?
Comment 68 Marcin Stolarek 2022-10-20 05:48:21 MDT
Jurij,

Are you able to upload the logs from slurmctld starting at the time when job 49921463 was allocated resources?

cheers,
Marcin
Comment 69 Jurij Pečar 2022-10-24 01:03:02 MDT
Sorry, I was OoO for a while and those logs slipped past ... But I have a new case with hopefully everything you need.

It starts with slurmctld.log:

[2022-10-23T17:33:11.651] gres/gpu: state for sb03-14
[2022-10-23T17:33:11.651]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2022-10-23T17:33:11.651]   gres_bit_alloc: of 1
[2022-10-23T17:33:11.651]   gres_used:(null)
[2022-10-23T17:33:11.651]   type[0]:A100(808464705)
[2022-10-23T17:33:11.651]    type_cnt_alloc[0]:0
[2022-10-23T17:33:11.651]    type_cnt_avail[0]:1
[2022-10-23T17:33:11.651] gres/gpu_mem: state for sb03-14
[2022-10-23T17:33:11.651]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2022-10-23T17:33:11.651]   gres_bit_alloc:NULL
[2022-10-23T17:33:11.651]   gres_used:(null)
[2022-10-23T17:33:11.651] gres/tmp: state for sb03-14
[2022-10-23T17:33:11.651]   gres_cnt found:TBD configured:404800667648 avail:404800667648 alloc:0
[2022-10-23T17:33:11.651]   gres_bit_alloc:NULL
[2022-10-23T17:33:11.651]   gres_used:(null)

And then few seconds later

[2022-10-23T17:33:17.006] gres/gpu: state for sb03-14
[2022-10-23T17:33:17.006]   gres_cnt found:TBD configured:1 avail:1 alloc:1
[2022-10-23T17:33:17.006]   gres_bit_alloc:0 of 1
[2022-10-23T17:33:17.006]   gres_used:(null)
[2022-10-23T17:33:17.006]   type[0]:A100(808464705)
[2022-10-23T17:33:17.006]    type_cnt_alloc[0]:0
[2022-10-23T17:33:17.006]    type_cnt_avail[0]:1
[2022-10-23T17:33:17.006] gres/gpu_mem: state for sb03-14
[2022-10-23T17:33:17.006]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2022-10-23T17:33:17.006]   gres_bit_alloc:NULL
[2022-10-23T17:33:17.006]   gres_used:(null)
[2022-10-23T17:33:17.006] gres/tmp: state for sb03-14
[2022-10-23T17:33:17.006]   gres_cnt found:TBD configured:404800667648 avail:404800667648 alloc:0
[2022-10-23T17:33:17.006]   gres_bit_alloc:NULL
[2022-10-23T17:33:17.006]   gres_used:(null)
[2022-10-23T17:33:17.006] gres_job_state gres:gpu(7696487) type:(null)(0) job:54265158 flags:
[2022-10-23T17:33:17.006]   gres_per_node:1 node_cnt:1
[2022-10-23T17:33:17.006]   ntasks_per_gres:65534
[2022-10-23T17:33:17.006]   total_gres:1
[2022-10-23T17:33:17.006]   node_cnt:1
[2022-10-23T17:33:17.006]   gres_cnt_node_alloc[0]:1
[2022-10-23T17:33:17.006]   gres_bit_alloc[0]:0 of 1
[2022-10-23T17:33:17.006]   gres_cnt_step_alloc[0]:0
[2022-10-23T17:33:17.006] select/cons_res: select_p_select_nodeinfo_set: Alloc GRES
[2022-10-23T17:33:17.007] gres_job_state gres:gpu(7696487) type:A100(808464705) job:54265158 flags:
[2022-10-23T17:33:17.007]   node_cnt:1
[2022-10-23T17:33:17.007]   gres_cnt_node_alloc[0]:NULL
[2022-10-23T17:33:17.007]   gres_bit_alloc[0]: of 1
[2022-10-23T17:33:17.007]   gres_bit_step_alloc[0]:NULL
[2022-10-23T17:33:17.007]   gres_cnt_step_alloc[0]:0
[2022-10-23T17:33:17.007] sched: Allocate JobId=54265158 NodeList=sb03-14 #CPUs=32 Partition=gpu-el8

My ipmi wrapper log then shows this node powering up:

2022-10-23 17:33:18  woke up sb03-14

When it comes up, slurmctld.log says:

[2022-10-23T17:36:34.451] GRES: Node:sb03-14 Gres:gpu Type:A100 UniqueId:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_OPENCL,ENV_DEFAULT CPU_IDs:(null) CPU#:64 Count:1 Links:(null)
[2022-10-23T17:36:34.451] GRES: Node:sb03-14 Gres:tmp Type:(null) UniqueId:(null) Flags: CPU_IDs:(null) CPU#:64 Count:404800667648 Links:(null)
[2022-10-23T17:36:34.451] gres/gpu: state for sb03-14
[2022-10-23T17:36:34.451]   gres_cnt found:1 configured:1 avail:1 alloc:1
[2022-10-23T17:36:34.451]   gres_bit_alloc:0 of 1
[2022-10-23T17:36:34.451]   gres_used:gpu:A100:0
[2022-10-23T17:36:34.451]   topo[0]:A100(808464705)
[2022-10-23T17:36:34.451]    topo_core_bitmap[0]:NULL
[2022-10-23T17:36:34.451]    topo_gres_bitmap[0]:0 of 1
[2022-10-23T17:36:34.451]    topo_gres_cnt_alloc[0]:1
[2022-10-23T17:36:34.451]    topo_gres_cnt_avail[0]:1
[2022-10-23T17:36:34.451]   type[0]:A100(808464705)
[2022-10-23T17:36:34.451]    type_cnt_alloc[0]:0
[2022-10-23T17:36:34.451]    type_cnt_avail[0]:1
[2022-10-23T17:36:34.451] gres/gpu_mem: state for sb03-14
[2022-10-23T17:36:34.451]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2022-10-23T17:36:34.451]   gres_bit_alloc:NULL
[2022-10-23T17:36:34.451]   gres_used:gpu_mem:0
[2022-10-23T17:36:34.451] gres/tmp: state for sb03-14
[2022-10-23T17:36:34.451]   gres_cnt found:404800667648 configured:404800667648 avail:404800667648 alloc:0
[2022-10-23T17:36:34.451]   gres_bit_alloc:NULL
[2022-10-23T17:36:34.451]   gres_used:tmp:0
[2022-10-23T17:36:34.451] Node sb03-14 now responding

Job then finishes, but:

[2022-10-23T17:38:28.536] _job_complete: JobId=54265158 WEXITSTATUS 0
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1084: JobId=54265158, NodeName=sb03-14
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1106: Job and node have gres_bit_alloc
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1154: This is a never happen case now  node_gres_ptr->gres_cnt_alloc: 1
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1161: Job has gres_bit_alloc and node has a topo_gres_bitmap and topo_gres_cnt_alloc
[2022-10-23T17:38:28.548] gres/gpu: state for sb03-14
[2022-10-23T17:38:28.548]   gres_cnt found:1 configured:1 avail:1 alloc:1
[2022-10-23T17:38:28.548]   gres_bit_alloc:0 of 1
[2022-10-23T17:38:28.548]   gres_used:(null)
[2022-10-23T17:38:28.548]   topo[0]:A100(808464705)
[2022-10-23T17:38:28.548]    topo_core_bitmap[0]:NULL
[2022-10-23T17:38:28.548]    topo_gres_bitmap[0]:0 of 1
[2022-10-23T17:38:28.548]    topo_gres_cnt_alloc[0]:1
[2022-10-23T17:38:28.548]    topo_gres_cnt_avail[0]:1
[2022-10-23T17:38:28.548]   type[0]:A100(808464705)
[2022-10-23T17:38:28.548]    type_cnt_alloc[0]:0
[2022-10-23T17:38:28.548]    type_cnt_avail[0]:1
[2022-10-23T17:38:28.548] gres/gpu_mem: state for sb03-14
[2022-10-23T17:38:28.548]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2022-10-23T17:38:28.548]   gres_bit_alloc:NULL
[2022-10-23T17:38:28.548]   gres_used:gpu_mem:0
[2022-10-23T17:38:28.548] gres/tmp: state for sb03-14
[2022-10-23T17:38:28.548]   gres_cnt found:404800667648 configured:404800667648 avail:404800667648 alloc:0
[2022-10-23T17:38:28.548]   gres_bit_alloc:NULL
[2022-10-23T17:38:28.548]   gres_used:tmp:0
[2022-10-23T17:38:28.548] _job_complete: JobId=54265158 done

At this point the node is sitting idle but has gres/gpu allocated, so it can't take any new jobs.

All that remains is for slurm to power it off 1h later:

2022-10-23 18:40:37  shut down sb03-14

So it seems like we need to understand where that "never happen case" comes from ...

Hope that helps.
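Since the log volume here is enormous, one way to pull out every occurrence of that path would be to pair each `_job_dealloc` header line with a following "never happen case" line. A sketch (the message formats are taken from the debug-patch output quoted above; the helper name is mine):

```python
import re

# Header line written at the start of each GRES job deallocation.
DEALLOC = re.compile(r"GRES: _job_dealloc:\d+: JobId=(\d+), NodeName=(\S+)")
# Marker added by the debugging patch for the unexpected code path.
NEVER = re.compile(r"never happen case")

def affected_pairs(log_text: str) -> list[tuple[str, str]]:
    """Return (job_id, node) pairs whose dealloc hit the 'never happen' path."""
    pairs = []
    current = None
    for line in log_text.splitlines():
        m = DEALLOC.search(line)
        if m:
            current = (m.group(1), m.group(2))
            continue
        if NEVER.search(line) and current:
            pairs.append(current)
            current = None
    return pairs
```

Running this over the full slurmctld log would show whether the "never happen case" always correlates with nodes that later get stuck with a stale AllocTRES counter.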
Comment 70 Marcin Stolarek 2022-10-24 01:53:27 MDT
Jurij,

Could you please attach a full slurmctld log covering the complete time span of the last job's lifetime (from submit to complete)?

cheers,
Marcin
Comment 71 Jurij Pečar 2022-10-24 02:05:58 MDT
Created attachment 27410 [details]
10 minutes around job 54265158

Attached. Luckily the job was relatively short (17:33:10 to 17:38:28).
Comment 72 Marcin Stolarek 2022-10-25 04:39:32 MDT
Jurij,

Thanks for that! Could you please check if you can find a log message like:
>_slurm_rpc_submit_batch_job: JobId=54265158 InitPrio=[0-9]+ usec=[0-9]+
in earlier logs? (That is when the job was really submitted.)

I'm really curious to see the whole job history from submission to completion.
Additionally, can you share:

>sacct -D -j54265158  -o AllocTRES,ReqTRES,Submit,Start,End,NodeLi -p
and
>sacctmgr show event Nodes=sb03-14 Start=2022-10-23T17:33:11 End=2022-10-23T17:38:28

In the output of sacct you'll have the job submit time in the 3rd column. Could you please upload the logs starting from that time up to
>[2022-10-23T17:30:01.392] _job_complete: JobId=54637611_259(54638681) WEXITSTATUS 0
where the log you already submitted starts.

cheers,
Marcin
Comment 73 Marcin Stolarek 2022-10-25 05:18:35 MDT
Jurij,

Please ignore the last request - I can reliably reproduce the issue.

cheers,
Marcin
Comment 77 Marcin Stolarek 2022-11-03 08:10:54 MDT
Just wanted to let you know that I'm sending a patch fixing the issue to our QA queue.

cheers,
Marcin
Comment 78 Jurij Pečar 2022-11-03 08:15:45 MDT
Thanks, good to hear.

Can you tell me whether the fix is specific to 21.08 or also applicable to 22.05? I'm planning to upgrade to 22.05 at the end of the year or in early January.
Comment 79 Marcin Stolarek 2022-11-03 08:57:52 MDT
Keep in mind that the patch hasn't passed our QA yet. It should be easy to apply locally on top of 21.08. However, we're going to merge it to a newer version - 22.05 or even master, Slurm 23.02 to be.

cheers,
Marcin
Comment 84 Marcin Stolarek 2022-11-23 04:50:12 MST
Jurij,

While we're working on the patch, I think I have a potential workaround for you. The node should work just fine if slurmd registers with slurmctld before a job is scheduled to it, so if you:
1. Drain those nodes
2. Restart slurmctld (to clear AllocTRES)
3. Power on the nodes
4. When you see they are drained but not powered off according to scontrol show node - resume the nodes.

The initial registration of the nodes may be required after every slurmctld restart. After that, however, Slurm should be able to correctly suspend/resume those nodes.
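Sketched as commands (the node list is taken from this site's config; step 3 is whatever the site's power-on mechanism is, here the ipmitool wrapper):

```
# 1. drain the affected nodes
scontrol update NodeName=sb03-[05-20] State=DRAIN Reason="clear stale AllocTRES"

# 2. restart slurmctld to clear the stale AllocTRES counters
systemctl restart slurmctld

# 3. power the nodes on out of band (site-specific)

# 4. once they register and show as drained (not powered off) in
#    scontrol show node, resume them
scontrol update NodeName=sb03-[05-20] State=RESUME
```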

cheers,
Marcin
Comment 86 Jurij Pečar 2022-12-06 02:27:40 MST
Any chance this patch lands in a slurm release this year? Maybe 22.05.7?
We usually schedule major cluster maintenance (including slurm upgrades) at the end of each year. It would be nice to have the fix in time ...

Thanks.
Comment 87 Marcin Stolarek 2022-12-07 23:53:02 MST
>Any chance this patch lands in a slurm release still this year? Maybe 22.05.7?
We're targeting the fix at the master branch - Slurm 23.02 to be - so it's unlikely that the fix will land in 22.05.

Besides the workaround from comment 84, I just wanted to check whether you are aware of SuspendExcNodes[1]. It's obviously not a fix, but since the issue occurs only on a subset of nodes it may be worth considering.

cheers,
Marcin
[1]https://slurm.schedmd.com/slurm.conf.html#OPT_SuspendExcNodes
Comment 92 Marcin Stolarek 2023-01-10 02:43:34 MST
Jurij,

We're still looking for the best way to address the issue. However I wanted to share a configuration workaround that should be enough to mitigate it even on Slurm 21.08.

Could you please try configuring those nodes as State=CLOUD and add SlurmctldParameters=cloud_dns?
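In slurm.conf terms the suggestion would look roughly like this (a sketch; the node definition is this site's existing one, with only State=CLOUD and the SlurmctldParameters line added):

```
NodeName=sb03-[05-20] State=CLOUD CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515590 MemSpecLimit=1024 TmpDisk=377000 Feature=HT,cpu2.5GHz,avx2,net25G,gpu=A100,rome,znver2,ssd Gres=gpu:A100:1,tmp:377G

# resolve cloud node addresses via DNS rather than relying on the
# address reported at slurmd registration
SlurmctldParameters=cloud_dns
```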

cheers,
Marcin
Comment 93 Jurij Pečar 2023-01-11 01:27:00 MST
During our end-of-year maintenance we upgraded to Slurm 22.05.7. I also instructed users to switch from --gres=gpu... to --gpus N, --cpus-per-gpu, etc., and we haven't seen this issue yet. I will monitor things closely in case it shows up again.
Comment 94 Marcin Stolarek 2023-01-11 02:36:41 MST
>I also instructed users to switch from --gres=gpu... to --gpus N

Just to be 100% clear, --gpus isn't a replacement for --gres=gpu: --gpus requests a type/number of GPU devices per job, while --gres=gpu is a per-node specification.
Resource selection for a job allocation is often easier to reason about, and better for scheduler performance, when the per-node specification is used.
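A concrete illustration of the difference (hypothetical submissions; job.sh stands for any batch script):

```
# per-node: one A100 on each of the two allocated nodes (2 GPUs total)
sbatch --gres=gpu:A100:1 -N 2 job.sh

# per-job: two GPUs total, which may land on one node or be spread
# across the two, at the scheduler's discretion
sbatch --gpus=2 -N 2 job.sh
```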

>[...]Will monitor things closely if it shows up again.
Sounds good; however, from the code's perspective it's still possible.

cheers,
Marcin
Comment 95 Jurij Pečar 2023-01-11 02:45:21 MST
Thanks for clarifying this. The large majority of our jobs are single node; we have only two or three MPI-capable codes in regular use and those run fine over ethernet. We just deployed our first infiniband rack and we'll be exploring more MPI-intensive multi-node codes this year.
Comment 96 Jurij Pečar 2023-01-17 12:13:51 MST
Just to confirm, the issue does indeed still happen. I've switched to the CLOUD config as suggested in comment #92. We now have a large batch of gpu jobs in the queue and so far things look fine.
Comment 102 Marcin Stolarek 2023-02-03 01:30:51 MST
Jurij,

I know it was a long bug, but I have good news to share: the patch for the reported issue is merged to our public repository. The most important commit is a54f9cffff[1]; since it requires protocol changes, it will be part of the Slurm 23.02 release.

Let me know if you have any questions.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/a54f9cffff7aed32354d7eed696a50d6d8831666
Comment 103 Marcin Stolarek 2023-02-16 03:33:28 MST
Jurij,

I read no reply as no questions from your side. I'll go ahead and close the case now.

Should you have any questions please reopen.

cheers,
Marcin