Description
Jurij Pečar
2022-07-18 02:03:38 MDT
Jurij,
At a glance the job is pending because of "Priority", which strictly means that there are higher priority jobs waiting for resources. A simple experiment to check that would be to update the job priority to a very high value, like:
>scontrol update job=44704138 priority=99999999
It may be that some configuration tuning is required to make the backfill scheduler reach that job.
Please try the command above and let us know how that works. Attach slurm.conf and sdiag output and we'll continue debugging the case.
cheers,
Marcin
Created attachment 25908 [details]
sdiag output
Created attachment 25909 [details]
current slurm.conf
About 30 min ago I bumped the priority on another array job of this kind to bring it to the front of the queue:

# sq -t pd -p gpu-el8
JOBID                     NAME                ST  QOS     PARTITION  PRIORITY  NODES  TIME_LIMIT  NODELIST(REASON)
44951894_[3066-3499]      array               PD  normal  gpu-el8    4100      1      6:00:00     (Resources)
44690533_[4020-4021,4024- array               PD  low     gpu-el8    4058      1      6:00:00     (Priority)
44843535                  aa_SNX33_HUMAN.job  PD  normal  gpu-el8    3659      1      7-00:00:00  (Priority)
44963488                  220625.job          PD  normal  gpu-el8    3399      1      7-00:00:00  (Priority)
45016847                  array               PD  normal  gpu-el8    3362      1      2-00:00:00  (Priority)

And jobs in this array look like this:

# scontrol show job 44951894_3066
JobId=44951894 ArrayJobId=44951894 ArrayTaskId=3066 JobName=array
UserId=kosinski(21549) GroupId=kosinski(721) MCS_label=N/A
Priority=4100 Nice=0 Account=kosinski QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=06:00:00 TimeMin=N/A
SubmitTime=2022-07-18T17:03:02 EligibleTime=2022-07-18T17:03:05
AccrueTime=2022-07-18T17:03:05
StartTime=2022-07-21T09:20:00 EndTime=2022-07-21T15:20:00 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-19T09:59:48 Scheduler=Main
Partition=gpu-el8 AllocNode:Sid=login01:2082155
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=sb03-16
NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=125G,node=1,billing=16,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=125G MinTmpDiskNode=0
Features=gpu=A100 DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/g/kosinski/kosinski/Giardia/AlphaFold/run_AF_monomer.sh
WorkDir=/g/kosinski/kosinski/Giardia/AlphaFold
StdErr=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3066_err.txt
StdIn=/dev/null
StdOut=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3066_out.txt
Power=
TresPerNode=gres:gpu:1

However, our A100 nodes remain powered off:

# sinfo -Nlp gpu-el8
Tue Jul 19 10:04:30 2022
NODELIST  NODES  PARTITION  STATE      CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
gpu25     1      gpu-el8    allocated  224   2:56:2  103163  126185    1       HT,cpu2.  none
gpu26     1      gpu-el8    draining@  224   2:56:2  103163  126185    1       HT,cpu2.  Reboot ASAP
gpu27     1      gpu-el8    draining@  224   2:56:2  103163  126185    1       HT,cpu2.  Reboot ASAP
gpu28     1      gpu-el8    draining@  224   2:56:2  103163  126185    1       HT,cpu2.  Reboot ASAP
sb03-05   1      gpu-el8    mixed      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-06   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-07   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-08   1      gpu-el8    mixed      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-09   1      gpu-el8    mixed      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-10   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-11   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-12   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-13   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-14   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-15   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-16   1      gpu-el8    mixed      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-17   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-18   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-19   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none
sb03-20   1      gpu-el8    idle~      64    1:32:2  515590  377000    1       HT,cpu2.  none

Sdiag output and slurm.conf attached.

Thanks for the scontrol show job output - it shows that the job was recently checked by the main scheduler (so definitely not a priority/backfill issue). Could you please share your slurmctld.log and scheduler.log?
cheers,
Marcin

Created attachment 25910 [details]
slurmctld.log and scheduler.log
Attached. I assume you'll want to see increased verbosity?

Could you please share: scontrol show job -d --oneline ?
cheers,
Marcin

# scontrol show job -d --oneline 44951894_3078
JobId=44951894 ArrayJobId=44951894 ArrayTaskId=3078 JobName=array UserId=kosinski(21549) GroupId=kosinski(721) MCS_label=N/A Priority=4100 Nice=0 Account=kosinski QOS=normal JobState=PENDING Reason=Resources Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0 RunTime=00:00:00 TimeLimit=06:00:00 TimeMin=N/A SubmitTime=2022-07-18T17:03:02 EligibleTime=2022-07-18T17:03:05 AccrueTime=2022-07-18T17:03:05 StartTime=2022-07-21T11:28:00 EndTime=2022-07-21T17:28:00 Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-19T11:57:06 Scheduler=Main Partition=gpu-el8 AllocNode:Sid=login01:2082155 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) SchedNodeList=sb03-05 NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=16,mem=125G,node=1,billing=16,gres/gpu=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=125G MinTmpDiskNode=0 Features=gpu=A100 DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/g/kosinski/kosinski/Giardia/AlphaFold/run_AF_monomer.sh WorkDir=/g/kosinski/kosinski/Giardia/AlphaFold StdErr=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3078_err.txt StdIn=/dev/null StdOut=/g/kosinski/kosinski/Giardia/AlphaFold/logs/run_monomer_44951894_3078_out.txt Power= TresPerNode=gres:gpu:1

Created attachment 25911 [details]
all jobs
Can you share the info for all jobs - the previous scontrol command without a JOBID? I'd like to check whether those nodes are reserved by backfill for a higher priority job in the htc-el8 partition. Am I correct that those nodes are shared between those partitions? At the same time, the gpu-el8 partition doesn't have PriorityJobFactor=3000, which leaves it at the default of 1, so its jobs very likely compare as lower priority against that partition.
>I assume you'll want to see increased verbosity?
If you can handle higher verbosity for some time (based on the logs, ~30 minutes should be enough), please switch to:
SlurmctldDebug=debug2
DebugFlags=NO_CONF_HASH,SelectType,TraceJobs
You can enable/disable both without a slurmctld restart using scontrol (scontrol setdebug / scontrol setdebugflags).
cheers,
Marcin
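The PriorityJobFactor point above can be made concrete. Per the Slurm multifactor priority documentation, each partition's PriorityJobFactor is normalized against the largest factor of any partition before being scaled by PriorityWeightPartition. A hedged sketch (the weight value below is made up for illustration, not taken from this site's config):

```python
# Hedged sketch of the partition term in Slurm's multifactor priority.
# Assumption: the partition factor is normalized against the largest
# PriorityJobFactor among all partitions, then scaled by
# PriorityWeightPartition (per the multifactor priority docs).
def partition_priority_term(job_factor, max_job_factor, weight_partition):
    """Contribution of the partition factor to a job's priority."""
    return int(weight_partition * (job_factor / max_job_factor))

# With htc-el8 at PriorityJobFactor=3000 and gpu-el8 at the default of 1,
# gpu-el8 jobs get essentially nothing from the partition term:
w = 1000  # hypothetical PriorityWeightPartition
print(partition_priority_term(3000, 3000, w))  # 1000 (htc-el8)
print(partition_priority_term(1, 3000, w))     # 0 (gpu-el8)
```

This is why jobs in a partition without an explicit PriorityJobFactor can sort well below jobs in a partition that sets one.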
Created attachment 25912 [details]
debug logs (over 5GB)
See "all jobs" attachment above. No, these nodes are dedicated to the gpu-el8 partition and are not shared anywhere. Am I then correct in thinking that PriorityJobFactor on this partition is not needed? I've now gathered ~30 min of debug logs and uploaded them. Let me know if you need anything else to figure this out.

>See "all jobs" attachment above.
Thanks for that.
>No, these nodes are dedicated to gpu-el8 partition and are not shared anywhere. Am I then correct in thinking that PriorityJobFactor on this partition is not needed?
Yes - you are correct. I mixed up the two ranges in the sb node names. The PriorityJobFactor may still have some impact, but it shouldn't be major - for instance, a lower priority job may not be checked by backfill when a "global" limit like bf_max_job_test is reached.
>Let me know if you need anything else to figure this out.
Could you please share topology.conf and `scontrol show res`? I remember you noticed this started after the upgrade, but I still want to check the basics just to be sure.
cheers,
Marcin

I have two reservations in place for August and September to finish redeploying nodes over to el8. However, the nodes in question (sb03-[05-20]) are already on el8.
# scontrol show res
ReservationName=el8move_step2 StartTime=2022-08-08T12:00:00 EndTime=2022-08-12T18:00:00 Duration=4-06:00:00
Nodes=gpu[8-14,21-22],sm-epyc-[01-03],smer01-[1-4],smer02-[1-4],smer03-[1-4],smer04-[1-4],smer05-[1-4],smer06-[1-4],smer07-[1-4],smer08-[1-4],smer09-[1-4],smer10-[1-4],smer11-[1-4],smer12-[1-4],smer13-[1-4],smer14-[1-4],smer15-[1-4] NodeCnt=72 CoreCnt=4400
Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES TRES=cpu=8800
Users=pecar Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

ReservationName=el8move_step3 StartTime=2022-09-05T12:00:00 EndTime=2022-09-09T18:00:00 Duration=4-06:00:00
Nodes=gpu[15-20,23-24],sm-epyc-[04-05],smer16-[1-4],smer17-[1-4],smer18-[1-4],smer19-[1-4],smer20-[1-4],smer21-[1-4],smer22-[1-4],smer23-[1-4],smer24-[1-4],smer25-[1-4],smer26-[1-4],smer27-[1-4],smer28-[1-4],smer29-[1-4],smer30-[1-4] NodeCnt=70 CoreCnt=4288
Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES TRES=cpu=8576
Users=pecar Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

And our topology.conf is reasonably simple:
SwitchName=spine Switches=leaf[1-5],sb[03-05],other
SwitchName=leaf1 Nodes=smer[01-20]-[1-4],login01
SwitchName=leaf2 Nodes=smer[21-30]-[1-4],login02
SwitchName=leaf3 Nodes=sm-epyc-0[1-5],gpu[27-28]
SwitchName=leaf4 Nodes=gpu[21-24],smem[01-10]-[1-4]
SwitchName=leaf5 Nodes=gpu[8-20,25-26]
SwitchName=sb03 Nodes=sb03-[01-20]
SwitchName=sb04 Nodes=sb04-[01-20]
SwitchName=sb05 Nodes=sb05-[01-20]
SwitchName=other Nodes=bn[01-04]

Are you running a federated setup?
cheers,
Marcin

No, not that I'm aware of. News: I see that all our A100 nodes are now up and chewing on jobs. Checking their uptime, it looks like this happened when I switched the debugging level and debug flags, which matches my initial observation that a slurmctld restart fixed things in the past.
Hope the debug logs capture why jobs were being held back initially ...

Jurij,
Can you try to reproduce the issue and collect logs with the "Power" DebugFlags (and at least verbose log level) enabled? I'm still trying to reproduce this, looking at the logs and code.
cheers,
Marcin

Ok, I had a gut feeling that this is somehow related to powering idle nodes off but didn't manage to test this hypothesis. As you can see, I'll need to wait for the current set of jobs to finish and then wait for the issue to reoccur. Could be days or weeks, but I'll definitely come back to you with additional debug logs.

Created attachment 25965 [details]
debug logs with Power
Here's the next set of logs (and a list of jobs), hopefully catching the problem (over 11GB) at debug level. I made three attempts at recreating it, setting the partition down and up to allow power saving to shut down nodes in between. The first attempt was around noon, where I think I got the debugflags command wrong, but the nodes came up immediately. The second was around 14:20, where nodes also came up as expected; here I set the partition state to up and then set the debugflags. The third began at about 16:53, when I set the debugflags first and only then (about 16:55) set the partition state to up. Nothing much happened on the powered-down nodes for the next 15 minutes. There was one other job pending with Resources (45338245), waiting for an A40 node, so I added Backfill and later BackfillMap to the debugflags as well. Then shortly before 17:30 that job got scheduled, and then the A100 nodes were powered up and started chewing on jobs (45220650). Hope that sheds more light on why this is happening.

Jurij,
I'm still looking at the logs and code to fully understand the flow; however, two things I wanted to point out are:
1) Around 14:25 we have multiple 'POWER: resume rate reached' messages, which means that automatic power-on of nodes will be delayed because of the configured ResumeRate. You have it set to only 20, which is quite low. I think you may want to recheck the reason for that.
2) There are quite long time spans where you have a lot of 'Note large processing time...' messages, which is a symptom of overload. This may be a reason for jobs not being evaluated on time by the scheduler, or for other periodic activities (like node power-up) being delayed.
Those are most often:
>7214 _slurm_rpc_dump_job_user:
>5922 _slurm_rpc_dump_job_single:
>4912 _slurm_rpc_complete_batch_script:
>3949 _slurm_rpc_dump_jobs:
>3872 _slurm_rpc_dump_nodes:
>3342 _slurm_rpc_complete_prolog:
The dump job/node RPCs are the result of scontrol show/squeue queries. In this case I'd recommend checking which users are the top query makers (in sdiag) and educating them to avoid running commands like squeue in a loop, since that can overload slurmctld.
cheers,
Marcin
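Ranking the noisy callers from sdiag's per-user RPC statistics can be scripted. A hedged sketch, assuming the line format matches the sdiag excerpts quoted in this ticket (real sdiag output may differ slightly between versions):

```python
import re

# Hedged helper: rank users by total_time from the "Remote Procedure Call
# statistics by user" block of `sdiag` output.  The line format assumed here
# mirrors the excerpts quoted in this ticket.
LINE_RE = re.compile(
    r"^\s*(\S+)\s+\(\s*\d+\)\s+count:(\d+)\s+ave_time:(\d+)\s+total_time:(\d+)")

def top_rpc_users(sdiag_text, n=3):
    rows = []
    for line in sdiag_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append((m.group(1), int(m.group(2)), int(m.group(4))))
    # sort by total_time, largest first
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]

sample = """Remote Procedure Call statistics by user
        root       (    0) count:7580184 ave_time:84217  total_time:638387759486
        marotta    (24809) count:393859  ave_time:33055  total_time:13019222339
        pekel      (25255) count:95579   ave_time:311218 total_time:29745976681
"""
for user, count, total in top_rpc_users(sample):
    print(user, count, total)
```

Combined with `sdiag -r` to reset counters between samples, this gives a quick differential view of who is generating load.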
1) I've set this to a ballpark figure of 10% of cluster nodes, with the idea of preventing huge spikes on the electrical side and instead ensuring a more staggered load increase. I can increase it for a test.
2) We have a bunch of scripts, both generating a login greeting with a fresh overview of the cluster state and feeding ganglia with per-group (=Account) CPU, memory and GPU allocations. Both run every minute. So far I haven't seen any negative consequences from them. Do you think what we're seeing here could be a consequence of these? In that case, what could be done to increase slurmctld performance? I can try running these scripts every other minute and see if this changes anything.

I'm still not 100% sure about the behavior you described in the initial comments, but slurmctld may definitely be underperforming from time to time due to the overload. For instance, log lines like (305 occurrences in the shared log):
>[2022-07-21T14:19:04.772] sched/backfill: _yield_locks: continuing to yield locks, 40 RPCs pending
mean that backfill doesn't resume job scheduling after yielding locks because of the high number of waiting RPCs. There are also time spans, like [2022-07-21T16:53:55.062] to [2022-07-21T17:22:07.055], when "[...]very large processing time[...]" happens really often, which very likely results in poor performance.
Based on the sdiag result you shared before, there are users making quite a lot of RPCs; are those related to your monitoring?
>Remote Procedure Call statistics by user
> root     (     0) count:7580184 ave_time:84217  total_time:638387759486
> marotta  ( 24809) count:393859  ave_time:33055  total_time:13019222339
> pekel    ( 25255) count:95579   ave_time:311218 total_time:29745976681
> tweber   ( 25221) count:92908   ave_time:6326   total_time:587771022
> galaxy   ( 20239) count:52673   ave_time:8432   total_time:444161784
> zzhao    ( 24718) count:52372   ave_time:9596   total_time:502613792
> jendrusc ( 24710) count:13712   ave_time:14607  total_time:200295040
> fabreges ( 22758) count:12835   ave_time:2485   total_time:31900771
Looking at the total_time for the top users, and keeping in mind that they are very likely generating this load in a short time, I'd say you should check their activity. You can use `sdiag -r` to reset the counters and then sdiag to see just the recent info.
cheers,
Marcin

The majority of those RPCs coming from users are generated by snakemake, nextflow and other tools of that sort, so I'm not sure how much I can decrease those numbers, as they're required for their operation to some degree. However, I can spend some time looking at our monitoring scripts and figure out whether there's something I can do there to polish things up. That "scontrol show jobs -d --oneline" from above will for sure come in handy ... Ok, I've managed to decrease the number of RPCs from root by about 60%. At which logging level do these "large processing time" log messages appear? I'll keep an eye on them ...

I'm continuing to look into the logs with a focus on the original issue; however, those tools are outside of SchedMD's expertise. From my professional experience, snakemake doesn't encourage users to use scontrol - they prefer sacct where possible, and the main repo even suggests not configuring a monitoring script if jobs don't fail often. Using sacct is probably still the better idea since it doesn't query slurmctld.
It should even be quite easy to modify the scripts responsible to cache their results for a specified number of seconds (for instance in a temp file) if this turns out to be a major issue.
>At which logging level do these "large processing time" log messages appear? I'll keep an eye on them ...
Those are displayed at debug level. I don't think you need to focus on them now. We also have a higher threshold with the message 'very large processing time', displayed at verbose level. In general my advice would be to look at sdiag, or even keep a differential plot of sdiag results per user/RPC type in a monitoring system, so you can match spikes there with other issues. Just the fact that a specific RPC takes a long time doesn't mean it's the cause of an issue - it may be that it is delayed by other queries. For instance, thousands of job-complete RPCs coming in at once are likely to delay the response to `squeue`.
>Ok, I've managed to decrease number of RPCs from root by about 60%.
Sounds good.
cheers,
Marcin

Jurij,
Another thing you may want to validate in your slurm.conf is:
>CompleteWait=300
Per man slurm.conf:
>CompleteWait
>The time to wait, in seconds, when any job is in the COMPLETING state before any additional jobs are scheduled. This is to
>attempt to keep jobs on nodes that were recently in use, with the goal of preventing fragmentation. If set to zero, pend‐
>ing jobs will be started as soon as possible. Since a COMPLETING job's resources are released for use by other jobs as
>soon as the Epilog completes on each individual node, this can result in very fragmented resource allocations. To provide
>jobs with the minimum response time, a value of zero is recommended (no waiting). To minimize fragmentation of resources,
>a value equal to KillWait plus two is recommended. In that case, setting KillWait to a small value may be beneficial.
>The default value of CompleteWait is zero seconds. The value may not exceed 65533.
So if you have any job in CG state, Slurm waits 300s before scheduling a job.
cheers,
Marcin

That's intentional. Ideally I would prefer that setting to be per-node, as we have fat nodes with 2T of memory and 9.4T of swap, and sometimes it takes minutes for Slurm to clean up after big jobs. Still, with this set to 5 min we get "Kill task failed" often enough that I've set up an hourly cron job to resume nodes in this state. Our regular nodes on el7 had TMPDIR pointing to node-local /tmp, implemented as tmpfs backed by SSD. Now on el8 I'm just pointing TMPDIR to a per-job folder on scratch, so I'll revisit this setting once we're done moving to el8 completely.

>Ideally I would prefer that setting to be per-node[..]
Could you please elaborate on that a little? As long as a node is in completing state (meaning at least one job on that node is completing), no new jobs are scheduled to that node. Setting CompleteWait means that no jobs are scheduled on any node for up to CompleteWait seconds if at least one job went into and stays in CG.
cheers,
Marcin
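The temp-file caching idea suggested earlier in the thread (for the login-greeting and ganglia scripts that query slurmctld every minute) can be sketched as a small wrapper. This is a hedged sketch: the command, cache path, and TTL are illustrative, not taken from the site's actual scripts.

```python
import os
import subprocess
import tempfile
import time

# Hedged sketch of the temp-file caching idea from the thread: reuse a
# command's output for `ttl` seconds so monitoring scripts running every
# minute don't hammer slurmctld with duplicate RPCs.
def cached_output(cmd, cache_file, ttl=30):
    """Return cmd's stdout, refreshing the cache only when older than ttl."""
    try:
        if time.time() - os.path.getmtime(cache_file) < ttl:
            with open(cache_file) as f:
                return f.read()
    except FileNotFoundError:
        pass  # no cache yet - run the command
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    with open(cache_file, "w") as f:
        f.write(out)
    return out

# Demo with a harmless stand-in for an expensive squeue/scontrol call:
cache = os.path.join(tempfile.gettempdir(), "squeue.cache")
if os.path.exists(cache):
    os.remove(cache)
first = cached_output("echo fresh", cache, ttl=30)
second = cached_output("echo stale-would-not-run", cache, ttl=30)
print(first == second)  # True: the second call is served from the cache
```

In production the stand-in `echo` would be the real `squeue`/`scontrol` invocation, and every consumer of the data would read through the cache instead of querying slurmctld directly.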
We again hit this situation over the weekend. Again there was a higher priority job submitted to gpu,gpu-el8 asking for a gres of any two GPUs (our A100 nodes have only one) and an array of jobs each asking for one A100 GPU. I modified the higher priority job to sit only in partition gpu, so as not to interfere with the job array destined for the A100s, and waited ... nothing happened for about 1h. Then I restarted slurmctld, and within minutes all A100 nodes were up and chewing on jobs. That happened in the evening. Funny enough, in the morning the array was not yet finished, but I found only two (of 16) A100 nodes up and busy while the others were idle~. Nothing much changed in the queue; this array job remained at the top through the night. It would be interesting to understand why slurmctld decided not to schedule pending jobs to idle nodes for so long that they went into the powered-down state... I've also checked for jobs in CG state and haven't seen any.

Jurij,
Am I correct that the waiting jobs you've noticed are always job arrays? I didn't focus on that before, but maybe it's the key to reproducing/understanding the issue.
cheers,
Marcin

Good observation. Since we still have a couple of jobs from that array pending and just two A100 nodes up to chew on them, I quickly submitted an interactive job asking for an A100, got resources allocated within seconds and got a shell once a node came up. Now I'm looking at it sitting idle; Slurm is not giving it jobs from that array. So it might very well be related to array handling.

>Now I'm looking at it sitting idle, slurm is not giving it jobs from that array. So it might very well be related to array handling.
Could you please share `scontrol show node` and `scontrol show job` for the current state?
I can't find any evidence of things working incorrectly in the debug log. It looks like the jobs are checked by the backfill scheduler and their start time is estimated; however, that's not consistent with what you see, since there are no other higher priority jobs waiting for those nodes and they just stay idle.
Did you notice those jobs being in CF/configuring state?
Created attachment 26198 [details]
current node & jobs state
Attached. Array 46665271_[86-90] is the one in question. No, these array jobs stay pending; they don't enter configuring state. Hm, at least not that I'm aware of - I have "-t pd" almost hardwired into my muscle memory for squeue. Will pay attention to that.

I see one very interesting thing there, for instance sb03-06:
>NodeName=sb03-06 Arch=x86_64 CoresPerSocket=32 [...] State=IDLE+POWERED_DOWN [...]CfgTRES=cpu=64,mem=515590M,billing=64,gres/gpu=1 AllocTRES=gres/gpu=1 [..]
> ^^^^^^^^^^^^^^^^^^^^^
which is obviously wrong - a powered-down node shouldn't have resources in the allocated state. Could you please check whether calling scontrol reconfigure (or restarting slurmctld) clears that counter?
cheers,
Marcin
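Nodes in this inconsistent state (POWERED_DOWN but with a GPU still in AllocTRES, as on sb03-06 above) can be spotted by scanning `scontrol show node --oneline` output. A hedged sketch - the sample lines below are abbreviated stand-ins, not real scontrol output:

```python
# Hedged sketch: flag nodes that are POWERED_DOWN yet still show a GPU in
# AllocTRES - the inconsistent state observed on sb03-06 in this ticket.
# Input is assumed to be `scontrol show node --oneline` style text with
# space-separated KEY=VALUE fields, one node per line.
def stale_powered_down_nodes(scontrol_text):
    stale = []
    for line in scontrol_text.splitlines():
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        state = fields.get("State", "")
        alloc = fields.get("AllocTRES", "")
        if "POWERED_DOWN" in state and "gres/gpu" in alloc:
            stale.append(fields.get("NodeName"))
    return stale

sample = (
    "NodeName=sb03-05 State=MIXED AllocTRES=cpu=16,gres/gpu=1\n"
    "NodeName=sb03-06 State=IDLE+POWERED_DOWN AllocTRES=gres/gpu=1\n"
    "NodeName=sb03-07 State=IDLE+POWERED_DOWN AllocTRES=\n"
)
print(stale_powered_down_nodes(sample))  # ['sb03-06']
```

Run periodically, something like this could alert on the stuck counter long before users notice jobs refusing to start.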
That's one thing I noticed in my initial post. Any idea how this can happen? And yes, scontrol reconfigure clears it.

>That's one thing I noticed in my initial post.
My bad, I lost that piece while looking at the code and logs. Could you please share your gres.conf?
NodeName=gpu8 Name=gpu Type=2080Ti File=/dev/nvidia[0-3] COREs=[0-15]
NodeName=gpu8 Name=gpu Type=2080Ti File=/dev/nvidia[4-7] COREs=[16-31]
NodeName=gpu9 Name=gpu Type=2080Ti File=/dev/nvidia[0-3] COREs=[0-15]
NodeName=gpu9 Name=gpu Type=2080Ti File=/dev/nvidia[4-7] COREs=[16-31]
NodeName=gpu10 Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=gpu[11-20] Name=gpu Type=2080Ti File=/dev/nvidia[0-3]
NodeName=sb03-[05-20] Name=gpu Type=A100 File=/dev/nvidia0
NodeName=gpu[21-24] Name=gpu Type=3090 File=/dev/nvidia[0-3] COREs=[0-31]
NodeName=gpu[21-24] Name=gpu Type=3090 File=/dev/nvidia[4-7] COREs=[32-63]
NodeName=gpu[25-28] Name=gpu Type=A40 File=/dev/nvidia[0-3] COREs=[0-55,112-167]
NodeName=gpu[25-28] Name=gpu Type=A40 File=/dev/nvidia[4-7] COREs=[56-111,168-223]
#TMPDIR as gres
NodeName=gpu[8-9] Name=tmp Count=3100G
NodeName=gpu10 Name=tmp Count=456G
NodeName=gpu[11-20] Name=tmp Count=467G
NodeName=gpu[21-24] Name=tmp Count=435G
NodeName=gpu[25-28] Name=tmp Count=124G
NodeName=sb03-[01-20] Name=tmp Count=467G
NodeName=sb[01-02]-[01-20] Name=tmp Count=844G
NodeName=sm-epyc-01 Name=tmp Count=7571G
NodeName=sm-epyc-0[2-5] Name=tmp Count=9400282M
NodeName=smer[01-20]-[1-4] Name=tmp Count=203G

Do you update nodes in your Resume/Suspend programs?
cheers,
Marcin

No, those are just wrappers around ipmitool. The closest thing to what you're asking is that we set state=resume in rc.local on el7 nodes, so that when they come up for whatever reason they're presented as ready. I remember we had some issues around that some years ago; that's why I added this to rc.local. I'm moving away from this on el8 on the assumption that the motivation for that line is long gone.

Jurij,
I'm playing with a very similar config, trying to trigger a failure in the node GRES counter deallocation, without success. Could you please check the last job that ran on one of those nodes and see if there is anything unusual related to the job's end?
The way it ends should give us a hint about the potential code path that led to the GRES counter not being decremented. I'm looking at the slurmctld log you shared trying to find those cases, but if you're able to find something interactively, that may be helpful too.
cheers,
Marcin

Ok, I will pay attention to that the next time this situation occurs. Any suggestions which debug flags might be helpful?

>Any suggestions which debug flags might be helpful?
Unfortunately there are no debug messages in that part of the code. Are any of the options below possible:
a) users use job shrinking feature?
b) are there still jobs in the queue that were submitted prior to the upgrade?
c) gres.conf was changed - specifically File=... was added(recently/some time ago)?
Is the issue only visible on A100 nodes?
cheers,
Marcin
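Related to question c) and the NO_CONF_HASH discussion that follows: Slurm normally compares a hash of the configuration between slurmd and slurmctld to detect drift. A quick manual equivalent is to fingerprint the config files fetched from each node and compare them. A hedged sketch with made-up file contents:

```python
import hashlib

# Hedged sketch of a manual config-consistency check, in the spirit of the
# conf-hash comparison that DebugFlags=NO_CONF_HASH suppresses: fingerprint
# the gres.conf/slurm.conf text gathered from each node and compare.
def conf_fingerprint(text):
    """Stable fingerprint of a config file's content."""
    return hashlib.md5(text.encode()).hexdigest()

# Made-up contents standing in for files fetched from three nodes:
node_a = "NodeName=sb03-05 Name=gpu Type=A100 File=/dev/nvidia0\n"
node_b = "NodeName=sb03-05 Name=gpu Type=A100 File=/dev/nvidia0\n"
node_c = "NodeName=sb03-05 Name=gpu Type=A100 File=/dev/nvidia[0-1]\n"

print(conf_fingerprint(node_a) == conf_fingerprint(node_b))  # True
print(conf_fingerprint(node_a) == conf_fingerprint(node_c))  # False
```

Any node whose fingerprint differs from the controller's copy is a candidate for the inconsistent-configuration scenario described below.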
Jurij,
I found a way to reproduce the issue, but the scenario requires an inconsistent, non-stable cluster configuration across the nodes. I've noticed you have:
>DebugFlags=NO_CONF_HASH
Can you disable the flag and check whether all nodes (slurmd & slurmctld) run the same config? Please take a look at the questions from comment 48 too.
cheers,
Marcin

Ah, another historic artifact. We manage our nodes (including the Slurm config) with puppet, and we recently moved to configless, so yes, all our nodes should have the same config. The only exception is our fat nodes, see bug #13503. As for the questions in #48:
a) highly unlikely, will ask
b) don't really understand the question, can you rephrase it?
c) yes, values in gres.conf change every now and then - I'm in the middle of the migration from el7 to el8. File= for GPUs usually only changes when I define new nodes.
Great that you managed to reproduce it; hope that leads to a patch soon :) Otherwise, wouldn't it be good enough to just clear any AllocTRES when power_down is triggered? As to whether this is limited to A100 nodes only, I can't really say. It could well be that it is simply most obvious on this small subset of nodes, especially now that our users are super eager for anything AlphaFold. If it manifested elsewhere before, it wasn't as obvious, and users were more patient waiting ...

>b) don't really understand the question, can you rephrase it?
You've mentioned that this started after the upgrade. Was that an online upgrade with jobs submitted - is it possible that jobs submitted before the upgrade are only starting now?
>Otherwise, wouldn't it be good enough to just clear any AllocTRES when power_down is triggered?
Not really; it looks like the node isn't correctly deallocated in this scenario, so even if there are jobs that should just start there, the node remains idle and is powered down - but really it shouldn't even be powered down.
The issue with the reproducer is that I have to manipulate gres.conf on the node while jobs are in CF/R state, which is generally an unexpected situation. Are you 100% sure there is no leftover gres.conf - e.g. a puppet agent putting an old gres.conf on a node from time to time? Are nodes started with the --conf-server option, or using an SRV record to point to the configuration server?
cheers,
Marcin

b) Not really, as the upgrade happened on May 4th, which is much further back in time than our max job duration. For configless we use SRV DNS records. And I'm reasonably confident that gres.conf is never dropped onto the A100 nodes by puppet. Our puppet only places slurm.conf, topology.conf, cgroup.conf and gres.conf onto the Slurm controllers, fat nodes and login nodes. The last change to any of these files was on Aug 9 16:36, yet this morning I again found only three A100 nodes up and a couple of jobs pending, waiting for an A100 to become available. Btw, users confirmed that this also happens for regular, non-array jobs.

Jurij,
I'm thinking about sharing a debugging patch with you - it will add some additional messages that should help us narrow down the code paths. Are you able to apply such a patch on the system?
cheers,
Marcin

Sure, we build our own rpms, so this should not be an issue. Where would you like to see this patch in effect? I assume on both slurmctlds as well as on the affected nodes?

Created attachment 26467 [details]
debugging patch
Jurij,
I'm sorry it took so much time. Please find the debugging patch attached - it's required on slurmctld only.
All log messages added by it are gated by DebugFlags=Gres (so at least verbose logging level is required too). Please keep an eye on the log volume while running with it enabled.
The patch should apply ontop of Slurm 21.08
cheers,
Marcin
I waited until I had finished migrating the last batch of nodes to el8. We're now completely on el8 - no more el7 nodes and no more frequent config changes, at least until the end of this year when we expect some more hardware to arrive. The patched slurmctld is now running. Just to confirm: I need to set SlurmctldDebug=verbose and DebugFlags=Gres in slurm.conf to capture any of these new messages, right?

Ok, I already see sb03-11 in IDLE+POWERED_DOWN with AllocTRES=gres/gpu=1. My ipmi wrapper log tells me that it was shut down on 2022-09-11 13:56:48. Attached is the log from yesterday, up to the point sb03-11 was shut down. Hope it helps you understand what's going on here. (Warning: 14GB)

Created attachment 26724 [details]
slurmctld log with debug patch
Jurij, Could you please check: >sacct -D -j49921463 -o AllocTRES,ReqTRES,Start,End,NodeLi -p and >sacctmgr show event Nodes=sb03-11 Start=<TimeBeforeJobStart> End=<TimeAfterJobEnd> cheers, Marcin Can you also attach earlier logs - starting when job 49921463 was allocated resources? Jurij, Are you able to upload the logs from slurmctld starting at the time when job 49921463 was allocated resources? cheers, Marcin Sorry I was OoO for a while and those logs slipped past ... But I have a new case with hopefully everything you need. It starts with slurmctld.log: [2022-10-23T17:33:11.651] gres/gpu: state for sb03-14 [2022-10-23T17:33:11.651] gres_cnt found:TBD configured:1 avail:1 alloc:0 [2022-10-23T17:33:11.651] gres_bit_alloc: of 1 [2022-10-23T17:33:11.651] gres_used:(null) [2022-10-23T17:33:11.651] type[0]:A100(808464705) [2022-10-23T17:33:11.651] type_cnt_alloc[0]:0 [2022-10-23T17:33:11.651] type_cnt_avail[0]:1 [2022-10-23T17:33:11.651] gres/gpu_mem: state for sb03-14 [2022-10-23T17:33:11.651] gres_cnt found:TBD configured:0 avail:0 alloc:0 [2022-10-23T17:33:11.651] gres_bit_alloc:NULL [2022-10-23T17:33:11.651] gres_used:(null) [2022-10-23T17:33:11.651] gres/tmp: state for sb03-14 [2022-10-23T17:33:11.651] gres_cnt found:TBD configured:404800667648 avail:404800667648 alloc:0 [2022-10-23T17:33:11.651] gres_bit_alloc:NULL [2022-10-23T17:33:11.651] gres_used:(null) And then few seconds later [2022-10-23T17:33:17.006] gres/gpu: state for sb03-14 [2022-10-23T17:33:17.006] gres_cnt found:TBD configured:1 avail:1 alloc:1 [2022-10-23T17:33:17.006] gres_bit_alloc:0 of 1 [2022-10-23T17:33:17.006] gres_used:(null) [2022-10-23T17:33:17.006] type[0]:A100(808464705) [2022-10-23T17:33:17.006] type_cnt_alloc[0]:0 [2022-10-23T17:33:17.006] type_cnt_avail[0]:1 [2022-10-23T17:33:17.006] gres/gpu_mem: state for sb03-14 [2022-10-23T17:33:17.006] gres_cnt found:TBD configured:0 avail:0 alloc:0 [2022-10-23T17:33:17.006] gres_bit_alloc:NULL [2022-10-23T17:33:17.006] gres_used:(null) 
[2022-10-23T17:33:17.006] gres/tmp: state for sb03-14 [2022-10-23T17:33:17.006] gres_cnt found:TBD configured:404800667648 avail:404800667648 alloc:0 [2022-10-23T17:33:17.006] gres_bit_alloc:NULL [2022-10-23T17:33:17.006] gres_used:(null) [2022-10-23T17:33:17.006] gres_job_state gres:gpu(7696487) type:(null)(0) job:54265158 flags: [2022-10-23T17:33:17.006] gres_per_node:1 node_cnt:1 [2022-10-23T17:33:17.006] ntasks_per_gres:65534 [2022-10-23T17:33:17.006] total_gres:1 [2022-10-23T17:33:17.006] node_cnt:1 [2022-10-23T17:33:17.006] gres_cnt_node_alloc[0]:1 [2022-10-23T17:33:17.006] gres_bit_alloc[0]:0 of 1 [2022-10-23T17:33:17.006] gres_cnt_step_alloc[0]:0 [2022-10-23T17:33:17.006] select/cons_res: select_p_select_nodeinfo_set: Alloc GRES [2022-10-23T17:33:17.007] gres_job_state gres:gpu(7696487) type:A100(808464705) job:54265158 flags: [2022-10-23T17:33:17.007] node_cnt:1 [2022-10-23T17:33:17.007] gres_cnt_node_alloc[0]:NULL [2022-10-23T17:33:17.007] gres_bit_alloc[0]: of 1 [2022-10-23T17:33:17.007] gres_bit_step_alloc[0]:NULL [2022-10-23T17:33:17.007] gres_cnt_step_alloc[0]:0 [2022-10-23T17:33:17.007] sched: Allocate JobId=54265158 NodeList=sb03-14 #CPUs=32 Partition=gpu-el8 My ipmi wrapper log then shows this node powering up: 2022-10-23 17:33:18 woke up sb03-14 When it comes up, slurmctld.log says: [2022-10-23T17:36:34.451] GRES: Node:sb03-14 Gres:gpu Type:A100 UniqueId:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_OPENCL,ENV_DEFAULT CPU_IDs:(null) CPU#:64 Count:1 Links:(null) [2022-10-23T17:36:34.451] GRES: Node:sb03-14 Gres:tmp Type:(null) UniqueId:(null) Flags: CPU_IDs:(null) CPU#:64 Count:404800667648 Links:(null) [2022-10-23T17:36:34.451] gres/gpu: state for sb03-14 [2022-10-23T17:36:34.451] gres_cnt found:1 configured:1 avail:1 alloc:1 [2022-10-23T17:36:34.451] gres_bit_alloc:0 of 1 [2022-10-23T17:36:34.451] gres_used:gpu:A100:0 [2022-10-23T17:36:34.451] topo[0]:A100(808464705) [2022-10-23T17:36:34.451] topo_core_bitmap[0]:NULL 
[2022-10-23T17:36:34.451] topo_gres_bitmap[0]:0 of 1
[2022-10-23T17:36:34.451] topo_gres_cnt_alloc[0]:1
[2022-10-23T17:36:34.451] topo_gres_cnt_avail[0]:1
[2022-10-23T17:36:34.451] type[0]:A100(808464705)
[2022-10-23T17:36:34.451] type_cnt_alloc[0]:0
[2022-10-23T17:36:34.451] type_cnt_avail[0]:1
[2022-10-23T17:36:34.451] gres/gpu_mem: state for sb03-14
[2022-10-23T17:36:34.451] gres_cnt found:0 configured:0 avail:0 alloc:0
[2022-10-23T17:36:34.451] gres_bit_alloc:NULL
[2022-10-23T17:36:34.451] gres_used:gpu_mem:0
[2022-10-23T17:36:34.451] gres/tmp: state for sb03-14
[2022-10-23T17:36:34.451] gres_cnt found:404800667648 configured:404800667648 avail:404800667648 alloc:0
[2022-10-23T17:36:34.451] gres_bit_alloc:NULL
[2022-10-23T17:36:34.451] gres_used:tmp:0
[2022-10-23T17:36:34.451] Node sb03-14 now responding

The job then finishes, but:

[2022-10-23T17:38:28.536] _job_complete: JobId=54265158 WEXITSTATUS 0
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1084: JobId=54265158, NodeName=sb03-14
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1106: Job and node have gres_bit_alloc
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1154: This is a never happen case now node_gres_ptr->gres_cnt_alloc: 1
[2022-10-23T17:38:28.548] GRES: _job_dealloc:1161: Job has gres_bit_alloc and node has a topo_gres_bitmap and topo_gres_cnt_alloc
[2022-10-23T17:38:28.548] gres/gpu: state for sb03-14
[2022-10-23T17:38:28.548] gres_cnt found:1 configured:1 avail:1 alloc:1
[2022-10-23T17:38:28.548] gres_bit_alloc:0 of 1
[2022-10-23T17:38:28.548] gres_used:(null)
[2022-10-23T17:38:28.548] topo[0]:A100(808464705)
[2022-10-23T17:38:28.548] topo_core_bitmap[0]:NULL
[2022-10-23T17:38:28.548] topo_gres_bitmap[0]:0 of 1
[2022-10-23T17:38:28.548] topo_gres_cnt_alloc[0]:1
[2022-10-23T17:38:28.548] topo_gres_cnt_avail[0]:1
[2022-10-23T17:38:28.548] type[0]:A100(808464705)
[2022-10-23T17:38:28.548] type_cnt_alloc[0]:0
[2022-10-23T17:38:28.548] type_cnt_avail[0]:1
[2022-10-23T17:38:28.548] gres/gpu_mem: state for sb03-14
[2022-10-23T17:38:28.548] gres_cnt found:0 configured:0 avail:0 alloc:0
[2022-10-23T17:38:28.548] gres_bit_alloc:NULL
[2022-10-23T17:38:28.548] gres_used:gpu_mem:0
[2022-10-23T17:38:28.548] gres/tmp: state for sb03-14
[2022-10-23T17:38:28.548] gres_cnt found:404800667648 configured:404800667648 avail:404800667648 alloc:0
[2022-10-23T17:38:28.548] gres_bit_alloc:NULL
[2022-10-23T17:38:28.548] gres_used:tmp:0
[2022-10-23T17:38:28.548] _job_complete: JobId=54265158 done

At this point the node is sitting idle but has gres/gpu allocated, so it can't take any new jobs. All that remains is for Slurm to power it off 1h later:

2022-10-23 18:40:37 shut down sb03-14

So it seems like we need to understand where that "never happen case" comes from... Hope that helps.

Jurij,

Could you please attach a full slurmctld log covering the complete time span of the last job's lifetime (from submit to complete)?

cheers,
Marcin

Created attachment 27410 [details]
10 minutes around job 54265158
Attached. Luckily the job was relatively short (17:33:10 to 17:38:28).
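As an aside for anyone hitting the same symptom: the failure signature quoted above can be searched for directly in the controller log. A minimal sketch — the log path is an assumption, adjust it to your SlurmctldLogFile setting:

```shell
# Search slurmctld's log for the "never happen case" line emitted by
# _job_dealloc when GRES deallocation goes wrong. LOG path is assumed.
LOG="${LOG:-/var/log/slurm/slurmctld.log}"
grep -n "This is a never happen case" "$LOG" 2>/dev/null \
    || echo "no occurrences found in $LOG"
```

Each hit marks a job completion after which the node may be left idle with gres/gpu still counted as allocated.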
Jurij,

Thanks for that! Could you please check if you can find a log message like:
>_slurm_rpc_submit_batch_job: JobId=54265158 InitPrio=[0-9]+ usec=[0-9]+
in earlier logs? (This is really when the job was submitted.) I'm really curious to see the whole job history from submission to complete.

Additionally, can you share:
>sacct -D -j54265158 -o AllocTRES,ReqTRES,Submit,Start,End,NodeLi -p
and
>sacctmgr show event Nodes=sb03-14 Start=2022-10-23T17:33:11 End=2022-10-23T17:38:28

In the output of sacct you'll have the job submit time in the 3rd column; could you please upload the logs starting from that time until the job completed, or from
>[2022-10-23T17:30:01.392] _job_complete: JobId=54637611_259(54638681) WEXITSTATUS 0
where the already-submitted log starts.

cheers,
Marcin

Jurij,

Please ignore the last request - I can reliably reproduce the issue.

cheers,
Marcin

Just wanted to let you know that I'm sending a patch fixing the issue to our QA queue.

cheers,
Marcin

Thanks, good to hear. Can you tell me whether the fix is specific to 21.08 or also applicable to 22.05? I'm planning to upgrade to 22.05 at the end of the year or in early January next year.

Keeping in mind that the patch hasn't passed our QA yet: it should be easy to apply locally on top of 21.08. However, we are going to merge it to 22.05 or even master - Slurm 23.02 to be.

cheers,
Marcin

Jurij,

While we're working on the patch I think I have a potential workaround for you. The node should work just fine if slurmd registers with slurmctld prior to being scheduled a job, so if you:
1. Drain those nodes.
2. Restart slurmctld (to clear AllocTRES).
3. Power on the nodes.
4. When you see that they are drained but not powered off according to scontrol show node, resume the nodes.
The initial registration of the nodes may be required after every slurmctld restart. However, after that Slurm should be able to correctly suspend/resume those nodes.

cheers,
Marcin

Any chance this patch lands in a Slurm release still this year? Maybe 22.05.7?
We usually schedule major cluster maintenance (including Slurm upgrades) at the end of each year. It would be nice to have a fix for this in time... Thanks.

>Any chance this patch lands in a slurm release still this year? Maybe 22.05.7?
We're targeting the fix at the master branch - Slurm 23.02 to be - so it's unlikely that the fix will happen on 22.05.

Besides the workaround from comment 84, I just wanted to check if you are aware of SuspendExcNodes[1]? It's obviously not a fix, but since the issue occurs only on a subset of nodes it may be worth considering.

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_SuspendExcNodes

Jurij,

We're still looking for the best way to address the issue. However, I wanted to share a configuration workaround that should be enough to mitigate it even on Slurm 21.08. Could you please try configuring those nodes as State=CLOUD and adding SlurmctldParameters=cloud_dns?

cheers,
Marcin

During our end-of-year maintenance we upgraded to Slurm 22.05.7. I also instructed users to switch from --gres=gpu... to --gpus N, --cpus-per-gpu, etc., and we haven't seen this issue yet. Will monitor things closely if it shows up again.

>I also instructed users to switch from --gres=gpu... to --gpus N
Just to be 100% clear: --gpus isn't a replacement for --gres=gpu. --gpus is a per-job request for a type/number of GPU devices, while --gres=gpu is a per-node specification. Resource selection for a job allocation is often easier to reason about, and better for scheduler performance, when the per-node specification is used.

>[...]Will monitor things closely if it shows up again.
Sounds good; however, from a code perspective it's still possible.

cheers,
Marcin

Thanks for clarifying this. The large majority of our jobs are single-node; we have only two or three MPI-capable codes in regular use, and those run fine over Ethernet. We just deployed our first InfiniBand rack and will be exploring more MPI-intensive multi-node codes this year.
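[Editor's note: Marcin's four-step drain/restart/resume workaround above can be condensed into a shell sketch. It is a dry run by default — SCONTROL resolves to echo unless overridden — and the node range is a hypothetical example, not taken from this site's configuration:]

```shell
# Dry-run sketch of the drain/restart/resume workaround from this thread.
# SCONTROL defaults to 'echo' so the commands are only printed; set
# SCONTROL=scontrol to run them for real. The node range is hypothetical.
SCONTROL="${SCONTROL:-echo}"
NODES="sb03-[11-16]"

# 1. Drain the affected nodes so no new jobs are scheduled on them.
$SCONTROL update NodeName="$NODES" State=DRAIN Reason="gres workaround"

# 2. Restart slurmctld to clear the stale AllocTRES (site-specific; e.g.
#    'systemctl restart slurmctld' on the controller host).

# 3. Power the nodes on out-of-band so slurmd registers with slurmctld
#    before any job is scheduled (this site uses an ipmi wrapper).

# 4. Once 'scontrol show node' reports the nodes drained but powered on,
#    resume them so they can accept jobs again.
$SCONTROL update NodeName="$NODES" State=RESUME
```

As Marcin notes, the initial registration may need to be repeated after every slurmctld restart.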
Just to confirm, the issue does indeed still happen. I've switched to the CLOUD config as suggested in comment #92. We now have a large batch of GPU jobs in the queue and so far things look fine.

Jurij,

I know it was a long bug, but I have good news to share. The patch for the reported issue is merged into our public repository. The most important commit is a54f9cffff[1]; since it requires protocol changes, it will be part of the Slurm 23.02 release. Let me know if you have any questions.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/a54f9cffff7aed32354d7eed696a50d6d8831666

Jurij,

I read no reply as no questions from your side. I'll go ahead and close the case now. Should you have any questions, please reopen.

cheers,
Marcin
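[Editor's note: for readers following along, the CLOUD-node mitigation adopted in this thread might look roughly like the slurm.conf sketch below. This is a hedged illustration, not the site's actual configuration — the node range is hypothetical, while the GPU count and CPU count are taken from the log excerpts earlier in the thread. The commented-out SuspendExcNodes line is Marcin's alternative suggestion of excluding the affected nodes from power saving.]

```
# Hypothetical slurm.conf fragment combining the suggestions from this
# thread; node names are illustrative.
SlurmctldParameters=cloud_dns

# Power-saving GPU nodes defined as CLOUD so they register cleanly with
# slurmctld on resume, before any job is scheduled on them.
NodeName=sb03-[11-16] State=CLOUD CPUs=64 Gres=gpu:A100:1

# Alternative mitigation: exclude the affected nodes from power saving.
#SuspendExcNodes=sb03-[11-16]
```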