For the past few hours, slurmctld in our environment has no longer been dispatching jobs to our "linlarge" partition, even though there are about 100 idle nodes available. Other partitions are still working fine. We haven't had any recent config change, and I haven't been able to find any relevant error in any log file so far. The only possible indication is in slurmctld.log, where _build_node_list started failing around that time (grep on linlarge):

[2022-03-02T10:50:01.025] sched: Allocate JobId=5259123 NodeList=gisath085 #CPUs=28 Partition=linlarge
[2022-03-02T10:50:11.246] sched: Allocate JobId=5259124 NodeList=gisath325 #CPUs=28 Partition=linlarge
[2022-03-02T10:50:26.571] _build_node_list: No nodes satisfy JobId=5259126 requirements in partition linlarge
[2022-03-02T10:50:35.572] _build_node_list: No nodes satisfy JobId=5259127 requirements in partition linlarge

I tried a restart of slurmdbd, but no change. I'm hesitant to restart slurmctld unless absolutely required, as this could have a potential impact on job tracking. Can you please advise on further troubleshooting steps that can be taken? Thanks
Would you please attach the following?
> slurmctld.log
> slurm.conf
> sinfo
> squeue
> sprio
> sdiag output (please run this command 5 times, separated by 30 seconds):
> $ for i in {1..5}; do date; sdiag; sleep 30; done

Please also attach the output of "scontrol show job <jobID>" from a few of the top jobs that should be starting in this partition.
Created attachment 23692 [details] slurmctld log
Created attachment 23693 [details] slurm conf (part1)
Created attachment 23694 [details] slurm conf (part2)
Created attachment 23695 [details] sdiag
Created attachment 23696 [details] sinfo
Created attachment 23697 [details] sprio
Created attachment 23698 [details] squeue
Created attachment 23699 [details] job example 1
Created attachment 23700 [details] job example 2
Hello Jason - I believe I found the issue; you mentioned checking the jobs with highest priority, and the highest-priority job was asking for 30 cores, while all except 2 hosts in linlarge have only 28. I have now "drained" these 2 nodes which have >= 30 cores, and the controller immediately started dispatching other jobs. So for some reason this pending job may have blocked all other jobs from being scheduled. I have lowered the priority of this incident to 3, as jobs are being dispatched again. It would still be good to analyze whether we possibly have an issue with our configuration. Thanks, Patrick
the "blocking" job seems to have been 5259128
Patrick, that is good news. We will now analyze the issue more calmly. I'll be back ASAP. Regards, Carlos.
I can't see the definition for those nodes with 30 CPUs in the slurm.conf provided in Comments 4 and 5. What am I missing? I see JobId=5259128 from Comment 10 asking for "NumNodes=1 NumCPUs=30 NumTasks=1 CPUs/Task=30", but no node like that is in the provided config. Would you please tell us at least the node names for those 2 nodes? Thanks! Carlos.
The 2 nodes in question are gisath[367-368], which are included in the linlarge partition (they have 32 cores each).
Ahh yes, my fault. I didn't see them. Sorry. Thanks!
Wait, the sinfo says:

linlarge up infinite 3 mix gisath[339,367-368]

But the last jobs for 367 and 368 are:

[2022-03-01T17:32:17.784] sched: Allocate JobId=5254538 NodeList=gisath368 #CPUs=1 Partition=linlarge
[2022-03-02T09:52:14.695] sched: Allocate JobId=5257180 NodeList=gisath367 #CPUs=1 Partition=linlarge

I can't see those jobs nor those nodes in the squeue, so I can't see why the nodes are in MIX state (partly free, partly ???). Is there any possibility to look at those jobs and see if we can get the job script, or whether they were interactive, etc.? What details does the accounting have for those 2 jobs? Send us the output from:

sacct -lP -j 5254538,5257180

Thanks, Carlos.
Created attachment 23710 [details] sacct info comment 18
As far as I can see, the 2 jobs you reference were standard batch jobs. If I recall correctly, when I checked these nodes at the time we saw the issue, they were running multi-node spanning jobs. In the squeue you see job 5254248 on 367 and job 5254301 on 368.
Created attachment 23711 [details] sacct info comment 18 and 20
Ah yes, now it makes sense:

5254248 linlarge execute. aa92620 R 1-00:38:58 3 gisath[093,187,367]
[2022-03-01T17:04:16.183] sched: Allocate JobId=5254248 NodeList=gisath[093,187,367] #CPUs=84 Partition=linlarge
--
5254301 linlarge execute. aa92620 R 22:03:50 3 gisath[100,320,368]
[2022-03-01T19:39:24.413] sched: Allocate JobId=5254301 NodeList=gisath[100,320,368] #CPUs=84 Partition=linlarge

I can't see any strange job then; sacct seems correct. So I'm going to try to reproduce the issue locally by replicating your config as closely as possible.

Additionally, I see the slurmctld is not very loaded, so you could try to reproduce the issue again for a short while, and then issue:

scontrol setdebug debug2
scontrol setdebugflags +Accrue
scontrol setdebugflags +NodeFeatures
scontrol setdebugflags +Priority
scontrol setdebugflags +Reservation

to spot any hint about the jobs that are stuck versus the ones that run. Then send us back the extract from the log covering the time the debug was enabled. To reset everything back to production settings, issue:

scontrol setdebugflags -Reservation
scontrol setdebugflags -Priority
scontrol setdebugflags -NodeFeatures
scontrol setdebugflags -Accrue
scontrol setdebug 0

As a side note, you are running version 20.11, so I need to test this with 21.08 as well. Cheers, Carlos.
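To carve out just the log window where the extra debugging was enabled, an awk range match works; this is only a sketch, with an illustrative log path and example timestamps that must be adapted to your site:

```shell
# Sketch: extract slurmctld.log lines between two timestamps.
# Path and timestamp prefixes below are examples, not from this ticket.
awk '/^\[2022-03-03T10:00/,/^\[2022-03-03T10:30/' \
    /var/log/slurm/slurmctld.log > debug_extract.log
```

The range pattern starts printing at the first line matching the opening timestamp prefix and stops after the first line matching the closing one.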
fyi. due to job backlog our cluster is currently pretty much 100% allocated so I won't be able to do any troubleshooting (at least today).
Patrick, I've found this to happen due to:

#SchedulerType=sched/backfill # needs users to specify runtime!!!!
SchedulerType=sched/builtin

Since you aren't using backfill, the highest-priority job is blocking the queue because the nodes with 32 cores are in use. You should change this to SchedulerType=sched/backfill, because the builtin scheduler is pretty simple (strict FIFO) and has this limitation. This is what backfill is for: letting lower-priority or smaller jobs start, for a variety of reasons, while the head of the queue is stuck. Regards, Carlos.
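For reference, the change discussed above is a one-line edit in slurm.conf; the optional SchedulerParameters line below is a sketch showing example backfill tuning knobs from the slurm.conf documentation, with purely illustrative values:

```
# slurm.conf -- scheduler section (sketch)
#SchedulerType=sched/builtin
SchedulerType=sched/backfill
# Optional tuning (values illustrative, not a recommendation):
#   bf_window   = how far into the future backfill plans, in minutes
#   bf_continue = let backfill resume scanning after releasing locks mid-cycle
#SchedulerParameters=bf_window=1440,bf_continue
```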
Hello Carlos - it wasn't clear to me that sched/builtin effectively disables scheduling beyond the head of the queue and only considers one job at a time (blocking FIFO). Our understanding was that sched/backfill would require the job duration to be specified at submission time (--time) in order to do backfilling, and this is not done in our environment. Can you confirm that changing to sched/backfill will still work correctly without a job duration specified at submission? Thanks
Patrick, I'm using a 20.11 testbed and I've been able to reproduce your issue with jobs submitted without a timelimit under the builtin scheduler. I've also been able to run jobs without a timelimit under the backfill scheduler, where the issue does not happen. Regardless of this, *yes*, backfill works better if you specify a timelimit. Take a look at [1], [2], [3] for more information. But yes, builtin can block the queue, as you said. In any case, if you don't switch to backfill, the issue can't be worked around with the cluster configured the way it is. Another option is to put the 32-core nodes in a separate partition. Cheers, Carlos.

[1] https://slurm.schedmd.com/faq.html#pending
[2] https://slurm.schedmd.com/faq.html#backfill
[3] https://slurm.schedmd.com/sched_config.html#backfill
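If you'd rather not force users to pass --time, one option is a partition-level DefaultTime, so backfill always has a wall-clock estimate to plan with. A sketch only; the node range and time values here are illustrative, not your actual config:

```
# slurm.conf partition line (sketch; node range and times are examples)
PartitionName=linlarge Nodes=gisath[001-368] MaxTime=7-00:00:00 DefaultTime=04:00:00 State=UP
```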
OK, I guess we'll have to schedule a config change then to set the scheduler back to backfill; I believe this requires a restart of slurmctld? We could possibly also combine it with an upgrade to 21.08 later this year.

FYI, the issue actually re-occurred last night, even with these 2 special nodes being offline (drained). As per the documentation, this should actually allow sched/builtin to continue scheduling other jobs ["An exception is made for jobs that can not run due to partition constraints (e.g. the time limit) or down/drained nodes."]. It looks like this exception only worked while the nodes were "draining", but not once they were "drained".
> the issue actually re-occurred last night, even with these 2 special nodes being offline (drained).

Do you have the information about this at hand?

scontrol show job
squeue
sinfo
slurmctld.log

I want to take a look at the job that was blocking the queue and at the whole cluster/queue status. Thanks, Carlos.
Created attachment 23729 [details] updated slurmctld
Created attachment 23730 [details] sprio blocking job
I've attached the updated slurmctld.log and sprio output; unfortunately I don't have the other logs from the time of the issue. I was able to unblock the situation by just resuming one of the nodes with 32 cores. As a temporary solution, I will take the 2 extra nodes out of this partition once they are free of jobs. Thanks
Thank you, let's see if I can reproduce this as well locally.
By now I can't reproduce what you experienced last night. My guess is it wasn't happening in another partition with other 32-core nodes, but again in linlarge; if it had been another partition, though, it would make sense, because that partition would still have had its 32-core nodes online. I can't see anything strange in the log, and since I don't have the squeue output from that moment, I can't learn more from it for now. Cheers, Carlos.
Hello Carlos - the repeated issue was again with linlarge; I agree that this doesn't seem to make sense, as the extra nodes were closed (and closing them initially fixed the issue). We have now taken these extra nodes out of the linlarge partition as a temporary workaround until we can get a scheduled downtime to change the scheduler to backfill. Do you see any other configuration that would need to be added or modified if we switch from sched/builtin to sched/backfill? Thanks
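The separate-partition option mentioned in Comment 26 would look roughly like this in slurm.conf; a sketch only, where the partition name "linxlarge" and the node ranges are illustrative, not your real layout:

```
# slurm.conf (sketch): keep the 28-core nodes in linlarge,
# move the 32-core nodes to their own partition
PartitionName=linlarge  Nodes=gisath[001-366] State=UP
PartitionName=linxlarge Nodes=gisath[367-368] State=UP
```

This keeps a wide job (e.g. 30+ CPUs on one node) from ever sitting at the head of the linlarge queue.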
Hi Patrick, I think the best you can do is to read through these docs:

https://slurm.schedmd.com/SUG14/sched_tutorial.pdf
https://slurm.schedmd.com/sched_config.html#backfill
https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters

And if you have any specific doubt afterwards, I'll be glad to help. Cheers, Carlos.
Hi Patrick, For now, if you don't have any further questions, I'm going to close the issue as "info given". If you experience any problem after switching to backfill, feel free to reopen it. Regards, Carlos.