Ticket 13549 - All jobs in primary partition "linlarge" stay pending with Reason: Priority
Summary: All jobs in primary partition "linlarge" stay pending with Reason: Priority
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Carlos Tripiana Montes
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-03-02 09:18 MST by Patrick
Modified: 2022-03-09 08:09 MST (History)
0 users

See Also:
Site: Goodyear
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log (3.40 MB, application/x-gzip)
2022-03-02 09:52 MST, Patrick
Details
slurm conf (part1) (5.15 KB, text/plain)
2022-03-02 09:53 MST, Patrick
Details
slurm conf (part2) (3.07 KB, text/plain)
2022-03-02 09:53 MST, Patrick
Details
sdiag (71.10 KB, text/plain)
2022-03-02 09:53 MST, Patrick
Details
sinfo (4.55 KB, text/plain)
2022-03-02 09:54 MST, Patrick
Details
sprio (8.80 KB, text/plain)
2022-03-02 09:54 MST, Patrick
Details
squeue (17.97 KB, text/plain)
2022-03-02 09:54 MST, Patrick
Details
job example 1 (1.17 KB, text/plain)
2022-03-02 09:55 MST, Patrick
Details
job example 2 (1.21 KB, text/plain)
2022-03-02 09:55 MST, Patrick
Details
sacct info comment 18 (4.76 KB, text/plain)
2022-03-03 06:29 MST, Patrick
Details
sacct info comment 18 and 20 (8.77 KB, text/plain)
2022-03-03 06:45 MST, Patrick
Details
updated slurmctld (825.79 KB, application/x-gzip)
2022-03-04 01:36 MST, Patrick
Details
sprio blocking job (39.29 KB, text/plain)
2022-03-04 01:38 MST, Patrick
Details

Description Patrick 2022-03-02 09:18:40 MST
For the past few hours, the slurmctld in our environment has no longer been dispatching jobs to our "linlarge" partition, even though there are about 100 idle nodes available.
Other partitions are still working fine.
We haven't had any config change recently, and I haven't been able to find any relevant error in any log file so far.

The only possible indication is in the slurmctld log, where _build_node_list started failing around that time (grep for linlarge):

[2022-03-02T10:50:01.025] sched: Allocate JobId=5259123 NodeList=gisath085 #CPUs=28 Partition=linlarge
[2022-03-02T10:50:11.246] sched: Allocate JobId=5259124 NodeList=gisath325 #CPUs=28 Partition=linlarge
[2022-03-02T10:50:26.571] _build_node_list: No nodes satisfy JobId=5259126 requirements in partition linlarge
[2022-03-02T10:50:35.572] _build_node_list: No nodes satisfy JobId=5259127 requirements in partition linlarge

I tried restarting slurmdbd, but there was no change.
I'm hesitant to restart slurmctld unless absolutely required, as this could impact job tracking.

Can you please advise on further troubleshooting steps that can be taken?
thanks
Comment 1 Jason Booth 2022-03-02 09:38:12 MST
Would you please attach the following?
> slurmctld.log
> slurm.conf
> sinfo
> squeue
> sprio
> sdiag output (please run this command 5 times, separated by 30 seconds):
> $ for i in {1..5}; do date; sdiag; sleep 30; done

Please also attach the output of "scontrol show job <jobID>" from a few of the top jobs that should be starting in this partition.
Comment 3 Patrick 2022-03-02 09:52:43 MST
Created attachment 23692 [details]
slurmctld log
Comment 4 Patrick 2022-03-02 09:53:11 MST
Created attachment 23693 [details]
slurm conf (part1)
Comment 5 Patrick 2022-03-02 09:53:25 MST
Created attachment 23694 [details]
slurm conf (part2)
Comment 6 Patrick 2022-03-02 09:53:49 MST
Created attachment 23695 [details]
sdiag
Comment 7 Patrick 2022-03-02 09:54:12 MST
Created attachment 23696 [details]
sinfo
Comment 8 Patrick 2022-03-02 09:54:31 MST
Created attachment 23697 [details]
sprio
Comment 9 Patrick 2022-03-02 09:54:52 MST
Created attachment 23698 [details]
squeue
Comment 10 Patrick 2022-03-02 09:55:30 MST
Created attachment 23699 [details]
job example 1
Comment 11 Patrick 2022-03-02 09:55:53 MST
Created attachment 23700 [details]
job example 2
Comment 12 Patrick 2022-03-02 10:03:00 MST
Hello Jason - I believe I found the issue:
you mentioned checking the jobs with the highest priority, and the top-priority job was asking for 30 cores, while all but 2 hosts in linlarge have only 28.

I have now drained the 2 nodes that have >= 30 cores, and the controller immediately started dispatching other jobs.

So for some reason this pending job seems to have blocked all other jobs from being scheduled.

I've lowered the priority of this incident to 3 since jobs are dispatching again.
It would still be good to analyze if we have possibly some issue with our configuration.

thanks
Patrick
Comment 13 Patrick 2022-03-02 10:04:37 MST
the "blocking" job seems to have been 5259128
Comment 14 Carlos Tripiana Montes 2022-03-02 10:09:34 MST
Patrick,

that is good news. We can now analyze the issue more calmly. I'll get back to you ASAP.

Regards,
Carlos.
Comment 15 Carlos Tripiana Montes 2022-03-03 04:55:05 MST
I can't find, in the slurm.conf provided in comments 4 and 5, the definition for nodes with 30 CPUs.

What am I missing? I see JobId=5259128 from comment 10 asking for "NumNodes=1 NumCPUs=30 NumTasks=1 CPUs/Task=30", but no node in the provided config matches that.

Could you at least tell me the node names of those 2 nodes?

Thanks!
Carlos.
Comment 16 Patrick 2022-03-03 05:21:39 MST
The 2 nodes in question are gisath[367-368] which are included in the linlarge partition. (they have 32 cores each)
Comment 17 Carlos Tripiana Montes 2022-03-03 06:06:24 MST
Ahh yes, my fault. I didn't see them. Sorry. Thanks!
Comment 18 Carlos Tripiana Montes 2022-03-03 06:23:47 MST
Wait, the sinfo says:

linlarge           up   infinite      3    mix gisath[339,367-368]

But the last jobs for 367 and 368 are:

[2022-03-01T17:32:17.784] sched: Allocate JobId=5254538 NodeList=gisath368 #CPUs=1 Partition=linlarge

[2022-03-02T09:52:14.695] sched: Allocate JobId=5257180 NodeList=gisath367 #CPUs=1 Partition=linlarge

I can't see those jobs or those nodes in the squeue output, and I can't see why the nodes are in MIX state (partly allocated, partly ???).

Is there any way to look at those jobs and see if we can get the job script, whether they were interactive, etc.? What details does the accounting hold for those 2 jobs? Please send us the output of: sacct -lP -j 5254538,5257180.

Thanks,
Carlos.
Comment 19 Patrick 2022-03-03 06:29:13 MST
Created attachment 23710 [details]
sacct info comment 18
Comment 20 Patrick 2022-03-03 06:32:14 MST
As far as I can see, the 2 jobs you reference were standard batch jobs.
If I recall correctly, when I checked these nodes during the issue they were running multi-node jobs.

In the squeue output you can see job 5254248 on gisath367 and job 5254301 on gisath368.
Comment 21 Patrick 2022-03-03 06:45:35 MST
Created attachment 23711 [details]
sacct info comment 18 and 20
Comment 22 Carlos Tripiana Montes 2022-03-03 06:56:23 MST
Ah yes, now it makes sense:

5254248  linlarge execute.  aa92620  R 1-00:38:58      3 gisath[093,187,367]

[2022-03-01T17:04:16.183] sched: Allocate JobId=5254248 NodeList=gisath[093,187,367] #CPUs=84 Partition=linlarge

--

5254301  linlarge execute.  aa92620  R   22:03:50      3 gisath[100,320,368]

[2022-03-01T19:39:24.413] sched: Allocate JobId=5254301 NodeList=gisath[100,320,368] #CPUs=84 Partition=linlarge

I can't see any strange job then; the sacct output looks correct. So I'm going to try to reproduce the issue locally by replicating your config as closely as possible.

Additionally, I see that slurmctld is not very loaded, so you could try to reproduce the issue again for a short while and then issue:

scontrol setdebug debug2
scontrol setdebugflags +Accrue
scontrol setdebugflags +NodeFeatures
scontrol setdebugflags +Priority
scontrol setdebugflags +Reservation

This should surface hints about the jobs that are stuck and the ones that run. Then send us back the extract of the log covering the time debug was enabled.

To reset everything back to prod, issue:

scontrol setdebugflags -Reservation
scontrol setdebugflags -Priority
scontrol setdebugflags -NodeFeatures
scontrol setdebugflags -Accrue
scontrol setdebug 0

As a side note, you are running version 20.11, so I need to test this with 21.08 as well.

Cheers,
Carlos.
Comment 23 Patrick 2022-03-03 07:00:38 MST
FYI: due to the job backlog, our cluster is currently close to 100% allocated, so I won't be able to do any troubleshooting (at least not today).
Comment 24 Carlos Tripiana Montes 2022-03-03 08:25:18 MST
Patrick,

I've found this to happen due to:

#SchedulerType=sched/backfill  # needs users to specify runtime!!!!
SchedulerType=sched/builtin

Since you aren't using backfill, the highest-priority job is blocking the queue because the nodes with 32 cores are in use.

You should change this to SchedulerType=sched/backfill; the builtin scheduler is a strict FIFO and has this limitation.

This is what backfill is for: allowing lower-priority or smaller jobs to start, for a variety of reasons, while the head of the queue is stuck.
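To illustrate the difference, here is a toy sketch (not Slurm code; the job IDs and core counts echo this ticket). The builtin scheduler stops at the first job it cannot place, while backfill skips it and keeps walking the queue:

```python
# Toy model of sched/builtin (strict FIFO) vs. sched/backfill.
# This ticket's situation: the top-priority job wants 30 cores,
# but almost every node in the partition has only 28.

def schedule(jobs, free_nodes, backfill):
    """jobs: list of (job_id, cores) in priority order.
    free_nodes: list of per-node free core counts.
    Returns the job ids that get started."""
    started = []
    for job_id, cores in jobs:
        node = next((i for i, c in enumerate(free_nodes) if c >= cores), None)
        if node is not None:
            free_nodes[node] -= cores
            started.append(job_id)
        elif not backfill:
            break  # builtin: a stuck head-of-queue job blocks everything
        # backfill: skip the stuck job, keep considering lower-prio jobs
    return started

# 100 idle 28-core nodes; the 32-core nodes are busy elsewhere.
jobs = [(5259128, 30), (5259126, 28), (5259127, 28)]
print(schedule(jobs, [28] * 100, backfill=False))  # [] - queue blocked
print(schedule(jobs, [28] * 100, backfill=True))   # [5259126, 5259127]
```

Real backfill additionally computes expected start times and only starts a lower-priority job if doing so will not delay the blocked one, which is why specifying time limits helps it.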

Regards,
Carlos.
Comment 25 Patrick 2022-03-04 01:04:52 MST
Hello Carlos - 
it wasn't clear to me that sched/builtin effectively serializes the queue, considering only one job at a time (strict FIFO).
Our understanding was that sched/backfill would require a job duration to be specified at submission time (-t/--time) in order to do backfilling, and this is not done in our environment.

Can you confirm that changing to sched/backfill will still work correctly without a job duration specified at submission?
thanks
Comment 26 Carlos Tripiana Montes 2022-03-04 01:16:54 MST
Patrick,

On a 20.11 testbed I've been able to reproduce your issue with jobs that have no time limit specified, using the builtin scheduler.

I've also been able to run jobs without a time limit using the backfill scheduler, and there the issue does not happen.

Regardless, *yes*, backfill works better if you specify a time limit.

Take a look at [1], [2], [3] for more information. But yes, builtin can block the queue, as you said. In any case, if you don't switch to backfill, the issue can't be worked around given the way the cluster is configured.

Another option is to put the 32-core nodes in a separate partition.
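Either change is a small slurm.conf edit. A sketch (the SchedulerParameters values are hypothetical examples, not recommendations):

```
# Switch from the builtin FIFO scheduler to backfill:
SchedulerType=sched/backfill
# Optional, hypothetical tuning - see the SchedulerParameters docs:
# SchedulerParameters=bf_interval=30,bf_max_job_test=1000

# Alternative workaround: move the 32-core nodes (gisath[367-368] in
# this ticket) into their own partition, so a 30-core job can never
# sit at the head of the linlarge queue.
```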

Cheers,
Carlos.

[1] https://slurm.schedmd.com/faq.html#pending
[2] https://slurm.schedmd.com/faq.html#backfill
[3] https://slurm.schedmd.com/sched_config.html#backfill
Comment 27 Patrick 2022-03-04 01:27:11 MST
OK, I guess we'll have to schedule a config change then to set the scheduler to backfill; I believe this requires a restart of slurmctld?
We could possibly combine it with an upgrade to 21.08 later this year.

FYI, the issue actually re-occurred last night, even with these 2 special nodes being offline (drained).
Per the documentation, this should actually allow sched/builtin to continue scheduling other jobs ["An exception is made for jobs that can not run due to partition constraints (e.g. the time limit) or down/drained nodes."].

It looks like this exception only worked while the nodes were "draining" but not once they were "drained".
Comment 28 Carlos Tripiana Montes 2022-03-04 01:32:12 MST
> the issue actually re-occurred last night, even with these 2 special nodes being offline (drained).

Do you have the information about this at hand?

scontrol show job
squeue
sinfo
slurmctld.log

I want to take a look at the job that was blocking the queue and the whole cluster/queue status.

Thanks,
Carlos.
Comment 29 Patrick 2022-03-04 01:36:56 MST
Created attachment 23729 [details]
updated slurmctld
Comment 30 Patrick 2022-03-04 01:38:22 MST
Created attachment 23730 [details]
sprio blocking job
Comment 31 Patrick 2022-03-04 01:41:23 MST
I've attached the updated slurmctld.log and sprio output; unfortunately I don't have the other logs from the time of the issue.

I was able to unblock the situation by simply resuming one of the nodes with 32 cores.
As a temporary solution, I will remove the 2 extra nodes from this partition once they are free of jobs.
thanks
Comment 32 Carlos Tripiana Montes 2022-03-04 01:43:24 MST
Thank you, let's see if I can reproduce this as well locally.
Comment 33 Carlos Tripiana Montes 2022-03-04 01:52:27 MST
So far I can't reproduce what you experienced last night.

I assume it was again happening in linlarge, not in another partition with other 32-core nodes. If it had been another partition, that would make sense, since other 32-core nodes were still online there.

I can't see anything strange in the log, and since I don't have the squeue output from that moment, I can't learn more from it for now.

Cheers,
Carlos.
Comment 34 Patrick 2022-03-07 08:51:34 MST
Hello Carlos - the repeated issue was again with linlarge;
I agree that this doesn't seem to make sense, as the extra nodes were closed (and closing them initially fixed the issue).

We have now removed these extra nodes from the linlarge partition as a temporary workaround until we can get a scheduled downtime to change the scheduler to backfill.
Do you see any other configuration that would need to be added or modified when we switch from sched/builtin to sched/backfill?

thanks
Comment 35 Carlos Tripiana Montes 2022-03-08 01:05:41 MST
Hi Patrick,

I think the best you can do is to read through these docs:

https://slurm.schedmd.com/SUG14/sched_tutorial.pdf
https://slurm.schedmd.com/sched_config.html#backfill
https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters

And if you have any specific questions afterwards, I'll be glad to help.

Cheers,
Carlos.
Comment 36 Carlos Tripiana Montes 2022-03-09 08:09:27 MST
Hi Patrick,

If you don't have any further questions, I'm going to close this ticket as info given. If you experience any problems after switching to backfill, feel free to reopen it.

Regards,
Carlos.