| Summary: | Reason not being refreshed to be reflective of true reason | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Anthony DelSorbo <anthony.delsorbo> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | issp2020support, sts |
| Version: | 20.02.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| URL: | We | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10069 | ||
| Site: | NOAA | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | NESCC | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.02.6, 20.11pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | hera slurm.conf; sdiag output - about 15 min since `sdiag -r` | | |
Description
Anthony DelSorbo
2020-09-02 08:26:28 MDT
Created attachment 15692
hera slurm.conf

Comment 2
Marcin Stolarek
Tony,
It looks like the job reason wasn't updated when you removed the reservation - I'll take a look at the code to see how we can improve that.
Another thing is that those jobs are not reached by the backfill scheduler. Could you please change your SchedulerParameters to be:
>SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60
and call `scontrol reconfigure` and `sdiag -r`.
After a few minutes, let's say 10, please execute `sdiag` (without any option) and share the results with me.
cheers,
Marcin
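
Marcin's suggested change can be sketched as a short shell session. The parameter values are exactly those from this ticket; the sleep duration is only an illustration of the "after a few minutes, let's say 10" step, and the commented commands must be run on the slurmctld host:

```shell
# Line to place in slurm.conf (replacing the current SchedulerParameters):
NEW_PARAMS="SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60"
echo "$NEW_PARAMS"

# Then, on the slurmctld host:
# scontrol reconfigure      # pick up the slurm.conf change without a restart
# sdiag -r                  # reset scheduler statistics to a clean baseline
# sleep 600 && sdiag        # read the accumulated stats after ~10 minutes
```

Resetting with `sdiag -r` first matters: it makes the subsequent `sdiag` output reflect only the window after the change, rather than counters accumulated since slurmctld started.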
(In reply to Marcin Stolarek from comment #2)

Marcin,

Thanks for the quick reply!

> Another thing is that those jobs are not reached by the backfill scheduler.
> Could you please change your SchedulerParameters to be:
> > SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60
> and call `scontrol reconfigure` and `sdiag -r`.
> After a few minutes, let's say 10, please execute `sdiag` (without any option)
> and share the results with me.

I suspected the backfill scheduler depth would be the answer to my query. However, since this is a production system, I am hesitant to make any changes, since I don't know what effect they will have on overall scheduling and system performance. Would you provide more guidance on this? Perhaps some pros and cons that I can include in my justification for a change request that management can review? I'd appreciate any insight you may have.

Best,
Tony.

OK, let's just share `sdiag` first. In your current config the backfill depth is limited by the default value of bf_max_job_test=100. Another parameter like that is bf_interval, which defines the frequency (and, by default, a time limit) of each backfill iteration. Increasing it generally makes the system more responsive to other RPCs, but means that jobs won't be backfilled more frequently than the defined value. We'll probably have to add bf_continue as well to reach the whole queue - this switch allows backfill to continue working through the queue after releasing locks. You may find a detailed explanation of those under the SchedulerParameters section[1] of man slurm.conf.

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters

Created attachment 15697
sdiag output - about 15 min since sdiag -r
Tony,

Based on the sdiag output I'd suggest increasing bf_max_job_test to 500 and adding bf_continue to your scheduler parameters as a general recommendation. However, after some additional code analysis I see that it won't change the pending jobs' reason. From what I see, those jobs are still evaluated by both the backfill and main schedulers, but the reason is not updated. I'll have to check the details to see how we can improve that. I'll keep you posted on the progress.

cheers,
Marcin

PS. As a workaround, you can call `scontrol release JOBID` on those jobs. After that, they will be displayed with the "(None)" reason, which will be updated in the next scheduler cycle.

(In reply to Marcin Stolarek from comment #7)
> PS. As a workaround, you can call `scontrol release JOBID` on those jobs.
> After that, they will get displayed with "(None)" reason which will be
> updated in next scheduler cycle.

Marcin,

Thanks for the update. It appears the queue got short enough overnight for the backfill to reach all the jobs. Thanks for the workaround; I will try that the next time this happens.

It seems the default bf_interval is 30 min. That seems awfully high. While "Higher values result in less overhead..." - and I understand that - how long does it typically take to run that part of the scheduling code? I realize that depends on the number of jobs it has to process, but is the time on the order of seconds or minutes? I'm considering bringing that value down to 5 min from the default. Would that be excessive? And what could I monitor to determine how effective or impactful that change is?

sdiag seems to provide a lot of good info, but I suspect that's not all. Is there a way to get a more comprehensive set of stats? While I could do an `sdiag -r` followed by an `sdiag`, say, 5 min later to get near-real-time data, is there a better way to do this in order to analyze scheduler behavior? What is the impact of doing that?

Best,
Tony.

Tony,

> It seems the default bf_interval is 30 min.

No, it's 30 seconds.
> how long does it typically take to run that part of the scheduling code?

That really depends on too many factors to give any meaningful answer.

> What could I monitor to determine how effective or impactful that change is?

Last/Mean/Max cycle. It's the number of microseconds spent in backfill; this time doesn't include the periods when backfill stops itself to relinquish locks and allow other operations to be processed. That behavior is controlled by the bf_yield_interval=# and bf_yield_sleep=# parameters. By default, backfill stops every 2 seconds for 0.5 second, which means that for a bf_interval of 30 seconds the cycle time (as shown by sdiag) tops out around 30*(1 - 0.5/2) ≈ 22.5 seconds. If you see backfill reaching this value, it means it stopped further processing of jobs because of the time limit; in other words, it didn't go deeper into the queue because bf_interval was too low.

> sdiag seems to provide a lot of good info. I suspect that's not all. Is there
> a way to get a more comprehensive set of stats? While I could do an sdiag -r
> followed by an sdiag say 5 min later, to get near-real time data, is there a
> better way to do this in order to analyze scheduler behavior? What is the
> impact of doing that?

Calling sdiag has minimal impact. Some sites execute it periodically and store the returned values in a database with visualization frontends like ELK or TICK. This info can also be retrieved from slurmrestd[1]. There are also many open-source projects you can find easily using the "slurm dashboard" keywords; I'm not mentioning any of them specifically, since I don't have enough experience with those tools to recommend one. The equivalent tool for slurmdbd is `sacctmgr show stats`.

Looking at your config one more time, I've noticed that your longest allowed time limit is 30 hours.
You should adjust bf_window to twice that value (it's in minutes), so you need bf_window=3600.

cheers,
Marcin

[1] https://slurm.schedmd.com/SLUG19/REST_API.pdf

Tony,

We improved the handling of the "ReqNodeNotAvail, Reserved for maintenance" state reason: it is now cleared automatically when the reservation ends, so it should switch to "None" and then get set by the next scheduler iteration. Changed by 3d6902ebe9d75[1], which will be released in Slurm 20.02.6.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/3d6902ebe9d757f752dd57a1cca69db08d4b956a

*** Ticket 10331 has been marked as a duplicate of this ticket. ***
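
The two numbers quoted in this thread can be checked with quick shell arithmetic. This is only a back-of-the-envelope sketch using the defaults Marcin states (bf_yield_interval=2 s, bf_yield_sleep=0.5 s) and the site's 30-hour maximum time limit:

```shell
# Upper bound on the backfill cycle time sdiag can report for bf_interval=30s:
# each 2s of work is followed by 0.5s of sleep, so roughly 0.5/2 of the
# interval is given up to lock yields.
awk 'BEGIN { printf "%.1f\n", 30 * (1 - 0.5 / 2) }'   # prints 22.5

# bf_window recommendation: twice the longest time limit (30 hours), in minutes.
echo $(( 2 * 30 * 60 ))                               # prints 3600
```

If a site's longest partition time limit changes, the same doubling rule gives the new bf_window value in minutes.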