Ticket 5534

Summary: Tunable to set behavior when dbd.msgs is at capacity: discard (current) or refuse to proceed (new)
Product: Slurm Reporter: S Senator <sts>
Component: Accounting   Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 5 - Enhancement    
Priority: --- CC: bsantos, fullop, jacob, mej
Version: 20.02.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=109
Site: LANL
Linux Distro: --- Machine Name: test cluster, fire
Version Fixed: 20.02.0-pre1
Target Release: 20.02 DevPrio: 1 - Paid

Description S Senator 2018-08-07 14:14:41 MDT
One of our summer projects was to characterize Slurm failure modes and suggest system administration procedures for when various Slurm components are unable to function correctly. The root cause could be any stimulus external to Slurm: an unexpected burst of external stress (power, cooling, etc.), a full file system at the location where dbd.msgs is stored, an unresponsive database, or an uncommunicative network link between components.

We found that there were cases where the slurmdbd discarded records. In our environment, oversight and accounting are a high priority. Discarding records is, at least in some cases, more severe than not running jobs.

We would like to request a tunable parameter which would be used to specify the appropriate behavior: "discard" as in the present policy vs. "refuse operations which would generate a record."

It is clear that some number of events would continue to trickle in for previously initiated job steps, jobs, and node states. However, it is likely that those could be reconstructed from their un-mated predecessor records, similar to the way that runaway job records are detected and subsequently repaired.

A secondary request, which may merit a separate ticket if the above is reclassified as purely an enhancement, would be to:
1) document the exact priority, thresholds and type of records which are discarded when the "discard" policy is in place
2) explicitly set the size of the dbd.messages spool rather than deriving it, emitting a warning and/or refusing to start if it appears to be configured too low
3) document that the spool is sized by MaxJobCount*2+NodeCount*4
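For illustration, the derived sizing in (3) works out as follows (a sketch of the formula stated above; the function name is ours, not Slurm's):

```python
def dbd_msgs_capacity(max_job_count: int, node_count: int) -> int:
    """Illustrative sketch of the derived dbd.messages spool size
    described in this ticket: MaxJobCount*2 + NodeCount*4."""
    return max_job_count * 2 + node_count * 4

# e.g. MaxJobCount=10000 on a 500-node cluster:
print(dbd_msgs_capacity(10000, 500))  # 22000
```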

Thank you.
Comment 2 Tim Wickberg 2018-08-20 19:23:40 MDT
Updating metadata on the ticket here.

> We would like to request a tunable parameter which would be used to specify
> the appropriate behavior: "discard" as in the present policy vs. "refuse
> operations which would generate a record."

Is this something you would like us to work up as an SoW?

> It is clear that some number of events would continue to trickle in for
> previously initiated job steps, jobs and node states. However, it is likely
> that those could be reconstructed from their un-mated predecessor records,
> similar to the way that runaway jobs records are detectable and subsequently
> repaired.
> 
> A secondary request, which may merit a separate ticket if the above is
> reclassified as purely an enhancement, would be to:
> 1) document the exact priority, thresholds and type of records which are
> discarded when the "discard" policy is in place
> 2) explicitly set the size of the dbd.messages spool rather than derive it,
> emitting a warning message and/or refuse if it appears to be configured too
> low
> 3) document that the spool is sized by MaxJobCount*2+NodeCount*4

This is mentioned already on:
https://slurm.schedmd.com/sdiag.html
Comment 3 S Senator 2018-08-21 08:48:12 MDT
Yes, please construct a SOW estimate, at least at the granularity of an initial scope of work. (easy/2 days vs. hard/1 month)
Comment 4 S Senator 2018-08-23 15:46:44 MDT
Although this was requested due to evidence on a test cluster, it appears to be manifesting in production on a secure cluster since earlier this week, running v17.02.10. Please consider reclassifying this as a non-enhancement.

This is manifesting as sacct errors where jobs are showing incorrect or unknown end times, which cause subsequent accounting and oversight reporting errors.
Comment 5 Tim Wickberg 2018-08-23 18:17:58 MDT
(In reply to S Senator from comment #4)
> Although this was requested due to evidence on a test cluster it appears to
> be manifesting in production on a secure cluster since earlier this week,
> running v.17.02.10. Please consider reclassifying this as a non-enhancement.

Even if support for the 17.02 release series were not expiring in eight days, our release process places a moratorium on adding features or configuration items to stable releases except in unusual circumstances.
 
The behavior you're describing is quite well established at this point, and will not change before 19.05.

If you wish to increase this size today, it is simple enough to increase MaxJobCount on the system and rely on that indirectly increasing the size of that state file.

> This is manifesting as sacct errors where jobs are showing incorrect or
> unknown end times, which cause subsequent accounting and oversight reporting
> errors.

Please open a separate ticket if you'd like to discuss this.
Comment 6 S Senator 2018-08-27 10:24:11 MDT
We're in the middle of a transition to 17.11 ourselves, so there's no need to discuss the 17.02 release. I did want you to be aware, though, that it was manifesting on actual production systems. We understand, and have already adjusted MaxJobCount. This DBD message is now specifically in our alerting system as well.
Comment 7 Tim Wickberg 2018-08-29 20:37:03 MDT
I learn something new every day... I think you could accomplish the shutdown portion of your request right now.

Look at the strigger man page, under the --primary_slurmctld_acct_buffer_full trigger. I believe you could use that to call 'scontrol shutdown' if desired.
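A sketch of that workaround (the script path and contents are our assumptions, not from the man page):

```shell
# Register a trigger that fires when the accounting (slurmdbd) message
# buffer on the primary slurmctld fills, running a script that shuts
# the controller down rather than letting records be discarded.
strigger --set --primary_slurmctld_acct_buffer_full \
    --program=/usr/local/sbin/dbd_buffer_full.sh

# Where dbd_buffer_full.sh (illustrative path) contains:
#   #!/bin/sh
#   scontrol shutdown
```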
Comment 8 S Senator 2018-11-06 17:09:44 MST
Per this morning's teleconference call with da@schedmd.com, please revisit this issue and consider classifying it as a bug, given that the default action is to discard messages in some circumstances.

We do intend to use the 'strigger' mechanism supplied by timw until this issue is addressed.
Comment 11 S Senator 2018-11-20 16:21:42 MST
Please provide a Statement of Work so that we can direct this task to the LANL/SchedMD contract.
Comment 12 Danny Auble 2018-12-10 15:50:44 MST
Sorry for the delay Steve, the SOW is in the final stages right now.  We will get it to you soon.
Comment 19 Danny Auble 2019-10-23 14:51:13 MDT
Steve, this has been added to our master branch (20.02) in commit 9a0010b132ab7c.  Please go test and see what you think.

The option to set the maximum number of DBD messages is

MaxDBDMsgs=#

The option to control what happens when it reaches this limit is

SlurmctldParameters=max_dbd_msg_action=[discard|exit]
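A minimal slurm.conf fragment combining the two options might look like this (the cap value of 100000 is an arbitrary illustration, not a recommendation):

```
# Cap the number of messages slurmctld will queue for slurmdbd.
MaxDBDMsgs=100000
# On reaching the cap, exit rather than discard (discard is the default).
SlurmctldParameters=max_dbd_msg_action=exit
```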

Let me know what you think.
Comment 20 S Senator 2019-10-23 15:25:25 MDT
Thank you for the update. I will test it as soon as I rebuild my playpen, which is in progress as we speak.^H^H^H^H^H type.

Thank you,
-Steve Senator

Comment 21 Danny Auble 2019-11-05 10:47:41 MST
Steve, any update on your tests?  Is it ok to close this?
Comment 22 Danny Auble 2019-12-18 15:51:06 MST
Steve, I am assuming this works as expected.  Please reopen if not.