One of our summer projects was to characterize slurm failure modes and suggest system administration procedures for when various slurm components are unable to function correctly. The root cause could be any stimulus external to slurm: an unexpected burst in external stress (power, cooling, etc.), a full file system at the location where dbd.msgs is stored, an unresponsive database, or an uncommunicative network link between components.

We found cases where the slurmdbd discarded records. In our environment, oversight and accounting are a high priority; discarding records is, at least in some cases, more severe than not running jobs. We would like to request a tunable parameter to specify the appropriate behavior: "discard" (the present policy) vs. "refuse operations which would generate a record." It is clear that some number of events would continue to trickle in for previously initiated job steps, jobs, and node states. However, those could likely be reconstructed from their un-mated predecessor records, similar to the way runaway job records are detectable and subsequently repaired.

A secondary request, which may merit a separate ticket if the above is reclassified as purely an enhancement, would be to:
1) document the exact priority, thresholds, and types of records which are discarded when the "discard" policy is in place
2) explicitly set the size of the dbd.messages spool rather than derive it, emitting a warning and/or refusing to start if it appears to be configured too low
3) document that the spool is sized as MaxJobCount*2 + NodeCount*4

Thank you.
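For illustration, the derived spool capacity cited above (MaxJobCount*2 + NodeCount*4) can be computed directly; the values below are hypothetical example settings for a mid-sized cluster, not defaults:

```shell
# Sketch: derive the dbd.messages spool capacity from the formula
# cited in this ticket. Example values only.
MaxJobCount=10000
NodeCount=500
spool_capacity=$(( MaxJobCount * 2 + NodeCount * 4 ))
echo "derived dbd.messages capacity: $spool_capacity messages"
```

With those example settings the controller would buffer at most 22000 messages before the discard behavior described above takes effect.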
Updating metadata on the ticket here.

> We would like to request a tunable parameter which would be used to specify
> the appropriate behavior: "discard" as in the present policy vs. "refuse
> operations which would generate a record."

Is this something you would like us to work up as an SoW?

> It is clear that some number of events would continue to trickle in for
> previously initiated job steps, jobs and node states. However, it is likely
> that those could be reconstructed from their un-mated predecessor records,
> similar to the way that runaway jobs records are detectable and subsequently
> repaired.
>
> A secondary request, which may merit a separate ticket if the above is
> reclassified as purely an enhancement, would be to:
> 1) document the exact priority, thresholds and type of records which are
> discarded when the "discard" policy is in place
> 2) explicitly set the size of the dbd.messages spool rather than derive it,
> emitting a warning message and/or refuse if it appears to be configured too
> low
> 3) document that the spool is sized by MaxJobCount*2+NodeCount*4

This is mentioned already on: https://slurm.schedmd.com/sdiag.html
Yes, please construct an SOW estimate, at least at the granularity of an initial scope of work (easy/2 days vs. hard/1 month).
Although this was requested due to evidence on a test cluster, it appears to have been manifesting in production on a secure cluster since earlier this week, running v.17.02.10. Please consider reclassifying this as a non-enhancement. It is manifesting as sacct errors where jobs show incorrect or unknown end times, which cause subsequent accounting and oversight reporting errors.
(In reply to S Senator from comment #4)
> Although this was requested due to evidence on a test cluster it appears to
> be manifesting in production on a secure cluster since earlier this week,
> running v.17.02.10. Please consider reclassifying this as a non-enhancement.

Even if support for the 17.02 release series were not expiring in eight days, our release process places a moratorium on adding features or configuration items to the stable releases except in unusual circumstances. The behavior you're describing is quite well established at this point, and will not change before 19.05.

If you wish to increase this size today, it is simple enough to increase MaxJobCount on the system, and rely on that indirectly influencing the size of that state file.

> This is manifesting as sacct errors where jobs are showing incorrect or
> unknown end times, which cause subsequent accounting and oversight reporting
> errors.

Please open a separate ticket if you'd like to discuss this.
We're in the middle of a transition to 17.11 ourselves, so there's no need to discuss the 17.02 release. I did want you to be aware, though, that it was manifesting on actual production systems. We understand, and have already adjusted MaxJobCount. This DBD message is now specifically in our alerting system as well.
I learn something new every day... I think you could accomplish the shutdown portion of your request right now. Look at the strigger man page, under the --primary_slurmctld_acct_buffer_full trigger. I believe you could use that to call 'scontrol shutdown' if desired.
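Following the suggestion above, a trigger along these lines should work; this is a sketch based on the strigger man page, and the script path and its contents are hypothetical:

```shell
# Hypothetical trigger script: stop the controller rather than let the
# accounting buffer overflow and discard records.
cat > /usr/local/sbin/dbd_buffer_full.sh <<'EOF'
#!/bin/sh
logger "slurmctld accounting buffer full; shutting down slurmctld"
scontrol shutdown
EOF
chmod +x /usr/local/sbin/dbd_buffer_full.sh

# Register the trigger so the script fires when the primary slurmctld's
# accounting buffer fills (see the strigger man page).
strigger --set --primary_slurmctld_acct_buffer_full \
         --program=/usr/local/sbin/dbd_buffer_full.sh
```

Note that triggers are one-shot by default, so the trigger would need to be re-registered after it fires (or after a controller restart).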
Per this morning's teleconference call with da@schedmd.com, please revisit this issue and consider classifying it as a bug, given that the default action is to discard messages in some circumstances. We do intend to use the 'strigger' mechanism supplied by timw until this issue is addressed.
Please provide a Statement of Work so that we can direct this task to the LANL/SchedMD contract.
Sorry for the delay Steve, the SOW is in the final stages right now. We will get it to you soon.
Steve, this has been added to our master branch (20.02) in commit 9a0010b132ab7c. Please go test and see what you think.

The option to set the max dbd messages is:
MaxDBDMsgs=#

The option to control what happens when it reaches this limit is:
SlurmctldParameters=max_dbd_msg_action=[discard|exit]

Let me know what you think.
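For reference, a slurm.conf excerpt exercising the two new options might look like the following; the values are illustrative examples, not recommendations:

```
# Cap the number of DBD messages slurmctld will queue while slurmdbd
# is unreachable, and exit rather than discard when the cap is hit.
MaxDBDMsgs=20000
SlurmctldParameters=max_dbd_msg_action=exit
```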
Thank you for the update. I will test it as soon as I rebuild my playpen, which is in progress as we speak.^H^H^H^H^H type.

Thank you,
-Steve Senator
Steve, any update on your tests? Is it ok to close this?
Steve, I am assuming this works as expected. Please reopen if not.