Ticket 12523 - enable_rpc_queue
Summary: enable_rpc_queue
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 20.11.7
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-09-21 09:43 MDT by Paul Edmon
Modified: 2021-11-12 10:41 MST (History)
1 user (show)

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Paul Edmon 2021-09-21 09:43:17 MDT
I was just watching the Field Notes talk at SLUG when the undocumented enable_rpc_queue option was mentioned.  We do suffer from frequent RPC storms, and I was curious whether you think this option would be a good idea in our environment?
Comment 4 Michael Hinton 2021-09-22 12:57:46 MDT
Hey Paul,

Generally, enable_rpc_queue was meant to help during RPC storms, so I would think it would be something you might want to try out.

Tim mentioned that the main issue he remembers is that one of the RPC handlers can dispatch a message directly to an srun, in certain edge cases. If that srun doesn't exist any longer, it ties up that RPC processing thread for TcpTimeout, which can definitely cause some problems. Tim isn't sure if this issue has already been fixed, or if it's still a problem. So I'll get back to you on that when we have more information. Also, the engineer that created this feature is on vacation this week, so by next week he should be able to shed some more light on any outstanding issues.

Thanks!
-Michael
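
[For context, undocumented slurmctld tuning options of this kind are normally switched on through SlurmctldParameters in slurm.conf. The sketch below shows how that would look; whether enable_rpc_queue is accepted as a SlurmctldParameters value in this release is an assumption to verify with SchedMD, not something confirmed in this ticket. TCPTimeout and MessageTimeout are the real slurm.conf parameters behind the timeouts the comments refer to; the values shown are the Slurm defaults.]

```
# slurm.conf sketch -- assumption: enable_rpc_queue is accepted as a
# SlurmctldParameters value in this release; verify with SchedMD first
SlurmctldParameters=enable_rpc_queue

# Timeouts referenced in this ticket (values shown are the Slurm defaults)
TCPTimeout=2          # seconds an RPC thread can stall on a vanished srun
MessageTimeout=10     # seconds a hung client can hold up a response
```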
Comment 5 Paul Edmon 2021-09-22 13:01:18 MDT
Certainly.  Right now I'm planning on upgrading to 21.08.x (whatever the latest release is at the time) on November 1st.  I wouldn't plan on implementing this until we do that upgrade, as I know you guys have overhauled quite a bit of the RPC framework in that version.  I'm happy to try this out after that upgrade, as I frequently see us capping out on RPC count and will try anything that alleviates that.

-Paul Edmon-

Comment 6 Michael Hinton 2021-10-07 16:26:08 MDT
The only open issue I was made aware of with enable_rpc_queue is this:

When sending responses, the controller is under a lock. If any client command hangs, this blocks slurmctld for MessageTimeout seconds. Users can inadvertently hang the controller this way with an accidental "Ctrl+Z", a kill -19 (SIGSTOP), or a host freeze. Or they can do it intentionally to disrupt slurmctld. I'm not sure yet when this will be fixed, but I'll let you know when I do.

-Michael
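
[The hazard described in comment 6 is generic: a thread that sends to a stopped client while holding a lock stalls every other thread that needs that lock until the send times out. The Python sketch below is not Slurm code; it is a minimal stand-in where a socket pair plays the role of the client connection, a threading.Lock plays the role of the slurmctld lock, and a 1-second socket timeout plays the role of MessageTimeout.]

```python
import socket
import threading
import time

# The "client" end never reads, mimicking a process stopped by Ctrl+Z /
# kill -19, so the server's send() eventually blocks until its timeout.
server_sock, client_sock = socket.socketpair()
server_sock.settimeout(1.0)  # stands in for Slurm's MessageTimeout
server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)

lock = threading.Lock()      # stands in for the controller-wide lock
blocked_for = []             # how long other work waited on the lock

def send_response():
    """Hold the lock while sending to a client that never reads."""
    with lock:
        try:
            while True:      # buffers fill, then send() blocks and times out
                server_sock.send(b"x" * 4096)
        except socket.timeout:
            pass

def other_work():
    """Any unrelated task needing the same lock is stuck meanwhile."""
    start = time.monotonic()
    with lock:
        blocked_for.append(time.monotonic() - start)

sender = threading.Thread(target=send_response)
sender.start()
time.sleep(0.1)              # let the sender grab the lock first
worker = threading.Thread(target=other_work)
worker.start()
sender.join()
worker.join()
server_sock.close()
client_sock.close()

print(f"other work was blocked for ~{blocked_for[0]:.1f}s")
```

Running this shows the unrelated thread blocked for roughly the full send timeout, which is the same shape as a hung client command freezing slurmctld for MessageTimeout seconds.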
Comment 7 Paul Edmon 2021-10-08 07:46:08 MDT
Ouch, yeah, that would be bad.  We have quite a few client commands flying around at any given moment, and they do terminate prematurely or stall due to node faults or user impatience.  So it would be good to hold off on turning this on in our environment until that is resolved.

-Paul Edmon-

Comment 9 Michael Hinton 2021-11-12 10:41:52 MST
(In reply to Paul Edmon from comment #7)
OK, I will go ahead and mark this as resolved. Hopefully we can fix this issue with enable_rpc_queue so that it's safer to use. Feel free to reopen if you have further questions.

Thanks!
-Michael