Ticket 12523

Summary: enable_rpc_queue
Product: Slurm Reporter: Paul Edmon <pedmon>
Component: slurmctld    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Harvard University

Description Paul Edmon 2021-09-21 09:43:17 MDT
I was just watching the Field Notes talk at SLUG when mention was made of the undocumented enable_rpc_queue option. We do suffer from frequent RPC storms, and I was curious whether you thought this option would be a good idea in our environment here.
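For reference, my understanding from the talk is that it's toggled through SlurmctldParameters in slurm.conf. Since the option is undocumented, please confirm the exact spelling and placement:

```
# slurm.conf (sketch only; option is undocumented, confirm before use)
SlurmctldParameters=enable_rpc_queue
```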
Comment 4 Michael Hinton 2021-09-22 12:57:46 MDT
Hey Paul,

Generally, enable_rpc_queue was meant to help during RPC storms, so I would think it would be something you might want to try out.

Tim mentioned that the main issue he remembers is that one of the RPC handlers can dispatch a message directly to an srun, in certain edge cases. If that srun doesn't exist any longer, it ties up that RPC processing thread for TcpTimeout, which can definitely cause some problems. Tim isn't sure if this issue has already been fixed, or if it's still a problem. So I'll get back to you on that when we have more information. Also, the engineer that created this feature is on vacation this week, so by next week he should be able to shed some more light on any outstanding issues.

Thanks!
-Michael
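To make the failure mode Tim describes concrete, here is a toy sketch (plain Python, not Slurm source; the function and parameter names are invented). The point is that a reply sent to a vanished srun ties up the handling thread for as long as the network layer allows, which is why TcpTimeout bounds the damage:

```python
import socket

# Illustration only (not Slurm code): a handler that replies directly to an
# srun which has already exited can block on the network send. Bounding the
# socket with a timeout mirrors how TcpTimeout caps how long the RPC
# processing thread stays tied up.
TCP_TIMEOUT = 0.1  # seconds; illustrative value, real TcpTimeout is configurable

def reply_to_srun(addr):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(TCP_TIMEOUT)  # without this, the thread could hang far longer
    try:
        s.connect(addr)        # peer gone: refused fast, or hangs up to timeout
        s.sendall(b"response")
        return "sent"
    except OSError:
        return "failed"        # thread freed after at most TCP_TIMEOUT
    finally:
        s.close()
```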
Comment 5 Paul Edmon 2021-09-22 13:01:18 MDT
Certainly. Right now I'm planning on upgrading to 21.08.x (whatever the latest release is at the time) on November 1st. I wouldn't plan on implementing this until we do that upgrade, as I know you guys have overhauled quite a bit of the RPC framework in that version. I'm happy to try this out after the upgrade, since I frequently see us capping out on RPC count, so I'm happy to try anything that alleviates that.

-Paul Edmon-

Comment 6 Michael Hinton 2021-10-07 16:26:08 MDT
The only open issue I was made aware of with enable_rpc_queue is this:

When sending responses, the controller is under a lock. If any client command hangs, this blocks slurmctld for msgtimeout seconds. Users can inadvertently hang the controller this way with an accidental "Ctrl+Z", a kill -19, or a host freeze. Or they can intentionally do this to disrupt the slurmctld. I'm not sure yet when this will be fixed, but I'll let you know when I do.

-Michael
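Here's a toy sketch of the hazard above (plain Python, not Slurm source; all names are invented): when responses go out while a single lock is held, one hung client serializes everyone queued behind it.

```python
import threading
import time

# Toy illustration (not Slurm code): if slurmctld sends a response while
# holding its lock, a single hung client stalls every request behind it.
state_lock = threading.Lock()
served = []

def handle_request(name, send_delay):
    with state_lock:            # controller lock held...
        time.sleep(send_delay)  # ...for the whole (possibly hung) send
        served.append(name)

# The "hung" client (think an srun stopped with Ctrl+Z or kill -19) holds
# the lock for its full delay; the fast client must wait it out.
hung = threading.Thread(target=handle_request, args=("hung-client", 0.2))
fast = threading.Thread(target=handle_request, args=("fast-client", 0.0))
start = time.monotonic()
hung.start()
time.sleep(0.05)                # ensure the hung client grabs the lock first
fast.start()
hung.join(); fast.join()
elapsed = time.monotonic() - start
# The fast client could not be served until the hung send's delay elapsed.
```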
Comment 7 Paul Edmon 2021-10-08 07:46:08 MDT
Ouch, yeah that would be bad. We have quite a few client commands flying around at any given moment, and they do terminate prematurely or stall due to node faults or user impatience. So it would be good to hold off on turning this on in our environment until that is resolved.

-Paul Edmon-

Comment 9 Michael Hinton 2021-11-12 10:41:52 MST
(In reply to Paul Edmon from comment #7)
> Ouch, yeah that would be bad.  We have quite a few client commands 
> flying around at any given moment and they do terminate prematurely or 
> stall due to node fault or impatience of the user. So it would be good 
> to hold off on turning this on in our environment until that is resolved.
OK, I will go ahead and mark this as resolved. Hopefully we can figure out this issue with enable_rpc_queue so that it's safer to use. Feel free to reopen if you have further questions.

Thanks!
-Michael