I was just watching the Field Notes in SLUG when the undocumented enable_rpc_queue option was mentioned. We suffer from frequent RPC storms, and I was curious whether you think this option would be a good idea in our environment.
Hey Paul,

Generally, enable_rpc_queue was meant to help during RPC storms, so I would think it would be something you might want to try out.

Tim mentioned that the main issue he remembers is that, in certain edge cases, one of the RPC handlers can dispatch a message directly to an srun. If that srun no longer exists, it ties up that RPC processing thread for TcpTimeout, which can definitely cause some problems. Tim isn't sure whether this issue has already been fixed or is still a problem, so I'll get back to you on that when we have more information. Also, the engineer who created this feature is on vacation this week, so by next week he should be able to shed some more light on any outstanding issues.

Thanks!
-Michael
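To make the concern above concrete: assuming queued RPCs are drained in order by a single processing thread, one handler that waits out a timeout delays everything queued behind it. The following is a minimal Python sketch of that effect, not Slurm code. The names `run_queue`, `reply_to_missing_srun`, and `quick_rpc` are hypothetical, `TCP_TIMEOUT` is a shortened stand-in for TcpTimeout, and the blocked send is simulated with a sleep.

```python
import queue
import threading
import time

TCP_TIMEOUT = 0.3  # hypothetical stand-in for TcpTimeout, shortened for the demo


def run_queue(handlers):
    """Run handlers in order on one worker thread.

    Returns, for each handler, the seconds elapsed from the start of the
    run until that handler finished.
    """
    work = queue.Queue()
    delays = []
    start = time.monotonic()

    def worker():
        while True:
            handler = work.get()
            if handler is None:  # sentinel: no more work
                break
            handler()
            delays.append(time.monotonic() - start)

    thread = threading.Thread(target=worker)
    thread.start()
    for handler in handlers:
        work.put(handler)
    work.put(None)
    thread.join()
    return delays


def reply_to_missing_srun():
    # Simulates a send that only gives up once the timeout expires.
    time.sleep(TCP_TIMEOUT)


def quick_rpc():
    # Simulates an RPC that would normally complete immediately.
    pass


if __name__ == "__main__":
    delays = run_queue([reply_to_missing_srun, quick_rpc, quick_rpc])
    for index, delay in enumerate(delays):
        print(f"handler {index} finished after {delay:.2f}s")
```

In this sketch the quick handlers do no work of their own, yet they cannot finish until the stalled handler ahead of them has timed out.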
(In reply to Michael Hinton from comment #4 on bug 12523 <https://bugs.schedmd.com/show_bug.cgi?id=12523>, 9/22/2021 2:57 PM)
Certainly. Right now I'm planning on upgrading to 21.08.x (whatever the latest release is at the time) on November 1st. I wouldn't plan on implementing this until we do that upgrade, as I know you guys have overhauled quite a bit of the RPC framework in that version. I'm happy to try this out after the upgrade: I frequently see us capping out on RPC count, so I'm happy to try anything that alleviates that.
-Paul Edmon-
The only open issue I was made aware of with enable_rpc_queue is this:

When sending responses, the controller is under a lock. If any client command hangs, this blocks slurmctld for msgtimeout seconds. Users can inadvertently hang the controller this way with an accidental "Ctrl+Z", a kill -19, or a host freeze. Or they can intentionally do this to disrupt the slurmctld. I'm not sure yet when this will be fixed, but I'll let you know when I do.

-Michael
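The failure mode described above can be reproduced in miniature. The following is a hedged Python sketch, not Slurm code: a responder thread sends a reply while holding a lock, the peer never reads (as a client stopped with Ctrl+Z would not), and a second thread that needs the same lock stalls until the send times out. The names `measure_stall` and `MSG_TIMEOUT` are hypothetical, and the timeout is shortened for the demo.

```python
import socket
import threading
import time

MSG_TIMEOUT = 0.5  # hypothetical stand-in for the message timeout, shortened for the demo


def measure_stall(timeout=MSG_TIMEOUT):
    """Return (sent, waited).

    sent: whether the send under the lock completed.
    waited: seconds a second thread spent waiting for that lock.
    """
    lock = threading.Lock()
    holding = threading.Event()
    # The peer end is never read from, like a client that has been stopped.
    server_side, stopped_client = socket.socketpair()
    server_side.settimeout(timeout)
    payload = b"x" * (32 * 1024 * 1024)  # far larger than typical socket buffers
    result = {}

    def responder():
        with lock:
            holding.set()  # signal that the lock is now held
            try:
                server_side.sendall(payload)
                result["sent"] = True
            except socket.timeout:
                result["sent"] = False

    thread = threading.Thread(target=responder)
    thread.start()
    holding.wait()
    start = time.monotonic()
    with lock:  # any other work that needs the lock stalls here
        waited = time.monotonic() - start
    thread.join()
    server_side.close()
    stopped_client.close()
    return result["sent"], waited


if __name__ == "__main__":
    sent, waited = measure_stall()
    print(f"send completed: {sent}; other thread waited {waited:.2f}s for the lock")
```

In this sketch the wait for the lock tracks the send timeout, which is why a single stopped client is enough to stall every other consumer of that lock.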
(In reply to Michael Hinton from comment #6, 10/7/2021 6:26 PM)
Ouch, yeah that would be bad. We have quite a few client commands flying around at any given moment and they do terminate prematurely or stall due to node fault or impatience of the user. So it would be good to hold off on turning this on in our environment until that is resolved.
-Paul Edmon-
(In reply to Paul Edmon from comment #7)
> Ouch, yeah that would be bad. We have quite a few client commands
> flying around at any given moment and they do terminate prematurely or
> stall due to node fault or impatience of the user. So it would be good
> to hold off on turning this on in our environment until that is resolved.

Ok. I will go ahead and mark this as resolved. Hopefully we can figure out this issue with enable_rpc_queue so that it's safer to use. Feel free to reopen if you have further questions.

Thanks!
-Michael