Hi!

I'm re-posting this as a public follow-up to bug #12660, to hopefully bring more attention to that issue, where we realized that any user could run a denial-of-service against `slurmctld` by running endless `srun` loops within running jobs. Something like this:

```shell
$ salloc bash -c 'while true; do srun true & done'
```

will bring a Slurm site to a halt, as slurmctld will be overwhelmed with RPCs and basically stop scheduling altogether. To turn that DoS into a DDoS, all it takes is an array job:

```shell
$ sbatch --array=1-10 --wrap='while true; do srun true & done'
```

and then the agent queue size explodes, as shown in sdiag output:

```
Server thread count:  256
Agent queue size:     4380
Agent count:          82
Agent thread count:   246
DBD Agent queue size: 0
```

and the SRUN_STEP_SIGNAL counter shoots through the roof, with an endless stream of those as pending RPCs.

We understand there's no mechanism to limit RPC rates per user at the moment. And the existing limiting mechanisms (account, partition, or QOS limits, as well as cli_filter/job_submit scripts or SPANK plugins) are ineffective in that case, since they all operate at the job-allocation level and won't provide any protection against `srun` storms that happen *within* already-started jobs.

As it is, there doesn't seem to be any way to prevent a single user from performing a DoS (intentionally or not) against any Slurm installation, simply by submitting jobs. That's probably worth a CVE, as any DoS vulnerability would be. Could we please get a fix implemented (not as an RFE, but as a security fix)?

Thanks!
--
Kilian
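In case it helps other sites detect this condition before scheduling stalls, here is a minimal monitoring sketch. The `agent_queue_size` helper is hypothetical (not part of Slurm); it just extracts the "Agent queue size" value from sdiag-style output like the stats above, so a cron job or monitoring agent could alert on it:

```shell
# Hypothetical helper sketch (not part of Slurm): pull the "Agent queue size"
# value out of sdiag-style output. The leading ^ anchor avoids matching the
# separate "DBD Agent queue size" line.
agent_queue_size() {
    awk -F':' '/^Agent queue size/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Example, using the numbers reported above; on a live system you would
# pipe real output instead:  sdiag | agent_queue_size
sample='Server thread count: 256
Agent queue size: 4380
Agent count: 82
DBD Agent queue size: 0'
printf '%s\n' "$sample" | agent_queue_size   # prints 4380
```

From there, comparing the extracted number against a site-chosen threshold and paging an operator is straightforward.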
Hi Kilian,

I just wanted to send a quick note that we are looking at this ticket and will get back to you soon with more information.

Thanks,
Ben
Kilian - I'm tagging this as a duplicate of the outstanding development request in bug 5225. This is nothing new, and, as discussed privately, I do not believe this rises to the level of a CVE.

Developing and deploying a rate-limiting extension is not something that I will do on the stable release branches. I agree that it may be useful long-term, but it's not something I am committing to developing right now.

For those reading this ticket without further context: the 21.08 release is much more robust against this type of issue, due to performance work on step management. Other aspects can be rate-limited through cli_filter, but we do not generally recommend that unless your site has had a history of abuse of the client commands. (And even then, anything done through cli_filter can be bypassed by a malicious user.)

As one parting reminder: if you believe you have discovered a security issue within Slurm, please refer to https://www.schedmd.com/security.php for details on responsibly reporting those issues.

- Tim

*** This ticket has been marked as a duplicate of ticket 5225 ***
Hi all,

For the sake of completeness, I just wanted to add that we moved to 21.08 and activated SlurmctldParameters=enable_rpc_queue. Although global Slurm responsiveness seems to have significantly improved (thanks in large part to slurmscriptd, apparently), I'm afraid this issue is still present. A simple job like:

```shell
$ sbatch --array=1-10 --wrap='while true; do srun true & done'
```

is still enough to produce sdiag stats like this:

```
Server thread count:  48
Agent queue size:     6514
Agent count:          82
Agent thread count:   246
DBD Agent queue size: 0
```

and practically stall scheduling.

Cheers,
--
Kilian
Kilian,

> SlurmctldParameters=enable_rpc_queue

For now, please hold off on adding enable_rpc_queue. I will reach out to you directly.
Hi all,

To definitively close the loop on this: thank you for implementing a per-user RPC rate limiter in 23.02. It effectively resolves this issue (as well as #5225, I think). So, thanks! :D

Cheers,
--
Kilian
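For anyone landing on this ticket later: the 23.02 limiter is enabled through SlurmctldParameters in slurm.conf. A hedged sketch of what that can look like (please verify the exact rl_* option names and defaults against the slurm.conf(5) man page for your release; the values below are illustrative only, not a tuning recommendation):

```
# slurm.conf fragment (sketch; check slurm.conf(5) for 23.02+ before using).
# rl_enable turns on the per-user RPC rate limiter; the other rl_* options
# tune the token bucket that governs how many RPCs each user may burst and
# how quickly their allowance refills.
SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=15,rl_refill_period=1
```

With a token bucket, short bursts of legitimate `srun` activity pass through untouched, while sustained storms like the reproduction above get throttled instead of flooding slurmctld.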