Ticket 12703 - slurmctld DoS with a simple srun loop
Summary: slurmctld DoS with a simple srun loop
Status: RESOLVED DUPLICATE of ticket 5225
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.8
Hardware: Linux
OS: Linux
Severity: 2 - High Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-10-20 09:52 MDT by Kilian Cavalotti
Modified: 2023-07-13 16:09 MDT
CC List: 6 users

See Also:
Site: Stanford
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Kilian Cavalotti 2021-10-20 09:52:56 MDT
Hi!

I'm re-posting this as a public follow-up to bug #12660, to hopefully bring more attention to the issue: we realized that any user can mount a denial-of-service attack against `slurmctld` simply by running endless `srun` loops within running jobs.


Something like this:

$ salloc bash -c 'while true; do srun true & done'

will bring a Slurm site to a halt, as slurmctld will be
overwhelmed with RPCs and basically stop scheduling altogether.


To make that DoS a DDoS, all it takes is an array job:

$ sbatch --array=1-10 --wrap='while true; do srun true & done'

and then the agent queue size explodes, as shown in sdiag output:

Server thread count:  256
Agent queue size:     4380
Agent count:          82
Agent thread count:   246
DBD Agent queue size: 0

and the SRUN_STEP_SIGNAL counter shoots through the roof, with a seemingly endless list of those messages among the pending RPCs.


We understand there's no mechanism to limit RPC rates per user at the moment.

And existing limiting mechanisms (account limits, partition or QoS
limits, as well as cli_filter/job_submit scripts or SPANK plugins) are
ineffective in that case, since they all operate at the job allocation
level and won't provide any protection against `srun` storms that
happen *within* already-started jobs.

As it is, there doesn't seem to be any way to prevent a single user
from performing a DoS (intentionally or not) against any Slurm
installation, by simply submitting jobs.

That's probably worth a CVE, like any DoS vulnerability. Could we please
get a fix implemented (not as an RFE, but as a security fix)?


Thanks!
--
Kilian
Comment 3 Ben Roberts 2021-10-20 16:11:24 MDT
Hi Kilian,

I just wanted to send a quick note that we are looking at this ticket and will get back to you soon with more information.

Thanks,
Ben
Comment 7 Tim Wickberg 2021-10-21 12:15:50 MDT
Kilian -

I'm tagging this as a duplicate of the outstanding development request in bug 5225.

This is nothing new, and, as discussed privately, I do not believe this rises to the level of a CVE.

Developing and deploying a rate-limiting extension is not something that I will do on the stable release branches. I agree that it may be useful long-term, but it's not something I am committing to developing right now.

For those reading this ticket without further context: the 21.08 release is much more robust against this type of issue due to performance work on step management. Other aspects can be rate-limited through cli_filter, but we do not generally recommend that unless your site has had a history of abuse of the client commands. (And even then, anything done through cli_filter can be bypassed by a malicious user.)
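
For readers who do want a client-side stopgap, below is a rough sketch of what a lua cli_filter could look like. It assumes the slurm_cli_pre_submit() hook and options table shown in Slurm's etc/cli_filter.lua.example; the state file path and thresholds are made up for illustration, so check the exact API against your release, and keep in mind (as noted above) that anything done here can be bypassed by a malicious user.

-- cli_filter.lua (sketch): throttle srun invocations per user.
-- Assumes the lua cli_filter hooks (slurm_cli_setup_defaults,
-- slurm_cli_pre_submit, slurm_cli_post_submit) and that options["type"]
-- names the client command; verify both against your Slurm version.

local MAX_CALLS = 30   -- srun invocations allowed per window (illustrative)
local WINDOW    = 60   -- window length in seconds (illustrative)
-- Hypothetical per-user state file; pick a location that suits your site.
local STATE = "/tmp/srun_rate." .. tostring(os.getenv("USER") or "unknown")

function slurm_cli_setup_defaults(options, early_pass)
    return slurm.SUCCESS
end

function slurm_cli_pre_submit(options, pack_offset)
    if options["type"] ~= "srun" then
        return slurm.SUCCESS
    end

    local now = os.time()
    local count, start = 0, now

    -- Read the previous counter and window start, if any.
    local f = io.open(STATE, "r")
    if f then
        count = tonumber(f:read("*l")) or 0
        start = tonumber(f:read("*l")) or now
        f:close()
    end

    if now - start >= WINDOW then
        count, start = 0, now   -- window expired, reset the counter
    end
    count = count + 1

    -- Persist the updated counter.
    f = io.open(STATE, "w")
    if f then
        f:write(count, "\n", start, "\n")
        f:close()
    end

    if count > MAX_CALLS then
        slurm.log_error("srun rate limit exceeded, please slow down")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_cli_post_submit(offset, job_id, step_id)
    return slurm.SUCCESS
end

Note that this only slows down well-behaved clients on the submit side; it does nothing about RPCs generated by steps that are already running, which is why a server-side limiter is the more complete answer.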

As one parting reminder: if you believe you have discovered a security issue within Slurm, please refer to https://www.schedmd.com/security.php for details on responsibly reporting those issues.

- Tim

*** This ticket has been marked as a duplicate of ticket 5225 ***
Comment 8 Kilian Cavalotti 2021-11-18 19:27:30 MST
Hi all,

For the sake of completeness, I just wanted to add that we moved to 21.08, and activated SlurmctldParameters=enable_rpc_queue.

Although global Slurm responsiveness seems to have significantly improved (in large part thanks to slurmscriptd, apparently), I'm afraid this issue is still present.

A simple job like "sbatch --array=1-10 --wrap='while true; do srun true & done'" is still enough to produce sdiag stats like this:

Server thread count:  48
Agent queue size:     6514
Agent count:          82
Agent thread count:   246
DBD Agent queue size: 0

and practically stall scheduling.


Cheers,
--
Kilian
Comment 9 Jason Booth 2021-11-19 10:37:22 MST
Kilian,

> SlurmctldParameters=enable_rpc_queue.

Please hold off on adding enable_rpc_queue for now. I will reach out to you directly.
Comment 10 Kilian Cavalotti 2023-07-13 16:09:43 MDT
Hi all, 

To definitively close the loop on this, I wanted to say thank you for implementing a per-user RPC rate limiter in 23.02. It effectively resolves this issue (as well as #5225, I think).
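
For anyone finding this ticket later: the limiter is turned on through SlurmctldParameters in slurm.conf. A minimal sketch, assuming the rl_* options documented for 23.02 (names, defaults, and exact semantics should be checked against the slurm.conf man page for your release):

# slurm.conf (sketch): per-user RPC rate limiting, 23.02+
# Token-bucket limiter: each user gets rl_bucket_size tokens, refilled at
# rl_refill_rate tokens every rl_refill_period seconds; clients whose RPCs
# exceed the limit are told to back off and retry. Values are illustrative.
SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=10,rl_refill_period=1,rl_table_size=8192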

So, thanks! :D

Cheers,
--
Kilian