| Summary: | slurmctld DoS with a simple srun loop | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | bart, bas.vandervlies, cinek, csamuel, marshall, tim |
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5225 | ||
| Site: | Stanford | Slinky Site: | --- |
|
Description
Kilian Cavalotti
2021-10-20 09:52:56 MDT
Hi Kilian,

I just wanted to send a quick note that we are looking at this ticket and will get back to you soon with more information.

Thanks,
Ben

Kilian - I'm tagging this as a duplicate of the outstanding development request in bug 5225.

This is nothing new, and, as discussed privately, I do not believe this rises to the level of a CVE. Developing and deploying a rate-limiting extension is not something that I will do on the stable release branches. I agree that it may be useful long-term, but it's not something I am committing to developing right now.

For those reading this ticket without further context: the 21.08 release is much more robust against this type of issue due to performance work on step management. Other aspects can be rate-limited through cli_filter, but we do not generally recommend that unless your site has had a history of abuse of the client commands. (And even then, anything done through cli_filter can be bypassed by a malicious user.)

As one parting reminder: if you believe you have discovered a security issue within Slurm, please refer to https://www.schedmd.com/security.php for details on responsibly reporting those issues.

- Tim

*** This ticket has been marked as a duplicate of ticket 5225 ***

Hi all,

For the sake of completeness, I just wanted to add that we moved to 21.08, and activated SlurmctldParameters=enable_rpc_queue. Although the global Slurm responsiveness seems to have significantly improved (a lot thanks to slurmscriptd, apparently), I'm afraid that this issue is still present. A simple job like "sbatch --array=1-10 --wrap='while true; do srun true & done'" is still enough to produce sdiag stats like this:

Server thread count: 48
Agent queue size: 6514
Agent count: 82
Agent thread count: 246
DBD Agent queue size: 0

and practically stall scheduling.

Cheers,
--
Kilian

Kilian,
> SlurmctldParameters=enable_rpc_queue.
For now, please hold off on adding enable_rpc_queue at this time. I will reach out to you directly.
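
For readers who want to watch for the backlog described earlier in this thread, here is a minimal sketch that pulls the "Agent queue size" counter out of `sdiag` output. The helper name `agent_queue_size` is hypothetical, and the parsing assumes the `name: value` line layout quoted in this ticket; adjust it if your `sdiag` version formats its output differently.

```shell
# Hypothetical helper: read `sdiag` output on stdin and print the value of the
# "Agent queue size" counter. Assumes "name: value" lines, as quoted above.
agent_queue_size() {
    # Split on the colon plus any padding; match only the slurmctld agent
    # queue line, not "DBD Agent queue size".
    awk -F': *' '$1 ~ /^[ \t]*Agent queue size$/ { print $2; exit }'
}
```

On a live cluster this could be polled with something like `sdiag | agent_queue_size`. A value that keeps growing while jobs stop starting would match the stall reported here.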
Hi all,

To definitely close the loop on this, I wanted to say thank you for ending up implementing a per-user RPC rate limiter in 23.02. This effectively resolves this issue (as well as #5225, I think).

So, thanks! :D

Cheers,
--
Kilian |
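
As a rough illustration of the 23.02 rate limiter mentioned above: to the best of my understanding it is switched on through `SlurmctldParameters=rl_enable` in slurm.conf, with related `rl_*` tuning options described in the slurm.conf man page. The sketch below only checks whether that option is active. The helper name `has_rate_limit` is hypothetical, and the `SlurmctldParameters = ...` line layout of `scontrol show config` is an assumption to verify on your own system.

```shell
# slurm.conf (23.02 or later), assumed minimal form:
#   SlurmctldParameters=rl_enable
#
# Hypothetical helper: read `scontrol show config` output on stdin and
# succeed (exit 0) only if rl_enable appears in SlurmctldParameters.
has_rate_limit() {
    grep -E '^SlurmctldParameters' | grep -qw 'rl_enable'
}
```

Typical use would be `scontrol show config | has_rate_limit && echo "rate limiting enabled"`.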