Ticket 5080

Summary: Handle squeue filtering for --nodelist within the slurmctld to minimize data transfers
Product: Slurm Reporter: CSC sysadmins <csc-slurm-tickets>
Component: slurmctld    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN    QA Contact: ---
Severity: 5 - Enhancement    
Priority: ---    
Version: 17.02.10   
Hardware: Linux   
OS: Linux   
Site: CSC - IT Center for Science

Description CSC sysadmins 2018-04-19 09:50:58 MDT
Hi,

After debugging a strange error that one user generated, we finally figured out that a simple and innocent squeue command in an epilog script can cause massive data transfers from the slurmctld. We have several thousand jobs in the queues, and the following simple command run on a compute node generates over 2 MB of traffic in a tcpdump capture, full of unnecessary data (all running-job info, node info, etc.):


squeue --noheader --format=%A --user=$SLURM_UID --node=localhost

A user can cause a DoS simply by submitting a lot of jobs and scancelling them simultaneously. All nodes will start running the epilog at the same time, and each will transfer that 2 MB of info, saturating the slurmctld's 1 Gb Ethernet interface for a long time.
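A back-of-the-envelope calculation shows how quickly these replies saturate the link. The 2 MB reply size is from the capture above; the node count is an assumed example, not a CSC figure:

```python
# Illustrative numbers: 2 MB per squeue reply is from the ticket's tcpdump;
# the node count is a hypothetical example cluster size.
nodes = 500                # assumed number of nodes running the epilog at once
reply_mb = 2               # squeue reply size per node, in megabytes
link_mbps = 1000           # 1 Gb Ethernet on the slurmctld

total_mb = nodes * reply_mb            # 1000 MB of reply traffic
seconds = total_mb * 8 / link_mbps     # ~8 s of fully saturated link
print(f"{total_mb} MB total, ~{seconds:.0f} s at line rate")
```

Even under these modest assumptions, a single scancel storm ties up the controller's network interface for seconds at a time.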
Comment 1 Tim Wickberg 2018-04-19 17:04:55 MDT
Hi Tommi -

I'd recommend reworking your epilog to avoid that call.

We've discussed moving some of this filtering into the slurmctld - but there's an inherent tradeoff in having the slurmctld work out this filtering instead of the squeue command itself.

If you'd like us to look into that on 18.08+ I can reclassify this as an enhancement, but at the moment the RPCs for 17.02 and 17.11 cannot be changed to accommodate that.

I will note that we highly recommend use of proctrack/cgroup and task/cgroup alongside pam_slurm_adopt to avoid the need for this type of node cleanup in an Epilog script.
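For reference, the cgroup-based setup mentioned above would look roughly like the following slurm.conf fragment (a sketch only; pam_slurm_adopt is configured separately via PAM, and nothing here is taken from CSC's actual configuration):

```
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
```

With these plugins plus pam_slurm_adopt, stray user processes are tracked and cleaned up by cgroups, reducing the need for squeue-based cleanup logic in an Epilog script.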

- Tim
Comment 2 CSC sysadmins 2018-04-20 05:52:03 MDT
Hi,

I'd guess that on the road towards exascale systems this may need some attention.

I was surprised that squeue also transfers job stdout/stderr/script paths, and one particular user likes hyper-long paths and long argument lists, plus a lot of jobs, which caused this problem for us (the paths look more like ASCII art).

I'd still like a simple tool that tells whether a node is occupied or not (to enable power saving, clean up shm/IPCs that are not under cgroup control, etc.).

Doesn't the node's slurmd have that information (the count of running jobs, overall and per user)?

-Tommi
Comment 3 Tim Wickberg 2018-05-08 23:35:02 MDT
(In reply to Tommi Tervo from comment #2)
> Hi,
> 
> I'd guess that on the road towards exascale systems this may need some
> attention.
> 
> I was surprised that squeue will transfer also job out/err/script paths and
> one particular user likes hyper long paths and long argument lists + lot of
> jobs which caused this problem for us (paths looks more like an ascii art).
> 
> I'd still like to get a simple tool which tells if node is occupied or not
> (enable power saving, clean up shm/ipcs which are not in control of cgroup?
> etc.)
> 
> Node slurmd does not have that information (count of running jobs overall
> and per user) ?

Not directly.

If you're on the node, you can enumerate that (and fetch some additional details) from the slurmstepd processes over their unix sockets.

I'm retagging this as an enhancement to look at handling node-filtering on the slurmctld side, although I can't promise when/if we'll tackle this.

- Tim