| Summary: | Slow handling of jobs with long command lines | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Janne Blomqvist <jblomqvist> |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | C - Contributions | ||
| Priority: | --- | CC: | erich.birngruber, uemit.seren |
| Version: | 18.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Patch to avoid Schlemiel the Painter's algorithm in _pack_default_job_details | ||
My apologies for not responding sooner. I'd rather not use raw memory functions here in this fashion. We did just recently add xstrfmtcatat(), which can track the end of the string and considerably improve construction of unwieldy strings such as this one. In 20.02 there are a handful of associated fixes to the xstring functions, so I believe switching over to xstrfmtcatat() would get us nearly identical performance improvements without the maintenance issue of raw memcpy() calls. If you'd like to propose a fix in that direction, I'd be glad to review that instead.

- Tim

We have the same issue with Slurm 20.11.5. A single user submitted around 1000 jobs with a very long command line (~118 kB), and since then we have been seeing similar log messages in the slurmctld log; users have reported frequent "Socket timed out on send/recv operation" errors when running sbatch, squeue, and other commands. We also see increased traffic on /var/spool/slurm on the Slurm controller compared to baseline. After cancelling the user's jobs with the long command line, traffic on /var/spool/slurm returned to baseline and the "Socket timed out on send/recv operation" errors went away. We can also see that the job_state file in /var/spool/slurm on the Slurm controller shrank quite a bit:

before:

```
-rw-------. 1 slurm slurm 119M Feb 24 11:12 job_state
```

after:

```
-rw-------. 1 slurm slurm 4.5M Feb 24 11:14 job_state
```
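The end-pointer idea behind xstrfmtcatat() can be sketched in plain C. This is a hypothetical illustration, not Slurm's actual xstring API; the names append_slow() and append_at() are made up for the example. strcat() rescans the destination from the start on every call, so building a string from n pieces costs O(n²) total (the "Schlemiel the Painter's algorithm"); threading the current end-of-string pointer through the appends makes each append proportional to the piece alone.

```c
#include <assert.h>
#include <string.h>

/* Quadratic in total: every call walks buf from the beginning
 * just to find where to append. */
static void append_slow(char *buf, const char *piece)
{
	strcat(buf, piece);
}

/* Linear in total: the caller keeps the end pointer between calls,
 * so no byte of the result is scanned more than once. */
static char *append_at(char *end, const char *piece)
{
	size_t len = strlen(piece);

	memcpy(end, piece, len);
	end += len;
	*end = '\0';
	return end;
}
```

Both produce identical strings; only the cost of repeated appends differs, which is why the difference only shows up on unwieldy strings like a 6 kB command line packed per job.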
Created attachment 12590 [details]
Patch to avoid Schlemiel the Painter's algorithm in _pack_default_job_details

We recently ran into an issue where a user cancelled all of his 20k jobs (scancel -u $USERNAME). This should be a relatively quick operation, but in this case it took 8 hours, and slurmctld effectively ground to a halt during that time. Most attempts to talk to slurmctld failed with "Socket timed out on send/recv operation", and slurmctld logged lots of messages like:

```
2019-12-13T19:37:52.148703+02:00 install2 slurmctld[25304]: Resending TERMINATE_JOB request JobId=48536653_17(48536671) Nodelist=pe52
2019-12-13T19:37:52.150485+02:00 install2 slurmctld[25304]: Resending TERMINATE_JOB request JobId=48536653_18(48536672) Nodelist=pe52
2019-12-13T19:37:52.151045+02:00 install2 slurmctld[25304]: Resending TERMINATE_JOB request JobId=48536653_19(48536653) Nodelist=pe52
2019-12-13T19:37:52.152502+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514913 completion process took 12794 seconds
2019-12-13T19:37:52.153824+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514910 completion process took 12794 seconds
2019-12-13T19:37:52.154818+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514895 completion process took 12794 seconds
2019-12-13T19:37:52.156219+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514902 completion process took 12794 seconds
2019-12-13T19:37:52.158332+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514912 completion process took 12794 seconds
2019-12-13T19:37:52.158847+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514915 completion process took 12794 seconds
2019-12-13T19:37:52.160246+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514903 completion process took 12794 seconds
2019-12-13T19:37:52.161593+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514908 completion process took 12794 seconds
2019-12-13T19:37:52.163425+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514905 completion process took 12794 seconds
2019-12-13T19:37:52.163981+02:00 install2 slurmctld[25304]: cleanup_completing: JobId=48514909 completion process took 12794 seconds
```

All these resent TERMINATE_JOB requests were strange, because on the nodes themselves there was no longer any trace of the jobs in question.

Running 'perf top' on the slurmctld node showed the top consumers as:

```
16,66%  slurmctld  [.] _pack_default_job_details
14,58%  [kernel]   [k] rwsem_spin_on_owner
 7,75%  [kernel]   [k] move_page_tables
 6,18%  [kernel]   [k] generic_exec_single
 5,22%  [kernel]   [k] flat_send_IPI_mask
```

perf record on slurmctld failed, apparently because slurmctld had so many file mappings (mmap) that something in perf went awry. One can also see that rwsem_spin_on_owner and move_page_tables are quite high in the profile, hinting at mmap overhead. It seems that for some reason slurmctld has started to use mmap to read the job state files, which isn't a good idea: mmap is usually slower than plain read/write syscalls, particularly for typically small files like the job script and environment files.

Looking more closely at _pack_default_job_details: unfortunately the Slurm binaries on our production system have their symbols stripped, but most of the time is spent in "repnz scas" relatively soon after the _find_node_conf call at line 10293, hinting at the str*() functions.

Looking at the jobs the user tried to cancel (note, these were really 20k separate jobs, not array jobs), the one thing that stands out is that they have pretty long command lines, about 100 arguments or 6 kB in total. And indeed I found a problem in _pack_default_job_details, fixed in the attached patch. I wasn't able to recreate the 8-hour cancellation time on my workstation, but cancelling 10k jobs with the 100-argument/6 kB command lines went from 6 seconds to 2 seconds, so at least some improvement.
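The attached patch itself isn't reproduced here, but the class of fix is easy to sketch: instead of appending each of the ~100 argv entries with a routine that rescans the growing result, compute the total length once and copy each argument at a running offset. The join_argv() below is a hypothetical illustration of that one-pass approach, not the actual Slurm code (and as noted elsewhere in this report, upstream preferred the xstrfmtcatat() route over raw memcpy() for maintainability).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch (not the actual Slurm patch): join argc strings
 * with single spaces in one pass. The total length is computed up front,
 * then each argument is copied at a running offset, so no byte of the
 * result is scanned more than once. Caller frees the result. */
static char *join_argv(int argc, char **argv)
{
	size_t total = 0;

	for (int i = 0; i < argc; i++)
		total += strlen(argv[i]) + 1;	/* ' ' separator or final NUL */

	char *out = malloc(total ? total : 1);
	if (!out)
		return NULL;

	char *p = out;
	for (int i = 0; i < argc; i++) {
		size_t len = strlen(argv[i]);

		memcpy(p, argv[i], len);
		p += len;
		*p++ = (i + 1 < argc) ? ' ' : '\0';
	}
	if (argc == 0)
		out[0] = '\0';
	return out;
}
```

With ~100 arguments per job and 10k–20k jobs being packed repeatedly, turning each per-job join from quadratic to linear in the command-line length is consistent with the 6 s → 2 s improvement measured above.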