Ticket 5480

Summary: slurmd log file output corruption
Product: Slurm    Reporter: GSK-ONYX-SLURM <slurm-support>
Component: slurmd    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 5 - Enhancement    Priority: ---
CC: jbooth, tim
Version: 17.11.7
Hardware: Linux
OS: Linux
Site: GSK

Description GSK-ONYX-SLURM 2018-07-25 04:32:38 MDT
Hi.
I am seeing corruption in the slurmd log file.  See how other output is getting injected into the "done with job" message.

[2018-07-24T14:25:59.325] Launching batch job 13412 for UID 62356
[2018-07-24T14:25:59.999] [13361.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-07-24T14:26:00.008] [13361.[2018-07-24T14:26:00.008] [13411.batch] task_p_pre_launch: Using sched_affinity for tasks
batch] done with job
[2018-07-24T14:26:00.043] task_p_slurmd_batch_request: 13413
[2018-07-24T14:26:00.043] task/affinity: job 13413 CPU input mask for node: 0x000000000400
[2018-07-24T14:26:00.043] task/affinity: job 13413 CPU final HW mask for node: 0x000000000020


[2018-07-24T19:41:04.036] [20243.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T19:41:07.137] [20193.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-07-24T19:41:07.165] [20193.[2018-07-24T19:41:07.166] task_p_slurmd_batch_request: 20244
batch] done with job
[2018-07-24T19:41:07.166] task/affinity: job 20244 CPU input mask for node: 0x200000000000
[2018-07-24T19:41:07.166] task/affinity: job 20244 CPU final HW mask for node: 0x400000000000
[2018-07-24T19:41:07.167] _run_prolog: run job script took usec=12


[2018-07-24T21:25:01.709] [27884.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:25:01.718] [27858.[2018-07-24T21:25:01.718] [27835.batch] done with job
batch] done with job
[2018-07-24T21:25:01.788] [27885.batch] task/cgroup: /slurm/uid_62356/job_27885: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-24T21:25:01.845] [27885.batch] task/cgroup: /slurm/uid_62356/job_27885/step_batch: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB


[2018-07-24T21:26:02.226] _run_prolog: run job script took usec=6
[2018-07-24T21:26:02.226] _run_prolog: prolog with lock for job 28024 ran for 0 seconds
[2018-07-24T21:26:02.230] [27982.[2018-07-24T21:26:02.230] Launching batch job 28024 for UID 62356
batch] done with job
[2018-07-24T21:26:02.250] task_p_slurmd_batch_request: 28025
[2018-07-24T21:26:02.250] task/affinity: job 28025 CPU input mask for node: 0x000008000000
[2018-07-24T21:26:02.250] task/affinity: job 28025 CPU final HW mask for node: 0x002000000000

Not a major issue, but it makes the log messy when we're investigating problems.

Thanks.
Mark.
Comment 1 Michael Hinton 2018-07-25 14:19:07 MDT
That does look annoying. I'll look into it! :)
Comment 2 Michael Hinton 2018-07-26 10:32:48 MDT
Hey Mark,

Could you send more example output demonstrating this issue? I noticed that the log lines get interrupted after the 33rd character, which is very curious. I would have expected it to be more random than that.

Do you have any more info regarding when exactly this happens, or how I can duplicate this issue?

Thanks,
Michael
Comment 4 Michael Hinton 2018-07-26 10:50:48 MDT
Can you also tell me how frequently you see this issue?
Comment 5 GSK-ONYX-SLURM 2018-07-27 03:59:42 MDT
Hi.
Yesterday afternoon we ran around 20,000 jobs and saw:

[root@uk1salx00553 slurm]# grep -F ".[" slurmd-uk1salx00553.log | more
[2018-07-26T15:40:46.579] [34009.[2018-07-26T15:40:46.579] [34054.batch] task/cgroup: /slurm/uid_62356/job_34054: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T16:29:47.087] [34641.[2018-07-26T16:29:47.087] [34631.batch] done with job
[2018-07-26T16:40:42.075] [35009.[2018-07-26T16:40:42.075] [34978.batch] done with job
[2018-07-26T16:49:27.675] [36005.[2018-07-26T16:49:27.675] [36048.batch] task/cgroup: /slurm/uid_62356/job_36048: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T16:57:26.112] [36136.[2018-07-26T16:57:26.112] [36137.batch] done with job
[2018-07-26T16:57:26.412] [36143.[2018-07-26T16:57:26.412] [36205.batch] task/cgroup: /slurm/uid_62356/job_36205: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T17:23:20.297] [36981.[2018-07-26T17:23:20.297] [36977.batch] done with job
[2018-07-26T18:14:46.867] [38506.[2018-07-26T18:14:46.867] [38509.batch] done with job
[2018-07-26T19:46:17.288] [41068.[2018-07-26T19:46:17.288] [41070.batch] done with job

The log from the 23rd/24th of July is similar:

[root@uk1salx00553 slurm]# grep -F ".[" slurmd-uk1salx00553-25Jul2018.log | more
[2018-07-23T20:42:12.329] [10759.[2018-07-23T20:42:12.329] [10754.batch] done with job
[2018-07-24T14:26:00.008] [13361.[2018-07-24T14:26:00.008] [13411.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T19:41:07.165] [20193.[2018-07-24T19:41:07.166] task_p_slurmd_batch_request: 20244
[2018-07-24T20:46:17.696] [23567.[2018-07-24T20:46:17.696] [23582.batch] done with job
[2018-07-24T20:50:39.015] [24023.[2018-07-24T20:50:39.015] [24080.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:05:33.494] [25558.[2018-07-24T21:05:33.494] [25595.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:12:35.593] [26377.[2018-07-24T21:12:35.593] [26385.batch] done with job
[2018-07-24T21:17:00.815] [26929.[2018-07-24T21:17:00.815] [26928.batch] done with job
[2018-07-24T21:19:46.381] [27254.[2018-07-24T21:19:46.381] [27246.batch] done with job
[2018-07-24T21:21:58.198] [27494.[2018-07-24T21:21:58.198] [27478.batch] done with job
[2018-07-24T21:23:34.607] [27685.[2018-07-24T21:23:34.607] [27694.batch] done with job
[2018-07-24T21:25:01.718] [27858.[2018-07-24T21:25:01.718] [27835.batch] done with job
[2018-07-24T21:26:02.230] [27982.[2018-07-24T21:26:02.230] Launching batch job 28024 for UID 62356
[2018-07-24T21:26:19.820] [28021.[2018-07-24T21:26:19.820] [28013.batch] done with job
[2018-07-24T21:30:15.926] [28406.[2018-07-24T21:30:15.926] [28412.batch] done with job
[2018-07-24T21:35:08.732] [28996.[2018-07-24T21:35:08.732] [28999.batch] done with job
[2018-07-24T21:37:59.156] [29347.[2018-07-24T21:37:59.156] [29341.batch] done with job
[2018-07-24T21:42:01.533] [29815.[2018-07-24T21:42:01.533] [29784.batch] done with job
[2018-07-24T21:44:23.917] [30077.[2018-07-24T21:44:23.917] [30071.batch] done with job
[2018-07-24T21:53:13.065] [30944.[2018-07-24T21:53:13.065] [30938.batch] done with job
[2018-07-24T21:53:43.007] [31007.[2018-07-24T21:53:43.007] [31005.batch] done with job

Thanks.
Mark.
Comment 8 Michael Hinton 2018-07-27 14:30:56 MDT
Hey Mark, it looks like this is a known issue. It hasn't been fixed because fixing it would incur a performance penalty, due to the synchronization steps that would have to take place between the processes writing to the log.
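
To illustrate the mechanism (a minimal standalone sketch, not Slurm's actual logging code): when a log line is emitted as more than one write() to a shared, append-mode file, another process can slip its own output into the gap between the writes, which is exactly the pattern in your logs:

/* interleave_demo.c - minimal sketch, NOT Slurm's logging code.
 * Two processes append to the same log file, but each line is
 * emitted as two write() calls. Each write() is atomic with
 * O_APPEND, yet nothing orders the pair, so another process's
 * output can land in between. */
#include <fcntl.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void log_line(int fd, const char *prefix, const char *body)
{
    write(fd, prefix, strlen(prefix));  /* "[timestamp] [jobid." */
    usleep(1000);                       /* widen the race window for the demo */
    write(fd, body, strlen(body));      /* rest of the message */
}

int main(void)
{
    int fd = open("demo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    if (fork() == 0) {  /* child plays the role of a second slurmstepd */
        log_line(fd, "[2018-07-24T14:26:00.008] [13361.", "batch] done with job\n");
        _exit(0);
    }
    log_line(fd, "[2018-07-24T14:26:00.008] [13411.", "batch] task_p_pre_launch: ...\n");
    wait(NULL);
    close(fd);
    return 0;
}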

For now, as a workaround, you could switch to using syslog instead, and then route the slurmd-specific output to a dedicated file. This will prevent messages from interfering with each other, as syslog has built-in synchronization. Just leave SlurmdLogFile unspecified in slurm.conf for Slurm to use syslog. You may want to set SlurmdSyslogDebug as well. See https://slurm.schedmd.com/slurm.conf.html
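
As a sketch of what that could look like (exact paths, debug level, and syslog program names depend on your setup, so treat these as assumptions):

# slurm.conf: leave SlurmdLogFile unset so slurmd logs to syslog
#SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdSyslogDebug=info

# /etc/rsyslog.d/30-slurmd.conf (assumes rsyslog; 'slurmd' and
# 'slurmstepd' are the daemons' usual syslog program names):
if $programname == 'slurmd' or $programname == 'slurmstepd' then {
    action(type="omfile" file="/var/log/slurm/slurmd-syslog.log")
    stop
}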

We're looking into possibly creating a SynchronizeSlurmdLog option in slurm.conf that would let you trade performance for a synchronized slurmd log. If we do add it, it wouldn't land until at least 19.05.

I’m changing this bug to a feature enhancement.
Comment 10 GSK-ONYX-SLURM 2018-07-30 05:53:15 MDT
OK, thanks for the info.  Please go ahead and close the bug.

Thanks.
Mark.
Comment 11 Michael Hinton 2018-07-31 09:57:55 MDT
Closing out bug.