Ticket 9068

Summary: SlurmdDebug=info reports more than SlurmdSyslogDebug=verbose
Product: Slurm
Reporter: Chris Samuel (NERSC) <csamuel>
Component: Other
Assignee: Nate Rini <nate>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Priority: ---
CC: dmjacobsen
Version: 19.05.6
Hardware: Linux
OS: Linux
Site: NERSC

Description Chris Samuel (NERSC) 2020-05-15 19:11:32 MDT
Hi there,

I'm experimenting with logging "verbose" information to syslog (well, journald as a ring buffer) alongside the current "info" information going to a log file.
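
For context, the ring-buffer behaviour here comes from journald's volatile storage mode. A sketch of the relevant journald.conf settings (the size cap is illustrative, not our exact value):

[Journal]
Storage=volatile
RuntimeMaxUse=64M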

From "scontrol show config":

SlurmdDebug             = info
SlurmdLogFile           = /var/spool/slurmd/%h.log
SlurmdSyslogDebug       = verbose
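
For reference, these map to the following slurm.conf settings (a sketch of the relevant lines; %h is expanded to the node's hostname):

SlurmdDebug=info
SlurmdLogFile=/var/spool/slurmd/%h.log
SlurmdSyslogDebug=verbose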

What I'm seeing, though, is that some information goes only to the log file, despite the syslog level being higher. For instance, here's me doing some testing after a restart of slurmd.

First the logfile:

[2020-05-15T12:33:43.827] Message aggregation disabled
[2020-05-15T12:33:43.921] slurmd version 19.05.6 started
[2020-05-15T12:33:43.923] error: No /var/spool/slurmd/job_container_state file for job_container/cncu state recovery
[2020-05-15T12:33:43.924] core_spec/cray_aries: init
[2020-05-15T12:33:43.934] slurmd started on Fri, 15 May 2020 12:33:43 -0700
[2020-05-15T12:33:43.935] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96512 TmpDisk=48256 Uptime=38187 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2020-05-15T16:18:54.771] task_p_slurmd_batch_request: 1932005
[2020-05-15T16:18:54.771] task/affinity: job 1932005 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2020-05-15T16:18:54.771] task/affinity: job 1932005 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2020-05-15T16:18:54.971] _run_prolog: run job script took usec=238992
[2020-05-15T16:18:54.972] _run_prolog: prolog with lock for job 1932005 ran for 0 seconds
[2020-05-15T16:18:55.029] [1932005.extern] core_spec/cray_aries: init
[2020-05-15T16:18:55.057] [1932005.extern] error: nersc_perf: failed to read perf status file: /var/spool/slurmd/nersc_perf_job1932005
[2020-05-15T16:18:55.151] [1932005.extern] task/cgroup: /slurm/uid_78672/job_1932005: alloc=89088MB mem.limit=89088MB memsw.limit=unlimited
[2020-05-15T16:18:55.152] [1932005.extern] task/cgroup: /slurm/uid_78672/job_1932005/step_extern: alloc=89088MB mem.limit=89088MB memsw.limit=unlimited
[2020-05-15T16:18:55.159] [1932005.extern] error: Couldn't find shifterConfig.json, cannot do extern step processing.
[2020-05-15T16:18:55.160] Launching batch job 1932005 for UID 78672
[2020-05-15T16:18:55.217] [1932005.batch] core_spec/cray_aries: init
[2020-05-15T16:18:55.246] [1932005.batch] error: nersc_perf: failed to read perf status file: /var/spool/slurmd/nersc_perf_job1932005
[2020-05-15T16:18:55.249] [1932005.batch] task/cgroup: /slurm/uid_78672/job_1932005: alloc=89088MB mem.limit=89088MB memsw.limit=unlimited
[2020-05-15T16:18:55.250] [1932005.batch] task/cgroup: /slurm/uid_78672/job_1932005/step_batch: alloc=89088MB mem.limit=89088MB memsw.limit=unlimited
[2020-05-15T16:18:55.553] [1932005.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-05-15T16:18:56.104] launch task 1932005.0 request from UID:78672 GID:78672 HOST:10.128.0.201 PORT:48850
[2020-05-15T16:18:56.105] lllp_distribution jobid [1932005] auto binding off: mask_cpu
[2020-05-15T16:18:56.162] [1932005.0] core_spec/cray_aries: init
[2020-05-15T16:18:56.199] [1932005.0] error: nersc_perf: failed to read perf status file: /var/spool/slurmd/nersc_perf_job1932005
[2020-05-15T16:18:56.246] [1932005.0] zonesort: set sort interval to 0 seconds (2 bytes written), Success
[2020-05-15T16:18:56.780] [1932005.0] zonesort: initiated memory compaction (2 bytes written), Success
[2020-05-15T16:18:56.781] [1932005.0] zonesort: found 1 numa domains
[2020-05-15T16:18:56.781] [1932005.0] zonesort: will perform zonesort for numa node 0
[2020-05-15T16:18:56.782] [1932005.0] zonesort: wrote 2 bytes to zone_sort_free_pages: Success
[2020-05-15T16:18:56.950] [1932005.0] (switch_cray_aries.c: 688: switch_p_job_init) gres_cnt: 2072 0
[2020-05-15T16:18:56.954] [1932005.0] task/cgroup: /slurm/uid_78672/job_1932005: alloc=89088MB mem.limit=89088MB memsw.limit=unlimited
[2020-05-15T16:18:56.954] [1932005.0] task/cgroup: /slurm/uid_78672/job_1932005/step_0: alloc=89088MB mem.limit=89088MB memsw.limit=unlimited
[2020-05-15T16:18:57.278] [1932005.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-05-15T16:18:57.422] [1932005.0] (switch_cray_aries.c: 824: switch_p_job_postfini) alpsc_flush_lustre: opendir(/proc/fs/lustre/ldlm/namespaces) failed, No such file or directory


and now from "journalctl -u slurmd":

May 15 12:33:43 nid00200 slurmd[33560]: Message aggregation disabled
May 15 12:33:43 nid00200 slurmd[33560]: Gres Name=craynetwork Type=(null) Count=4
May 15 12:33:43 nid00200 slurmd[33560]: Gres Name=preemptable Type=(null) Count=1
May 15 12:33:43 nid00200 slurmd[33560]: topology tree plugin loaded
May 15 12:33:43 nid00200 slurmd[33560]: route default plugin loaded
May 15 12:33:43 nid00200 slurmd[33560]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
May 15 12:33:43 nid00200 slurmd[33560]: Munge credential signature plugin loaded
May 15 12:33:43 nid00200 slurmd[33568]: slurmd version 19.05.6 started
May 15 12:33:43 nid00200 systemd[1]: slurmd.service: PID file /var/run/slurmd.pid not readable (yet?) after start: No such file or directory
May 15 12:33:43 nid00200 slurmd[33568]: error: No /var/spool/slurmd/job_container_state file for job_container/cncu state recovery
May 15 12:33:43 nid00200 slurmd[33568]: core_spec/cray_aries: init
May 15 12:33:43 nid00200 slurmd[33568]: SyscfgPath is not configured
May 15 12:33:43 nid00200 slurmd[33568]: slurmd started on Fri, 15 May 2020 12:33:43 -0700
May 15 12:33:43 nid00200 slurmd[33568]: CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96512 TmpDisk=48256 Uptime=38187 CPUSpecList=(null) FeaturesAvail=(null) FeaturesAct>
May 15 12:33:43 nid00200 systemd[1]: Started Slurm node daemon.
May 15 16:18:54 nid00200 slurmd[33568]: task_p_slurmd_batch_request: 1932005
May 15 16:18:54 nid00200 slurmd[33568]: task/affinity: job 1932005 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
May 15 16:18:54 nid00200 slurmd[33568]: task/affinity: job 1932005 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
May 15 16:18:54 nid00200 slurmd[33568]: _run_prolog: run job script took usec=238992
May 15 16:18:54 nid00200 slurmd[33568]: _run_prolog: prolog with lock for job 1932005 ran for 0 seconds
May 15 16:18:55 nid00200 slurmd[33568]: Launching batch job 1932005 for UID 78672
May 15 16:18:56 nid00200 slurmd[33568]: launch task 1932005.0 request from UID:78672 GID:78672 HOST:10.128.0.201 PORT:48850
May 15 16:18:56 nid00200 slurmd[33568]: lllp_distribution jobid [1932005] auto binding off: mask_cpu



It seems like none of the step logs are making it to the journal.

I don't see a separate option that controls the logging of job step information, which is what I need logged there at a higher level: we want to catch the timer information when an operation takes longer than 3 seconds. But if we set SlurmdDebug to verbose, then it's too, well, verbose for the RAMdisk the log file is being written to.
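
One way to check whether the step daemons are reaching the journal at all, independent of the unit filter (a sketch; I'm assuming the step entries carry slurmstepd as their process name):

# match on the process name rather than the systemd unit
journalctl _COMM=slurmstepd --since "2020-05-15 16:18"
# or match on the syslog identifier
journalctl -t slurmstepd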

Any ideas?

All the best & stay safe!
Chris
Comment 1 Gavin D. Howard 2020-05-18 11:28:41 MDT
Thank you for the report. I am looking into it.
Comment 2 Chris Samuel (NERSC) 2020-06-02 23:24:53 MDT
Hi Gavin,

Any luck with this?

All the best,
Chris
Comment 3 Gavin D. Howard 2020-06-03 09:31:21 MDT
Not yet, unfortunately.
Comment 4 Chris Samuel (NERSC) 2020-06-03 10:52:11 MDT
(In reply to Gavin D. Howard from comment #3)

> Not yet, unfortunately.

No worries!  Just to let you know that in the near future (hopefully this week) we'll be moving Gerty (our test Cray) to 20.02 in preparation for hopefully migrating Cori in the June maintenance.

Once that's done I'll retest this to see if there's any difference in behaviour.

All the best!
Chris
Comment 5 Gavin D. Howard 2020-08-10 15:36:24 MDT
Chris,

I cannot reproduce this problem. My setup matches yours, and the syslog is getting more data. I did have to run `journalctl -xe` and search for slurmd to see it all, however, so the issue may be in how `journalctl -u` filters entries.
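
For reference, the difference between the two invocations (the grep pattern is just illustrative):

# shows only entries attributed to the slurmd.service unit
journalctl -u slurmd
# searches the whole journal, including entries not tagged with the unit
journalctl -xe | grep -i slurm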
Comment 6 Nate Rini 2020-08-17 16:08:40 MDT
Chris

I have opened bug #7231 for viewing, as this is basically the same issue.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 7231 ***
Comment 7 Chris Samuel (NERSC) 2020-08-17 16:24:39 MDT
Thanks Nate!