Ticket 12249

Summary: No syslog output from slurmstepd when slurmd runs in foreground
Product: Slurm
Reporter: David Gloe <david.gloe>
Component: slurmstepd
Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10625
https://bugs.schedmd.com/show_bug.cgi?id=10922
https://bugs.schedmd.com/show_bug.cgi?id=7231
https://bugs.schedmd.com/show_bug.cgi?id=12493
Site: CRAY
Cray Sites: Cray Internal
Ticket Depends on:    
Ticket Blocks: 7231    

Description David Gloe 2021-08-10 14:22:27 MDT
We have Slurm configured to log to syslog by leaving SlurmdLogFile unset in slurm.conf. After upgrading to Slurm 20.11, we no longer saw any logging from slurmd at all, most likely because the systemd unit file was changed to run slurmd in the foreground.

Adding SlurmdSyslogDebug=info to slurm.conf brings the slurmd logs back, but there is still nothing from slurmstepd. For example, here's what we get in syslog for a typical job step:

Aug 10 15:19:09 nid001418 slurmd[74577]: launch task StepId=80870.0 request from UID:0 GID:0 HOST:10.100.1.96 PORT:51726
Aug 10 15:19:09 nid001418 slurmd[74577]: task/affinity: lllp_distribution: JobId=80870 auto binding off: mask_cpu

And here's what we get in a log file:

[2021-08-10T15:19:09.989] launch task StepId=80870.0 request from UID:0 GID:0 HOST:10.100.1.96 PORT:51726
[2021-08-10T15:19:09.989] task/affinity: lllp_distribution: JobId=80870 auto binding off: mask_cpu
[2021-08-10T15:19:10.777] [80870.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_80870: alloc=227328MB mem.limit=227328MB memsw.limit=unlimited
[2021-08-10T15:19:10.777] [80870.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_80870/step_0: alloc=227328MB mem.limit=227328MB memsw.limit=unlimited
[2021-08-10T15:19:10.859] [80870.0] done with job

This looks similar to bug 2631.
Comment 2 Michael Hinton 2021-08-11 11:22:45 MDT
Hi David,

This is a known issue when running the slurmd systemd service in the foreground, which is what the slurmd service file does starting in 20.11. To revert to the old behavior, change `Type` back to `forking`, remove `-D` from the ExecStart command, and add `PIDFile=/var/run/slurmd.pid` back in. See https://bugs.schedmd.com/show_bug.cgi?id=10625#c10 for more context.
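
As a rough sketch of that workaround, a systemd drop-in along the following lines should be close. The drop-in path, the slurmd binary location (/usr/sbin/slurmd), and the $SLURMD_OPTIONS variable are assumptions here; copy the real values from your installed slurmd.service, and make sure PIDFile matches SlurmdPidFile in slurm.conf.

```ini
# /etc/systemd/system/slurmd.service.d/override.conf  (assumed drop-in path)
[Service]
# Go back to the pre-20.11 daemonizing behavior
Type=forking
# Must match SlurmdPidFile in slurm.conf
PIDFile=/var/run/slurmd.pid
# Clear the packaged ExecStart, then redefine it without -D
# (binary path and $SLURMD_OPTIONS are assumptions; take them from your unit file)
ExecStart=
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
```

After adding the drop-in, run `systemctl daemon-reload` and restart the slurmd service for the change to take effect.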

We are still looking into a permanent fix for users who rely on syslog output, but for now, let me know if this workaround works for you.

Thanks!
-Michael
Comment 5 Michael Hinton 2021-10-22 12:41:08 MDT
Hi David,

I'm going to go ahead and merge this into bug 10625. Feel free to continue engaging with us there.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 10625 ***