Ticket 13511

Summary: Error when restarting slurmctld after 21.08.5 upgrade
Product: Slurm Reporter: HPC Admin <hpcadmin>
Component: slurmctld Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: Auburn Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description HPC Admin 2022-02-24 14:03:18 MST
Hello,

We just upgraded a test cluster from 20.11.8 to 21.08.5 and encountered an error that we haven't seen before when restarting slurmctld:
'error: chdir(/var/log): Permission denied'.

Here's where we found it:

hpcmgt:utility > systemctl status slurmctld

● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-02-24 14:09:59 CST; 29min ago
  Process: 4800 ExecStart=/mnt/nfs01/shared/apps/slurm/current/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 4803 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           ├─4803 /mnt/nfs01/shared/apps/slurm/current/sbin/slurmctld
           └─4804 slurmctld: slurmscriptd

Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Starting Slurm controller daemon...
Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No su...ctory

Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: error: chdir(/var/log): Permission denied
Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Started Slurm controller daemon.
Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: Job accounting information stored, but details not gathered
Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: slurmctld version 21.08.5 started on cluster casic
Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctl...rmdbd
Feb 24 14:10:00 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: No memory enforcing mechanism configured.


We started slurmctld in debug mode but didn't see anything beyond what the command above shows. We didn't see anything unusual with permissions either.

Everything seems to be running normally, though: we've submitted jobs and monitored them and their usage with various commands.

Any idea what this error is and if it needs to be resolved?

Thanks.
Keenan
Comment 1 Jason Booth 2022-02-24 15:20:26 MST
Keenan - 

Slurm is configured to run as "SlurmUser", and if that user cannot access locations such as "/var/run/" or "/var/log", then you will see access errors.
 

> Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No su...ctory

> Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: error: chdir(/var/log): Permission denied


Please check which user slurmctld is set up to run as by looking at the SlurmUser setting in slurm.conf. You may also want to set up directories under those locations that the SlurmUser can access.
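
The check described above could be scripted roughly as follows. This is only a sketch, not anything shipped with Slurm: the function names (parse_slurm_user, can_chdir, check) are made up for illustration, and it looks only at the classic owner/group/other mode bits of the directory itself, so it ignores ACLs, SELinux, and the permissions of parent directories. It assumes SlurmUser falls back to root when it is not set in slurm.conf.

```python
import os
import pwd
import re
import stat


def parse_slurm_user(conf_text, default="root"):
    """Return the SlurmUser value found in slurm.conf text.

    Keys are matched case-insensitively and '#' comments are ignored.
    Falls back to `default` (assumed to be root) if the key is absent.
    """
    user = default
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"(?i)slurmuser\s*=\s*(\S+)", line)
        if match:
            user = match.group(1)
    return user


def can_chdir(mode, owner_uid, owner_gid, uid, gids):
    """True if uid/gids has search (execute) permission per the mode bits.

    chdir() into a directory needs the execute bit for the class the
    user falls into: owner first, then group, then other.
    """
    if uid == 0:
        return True  # root bypasses directory permission checks
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)


def check(conf_path, directory):
    """Hypothetical helper: report whether SlurmUser can chdir to `directory`."""
    with open(conf_path) as conf:
        user = parse_slurm_user(conf.read())
    account = pwd.getpwnam(user)
    gids = os.getgrouplist(user, account.pw_gid)
    info = os.stat(directory)
    allowed = can_chdir(info.st_mode, info.st_uid, info.st_gid,
                        account.pw_uid, gids)
    return user, allowed
```

For example, calling check() with the path to your slurm.conf and "/var/log" returns the configured user and whether the mode bits on /var/log let that user change into it. The location of slurm.conf depends on how Slurm was installed, so the path has to be supplied for your site.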
Comment 2 HPC Admin 2022-02-25 09:22:09 MST
Thanks again. You can close this ticket.
Keenan
Comment 3 Jason Booth 2022-02-25 10:27:33 MST
Resolving.