| Summary: | Error when restarting slurmctld after 21.08.5 upgrade | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | HPC Admin <hpcadmin> |
| Component: | slurmctld | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Auburn | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Keenan - Slurm is configured to run as "SlurmUser", and if that user can not access locations such as "/var/run/" or "/var/log", then you will see access errors.

> Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No su...ctory
> Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: error: chdir(/var/log): Permission denied

Please check which user slurmctld is set up to run as by looking at the slurm.conf's SlurmUser setting. You may also want to set up directories under those locations that the SlurmUser can access.

Thanks again. You can close this ticket.

Keenan

Resolving.
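The check suggested above can be sketched in Python. This is only a rough illustration, not a Slurm tool: the helper names `parse_slurm_user`, `can_enter`, and `check_dir` are made up for this example. It assumes plain owner/group/other permission bits (no ACLs, SELinux, or NFS root squashing) and relies on Slurm's documented default of `root` when `SlurmUser` is unset.

```python
import os
import pwd
import re
import stat


def parse_slurm_user(conf_text, default="root"):
    """Return the SlurmUser value from slurm.conf text (hypothetical helper).

    Falls back to "root", which is what Slurm uses when SlurmUser is unset.
    """
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.match(r"SlurmUser\s*=\s*(\S+)", line, re.IGNORECASE)
        if match:
            return match.group(1)
    return default


def can_enter(mode, owner_uid, owner_gid, uid, gids):
    """Approximate whether a user may chdir() into a directory.

    Only the classic permission bits are checked; chdir needs the
    search (x) bit for whichever class applies to the user.
    """
    if uid == 0:  # root normally bypasses the search-permission check
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)


def check_dir(path, username):
    """Look up the user and the directory, then apply can_enter()."""
    pw = pwd.getpwnam(username)
    gids = set(os.getgrouplist(username, pw.pw_gid))
    st = os.stat(path)
    return can_enter(st.st_mode, st.st_uid, st.st_gid, pw.pw_uid, gids)
```

For example, `check_dir("/var/log", parse_slurm_user(open(path).read()))`, with `path` pointing at the site's slurm.conf, would report whether the configured user can enter `/var/log`. A `False` result would be consistent with the `chdir(/var/log): Permission denied` message.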
Hello,

We just upgraded a test cluster from 20.11.8 to 21.08.5 and encountered an error that we haven't seen before when restarting slurmctld: 'error: chdir(/var/log): Permission denied'. Here's where we found it:

    hpcmgt:utility > systemctl status slurmctld
    ● slurmctld.service - Slurm controller daemon
       Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
       Active: active (running) since Thu 2022-02-24 14:09:59 CST; 29min ago
      Process: 4800 ExecStart=/mnt/nfs01/shared/apps/slurm/current/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
     Main PID: 4803 (slurmctld)
       CGroup: /system.slice/slurmctld.service
               ├─4803 /mnt/nfs01/shared/apps/slurm/current/sbin/slurmctld
               └─4804 slurmctld: slurmscriptd

    Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Starting Slurm controller daemon...
    Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No su...ctory
    Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: error: chdir(/var/log): Permission denied
    Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu systemd[1]: Started Slurm controller daemon.
    Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: Job accounting information stored, but details not gathered
    Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: slurmctld version 21.08.5 started on cluster casic
    Feb 24 14:09:59 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctl...rmdbd
    Feb 24 14:10:00 hpcmgt.casic.hpc.auburn.edu slurmctld[4803]: No memory enforcing mechanism configured.

Started slurmctld in debug mode, but didn't see anything more than provided by the above command. Didn't see anything unusual with permissions either. But everything seems to be running normally as we've submitted jobs, monitored them with various commands along with their usage.

Any idea what this error is and if it needs to be resolved?

Thanks.

Keenan