Summary: | auth/jwt: Could not load key file | ||
---|---|---|---|
Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
Component: | slurmctld | Assignee: | Nate Rini <nate> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | nate |
Version: | 23.02.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | GSK | Slinky Site: | --- |
Alineos Sites: | --- | Atos/Eviden Sites: | --- |
Confidential Site: | --- | Coreweave sites: | --- |
Cray Sites: | --- | DS9 clusters: | --- |
Google sites: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | ---
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | RHEL | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: |
slurm.conf
slurmctld service
the strace log |
Description
GSK-ONYX-SLURM
2023-07-27 07:36:33 MDT
(In reply to GSK-EIS-SLURM from comment #0)
> I decided to change permissions of the key from 700 to 755 and the service
> started working.
>
> Do I need to change the permission of the key across all the clusters?

The JWT key should never be visible to the world. Anyone who can read the key
will effectively have root access to the cluster. The most likely cause of the
issue was an incorrect user/group. Please provide the output of:

> systemctl show slurmctld.service

and attach slurm.conf.

Created attachment 31481 [details]
slurm.conf
Created attachment 31482 [details]
slurmctld service
Please run the following and paste the log:
> stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> chown slurm /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> chmod 0660 /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> sudo -u slurm stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> sudo -u nobody stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
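[Editor's note: the intent behind the commands above can be sketched as a small script. This is not from the ticket; the path /tmp/jwt_demo.key is a stand-in for the real StateSaveLocation key, and 0600 is the strictest choice (the ticket uses 0660 with group slurm).]

```shell
#!/bin/sh
# Sketch: a JWT key must not be readable by group/others, since anyone who
# can read it effectively has root on the cluster.
# KEYFILE is a demo placeholder, not the real StateSaveLocation path.
KEYFILE=/tmp/jwt_demo.key

# Simulate the reporter's situation: key left with mode 0755.
touch "$KEYFILE"
chmod 0755 "$KEYFILE"

mode=$(stat -c '%a' "$KEYFILE")
# If group or others have any access bits set, tighten to 0600.
if [ "$((0$mode & 077))" -ne 0 ]; then
    echo "key mode $mode is too open; tightening to 0600"
    chmod 0600 "$KEYFILE"
fi

stat -c '%a' "$KEYFILE"
```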
(In reply to Nate Rini from comment #5)
> Please run the following and paste the log:
>
> > stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> > chown slurm /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> > chmod 0660 /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> > stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> > sudo -u slurm stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> > sudo -u nobody stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key

Here you are:

-bash-4.2$ stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
  File: ‘/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key’
  Size: 1679        Blocks: 8          IO Block: 8192   regular file
Device: 2dh/45d     Inode: 76744497    Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (63124/   slurm)   Gid: (63124/   slurm)
Access: 2022-05-10 10:08:56.308967000 +0100
Modify: 2021-01-12 10:22:48.753907000 +0000
Change: 2023-07-27 16:45:22.494768000 +0100
 Birth: -
-bash-4.2$ chown slurm /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
-bash-4.2$ chmod 0660 /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
-bash-4.2$ stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
  File: ‘/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key’
  Size: 1679        Blocks: 8          IO Block: 8192   regular file
Device: 2dh/45d     Inode: 76744497    Links: 1
Access: (0660/-rw-rw----)  Uid: (63124/   slurm)   Gid: (63124/   slurm)
Access: 2022-05-10 10:08:56.308967000 +0100
Modify: 2021-01-12 10:22:48.753907000 +0000
Change: 2023-07-27 16:45:45.289773000 +0100
 Birth: -
-bash-4.2$ sudo -u slurm stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
sudo: Password expired, contact your system administrator
-bash-4.2$ sudo -u nobody stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for slurm:
-bash-4.2$

To be honest I don't know the password for the slurm user; I have never needed
it. I'm going to request the password to be restored.

(In reply to GSK-EIS-SLURM from comment #6)
> To be honest I don't know the password for the slurm user; I have never
> needed it. I'm going to request the password to be restored.

There is no reason for the slurm user to have a password (and several reasons
for it not to). It appears the sudoers on this cluster is strict, which is
fine, so please try this instead:

> su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" slurm
> su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody

(In reply to Nate Rini from comment #7)
> please try this instead:
> > su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" slurm
> > su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody

[root@uk1sxlx00128 ~]# su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" slurm
  File: '/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key'
  Size: 1679        Blocks: 8          IO Block: 8192   regular file
Device: 2dh/45d     Inode: 76744497    Links: 1
Access: (0660/-rw-rw----)  Uid: (63124/   slurm)   Gid: (63124/   slurm)
Access: 2022-05-10 10:08:56.308967000 +0100
Modify: 2021-01-12 10:22:48.753907000 +0000
Change: 2023-07-27 16:45:45.289773000 +0100
 Birth: -
[root@uk1sxlx00128 ~]# su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
This account is currently not available.
[root@uk1sxlx00128 ~]#

(In reply to GSK-EIS-SLURM from comment #8)
> [root@uk1sxlx00128 ~]# su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" slurm
> Access: (0660/-rw-rw----)  Uid: (63124/   slurm)   Gid: (63124/   slurm)

This looks correct. Please try restarting slurmctld or slurmdbd to verify
access is now correct.
> [root@uk1sxlx00128 ~]# su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
> This account is currently not available.

This was to check that other users can't access the file. If possible, please
try this with a normal user.

(In reply to Nate Rini from comment #9)
> This looks correct. Please try restarting slurmctld or slurmdbd to verify
> access is now correct.

I restarted both and it's still the same:

[root@uk1sxlx00128 ~]# systemctl -l status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2023-07-28 17:58:03 BST; 31s ago
  Process: 39640 ExecStart=/home/slurm/Software/RHEL7/slurm/23.02.3/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 39640 (code=exited, status=1/FAILURE)

Jul 28 17:58:03 uk1sxlx00128.corpnet2.com systemd[1]: Started Slurm controller daemon.
Jul 28 17:58:03 uk1sxlx00128.corpnet2.com slurmctld[39640]: slurmctld: fatal: auth/jwt: Could not load key file (/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key)
Jul 28 17:58:03 uk1sxlx00128.corpnet2.com slurmctld[39640]: fatal: auth/jwt: Could not load key file (/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key)
Jul 28 17:58:03 uk1sxlx00128.corpnet2.com systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Jul 28 17:58:03 uk1sxlx00128.corpnet2.com systemd[1]: Unit slurmctld.service entered failed state.
Jul 28 17:58:03 uk1sxlx00128.corpnet2.com systemd[1]: slurmctld.service failed.
[root@uk1sxlx00128 ~]#

> This was to check that other users can't access the file. If possible,
> please try this with a normal user.

When I execute this command as me, there's a password prompt:

rd178639@uk1sxlx00128 ~ su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
Password:

(In reply to GSK-EIS-SLURM from comment #10)
> I restarted both and it's still the same:
>
> Jul 28 17:58:03 uk1sxlx00128.corpnet2.com slurmctld[39640]: fatal: auth/jwt:
> Could not load key file (/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key)

This error is from slurmctld being unable to open and memory-map the
jwt_hs256.key file. We can use strace to see if we can find out why:

> strace -o strace.log -e openat,open,mmap -- /home/slurm/Software/RHEL7/slurm/23.02.3/sbin/slurmctld -D $SLURMCTLD_OPTIONS

Note that $SLURMCTLD_OPTIONS will need to be filled in with the value from
/etc/sysconfig/slurmctld (if it exists). Please attach the strace.log.

> When I execute this command as me, there's a password prompt:
>
> rd178639@uk1sxlx00128 ~ su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
> Password:

These commands need to be executed as root.

(In reply to Nate Rini from comment #11)
> This error is from slurmctld being unable to open and memory-map the
> jwt_hs256.key file. We can use strace to see if we can find out why:
> > strace -o strace.log -e openat,open,mmap -- /home/slurm/Software/RHEL7/slurm/23.02.3/sbin/slurmctld -D $SLURMCTLD_OPTIONS

I'm attaching the strace log.

> > rd178639@uk1sxlx00128 ~ su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
> > Password:
>
> These commands need to be executed as root.

I had executed it as root and then I was told to do it as a normal user;
that's why I tried to execute it from my account. The output when it's
executed as root:

[root@uk1sxlx00128 ~]# su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
This account is currently not available.
[root@uk1sxlx00128 ~]#

Created attachment 31526 [details]
the strace log
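[Editor's note: once an strace log like the attachment exists, the failing syscall can be isolated with a grep, as the next comment does by hand. A sketch with a fabricated log; the log path and contents are for demonstration only.]

```shell
#!/bin/sh
# Sketch: pull the syscall lines touching the key file out of an strace log
# and keep only those that returned an error (-1). LOG and its contents are
# fabricated to mirror the attachment in this ticket.
LOG=/tmp/strace_demo.log
cat > "$LOG" <<'EOF'
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
EOF

# Any syscall on the key file that failed:
grep 'jwt_hs256.key' "$LOG" | grep -- '= -1'
```

Running this against the fabricated log prints the single EACCES line, which is exactly the evidence used in the next comment.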
(In reply to GSK-EIS-SLURM from comment #13)
> Created attachment 31526 [details]
> the strace log
>
> open("/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)

Permissions are still failing. Is this a shared filesystem? Is SELinux active?

Please call (as root):
> namei /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> sestatus

(In reply to GSK-EIS-SLURM from comment #12)
> I had executed it as root and then I was told to do it as a normal user;
> that's why I tried to execute it from my account.

I wanted to verify the result for both types of users. When su/sudo is called
by a normal user who is not normally allowed to impersonate the other user,
it activates the PAM configuration, which is what happened here:

> rd178639@uk1sxlx00128 ~ su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
> Password:

> The output when it's executed as root:
>
> [root@uk1sxlx00128 ~]# su -c "stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key" nobody
> This account is currently not available.

Looks like nobody has nologin set up too. Please call this as rd178639 instead
(assuming this user doesn't have any special permissions):

> stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key

(In reply to Nate Rini from comment #14)
> Permissions are still failing. Is this a shared filesystem? Is SELinux
> active?

Yes, this is NFS, and no, SELinux is disabled.

> Please call (as root):
> > namei /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> > sestatus

[root@uk1sxlx00128 ~]# namei /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
f: /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
 d /
 d home
 d uk_hpc_crash
 d StateSaveLocation
 - jwt_hs256.key
[root@uk1sxlx00128 ~]# sestatus
SELinux status:                 disabled
[root@uk1sxlx00128 ~]#

> Looks like nobody has nologin set up too. Please call this as rd178639
> instead (assuming this user doesn't have any special permissions):
> > stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key

rd178639@uk1sxlx00128 ~ stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
  File: '/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key'
  Size: 1679        Blocks: 8          IO Block: 8192   regular file
Device: 2dh/45d     Inode: 76744497    Links: 1
Access: (0660/-rw-rw----)  Uid: (63124/   slurm)   Gid: (63124/   slurm)
Access: 2022-05-10 10:08:56.308967000 +0100
Modify: 2021-01-12 10:22:48.753907000 +0000
Change: 2023-07-27 16:45:45.289773000 +0100
 Birth: -
rd178639@uk1sxlx00128 ~

(In reply to GSK-EIS-SLURM from comment #15)
> (In reply to Nate Rini from comment #14)
> > Permissions are still failing. Is this a shared filesystem? Is SELinux
> > active?
>
> Yes, this is NFS, and no, SELinux is disabled.

Is rootsquash enabled? Is this NFS 3 or 4? Is the lock daemon running? Is
Kerberos being used for auth?

Do slurmctld and slurmdbd run on the same host? Are they being run as the
same user? The config in comment#3 doesn't have a user= entry, so it looks
like it is being started as root. I would suggest configuring both daemons'
systemd unit files based on the file's user/group:

> User=slurm
> Group=slurm

> > Looks like nobody has nologin set up too. Please call this as rd178639
> > instead (assuming this user doesn't have any special permissions):
> > > stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
>
> rd178639@uk1sxlx00128 ~ stat /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
>   File: '/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key'
> Access: (0660/-rw-rw----)  Uid: (63124/   slurm)   Gid: (63124/   slurm)

This user can read the parent directory, so let's see if it can read the
file. Please call as rd178639:

> file /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key

You could also just hexdump the file, but please don't post it here.

(In reply to Nate Rini from comment #16)
> Is rootsquash enabled? Is this NFS 3 or 4?
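[Editor's note: the User=/Group= suggestion is usually applied via a systemd drop-in rather than editing the vendor unit. A sketch: the drop-in directory here is a demo path; on a real host it would be /etc/systemd/system/slurmctld.service.d/override.conf, followed by "systemctl daemon-reload && systemctl restart slurmctld".]

```shell
#!/bin/sh
# Sketch: create a systemd drop-in so slurmctld runs as the key's owner.
# DROPIN_DIR is a demo placeholder, not the real systemd directory.
DROPIN_DIR=/tmp/slurmctld.service.d
mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/override.conf" <<'EOF'
[Service]
User=slurm
Group=slurm
EOF
cat "$DROPIN_DIR/override.conf"
```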
These are all the options the share is mounted with:

/home/slurm type nfs (rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.184.24.115,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=10.184.24.115)

> Is the lock daemon running? Is Kerberos being used for auth?

No and no.

> Do slurmctld and slurmdbd run on the same host?

Yes.

> Are they being run as the same user? The config in comment#3 doesn't have a
> user= entry, so it looks like it is being started as root.

Yes, it's root.

> I would suggest configuring both daemons' systemd unit files based on the
> file's user/group:
> > User=slurm
> > Group=slurm

Once added, the slurmctld daemon started working:

[root@uk1sxlx00128 ~]# systemctl restart slurmctld.service
[root@uk1sxlx00128 ~]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-08-01 06:59:26 BST; 5s ago
 Main PID: 32511 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           ├─32511 /home/slurm/Software/RHEL7/slurm/23.02.3/sbin/slurmctld -D
           └─32512 slurmctld: slurmscriptd

Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: Down nodes: uk1salx00717
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: Recovered information about 0 jobs
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: select/cons_res: part_data_create_array: select/cons_res: preparing for 1 partitions
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: Recovered state of 0 reservations
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: State of 0 triggers recovered
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: select/cons_res: select_p_reconfigure: select/cons_res: reconfigure
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: select/cons_res: part_data_create_array: select/cons_res: preparing for 1 partitions
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: Running as primary controller
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: No parameter for mcs plugin, default values set
Aug 01 06:59:31 uk1sxlx00128.corpnet2.com slurmctld[32511]: slurmctld: mcs: MCSParameters = (null). ondemand set.
[root@uk1sxlx00128 ~]#

> This user can read the parent directory, so let's see if it can read the
> file. Please call as rd178639:
> > file /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key

-bash-4.2$ file /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
/home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key: PEM RSA private key
-bash-4.2$

> You could also just hexdump the file, but please don't post it here.

Yes, I can.

It seems the problem is now resolved. I will make sure the slurmctld and
slurmdbd daemons are running as the slurm user on all the clusters.

Thanks a lot for your support!
Radek

(In reply to GSK-EIS-SLURM from comment #17)
> > This user can read the parent directory, so let's see if it can read the
> > file. Please call as rd178639:
> > > file /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> -bash-4.2$ file /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key
> /home/uk_hpc_crash/StateSaveLocation/jwt_hs256.key: PEM RSA private key
> > You could also just hexdump the file, but please don't post it here.
> Yes, I can.

Just make sure that normal users can't read jwt_hs256.key, or they will have
root over the cluster.

> It seems the problem is now resolved. I will make sure the slurmctld and
> slurmdbd daemons are running as the slurm user on all the clusters.

Closing out ticket.

(In reply to Nate Rini from comment #16)
> it is being started as root. I would suggest configuring both daemons'
> systemd unit files based on the file's user/group:
> > User=slurm
> > Group=slurm

Hi Nate -- one quick question on this. I've already added a user and a group
to the slurmctld and slurmdbd systemd config files. However, I noticed that
the following warning appears after the service restart:

Aug 11 00:58:11 us1salx09012.corpnet2.com slurmdbd[3529030]: slurmdbd: Not running as root. Can't drop supplementary groups

Just wanted to check with you whether this could potentially cause an issue,
and which groups it is about?

Thanks,
Radek

(In reply to GSK-EIS-SLURM from comment #19)
> I've already added a user and a group to the slurmctld and slurmdbd systemd
> config files. However, I noticed that the following warning appears after
> the service restart:
>
> Aug 11 00:58:11 us1salx09012.corpnet2.com slurmdbd[3529030]: slurmdbd: Not
> running as root. Can't drop supplementary groups
>
> Just wanted to check with you whether this could potentially cause an
> issue, and which groups it is about?

It is a warning about not being able to drop the supplementary group IDs.
This warning predates systemd and was there to ensure that the old SysV init
didn't leave extra groups on the process that could allow a user to attach to
the daemon using ptrace and cause security issues.

I've opened bug#17412 to modify this log message.
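[Editor's note: a quick way to see the supplementary groups the warning refers to. Every process carries a primary group plus a list of extra group IDs inherited from its parent; dropping them via setgroups() requires root, which is why a daemon started directly as a non-root user logs this warning and simply continues.]

```shell
#!/bin/sh
# Sketch: show the current process's primary and supplementary groups.
echo "primary group: $(id -g)"
echo "supplementary groups: $(id -G)"
```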