| Summary: | pam_slurm_adopt connect spool file got permission denied | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Yuxing Peng <yuxing> |
| Component: | slurmstepd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 18.08.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10551 | ||
| Site: | University of Chicago | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 18.08.3 |
|
Description
Yuxing Peng
2018-11-27 22:14:04 MST

Comment #2 (Marshall Garey):
I'm looking into this. Could you upload a current slurm.conf file? Is this a one-off occurrence, or does pam_slurm_adopt never work?

(In reply to Marshall Garey from comment #2)
This is not a one-off occurrence; it never works. Also, when I switched to pam_slurm.so, it complained that it "cannot contact controller", even though connecting to the controller from that node works (ping, squeue, and sinfo all succeed).

The contents of slurm.conf and cgroup.conf are at the end of this message. Please note that the error was "permission denied" for the socket file that slurmstepd created. With an older version of munge (0.5.9), it also complained that the munged socket file was "permission denied"; after I upgraded munge to 0.5.11 that error disappeared and the slurmstepd error appeared. My testing environment is AWS EC2 instances, and all servers use public IPs (for testing, I opened all ports on the firewall). The OS is the Red Hat Enterprise Linux 7 AMI provided by AWS. The socket files do exist (both the munge socket and the ones in the slurmd spool). Is there any reason the OS could cause a "permission denied" on a unix socket?

/etc/slurm/slurm.conf:
ClusterName=skyway
ControlMachine=skyway.rcc.uchicago.edu
ControlAddr=skyway.rcc.uchicago.edu
SlurmUser=slurm
FastSchedule=1
TreeWidth=50
GresTypes=gpu
AuthType=auth/munge
SwitchType=switch/none
JobCompType=jobcomp/filetxt
JobSubmitPlugins=lua
PrologFlags=contain
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
CryptoType=crypto/munge
SlurmctldPort=6817
SlurmctldTimeout=300
SlurmdPort=6818
SlurmdTimeout=300
SrunPortRange=60001-63000
SlurmctldLogFile=/var/log/slurmctld.log
SlurmctldDebug=info
SlurmdLogFile=/var/log/slurmd.log
SlurmdDebug=info
SlurmdSpoolDir=/var/spool/slurm/slurmd.spool
StateSaveLocation=/var/spool/slurm/slurm.state
AccountingStorageHost=skyway.rcc.uchicago.edu
AccountingStorageEnforce=associations,limits,qos,safe
AccountingStorageTRES=gres/gpu
AccountingStorageType=accounting_storage/slurmdbd

/etc/slurm/cgroup.conf:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxRAMPercent=90
AllowedRAMSpace=90
AllowedSwapSpace=0
MaxSwapPercent=0

Comment #6 (Marshall Garey):
Thanks. I just wanted to verify your slurm configuration was correct to eliminate any obvious possibilities. Don't use pam_slurm.so. It's old and not actively maintained.
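As a side note for readers following along: the slurmstepd socket files live under SlurmdSpoolDir, so it helps to confirm which directory that actually is. The sketch below is not from the ticket; it simply parses a slurm.conf for that key (on a live node, `scontrol show config | grep SlurmdSpoolDir` reports the value the daemons are really using).

```shell
#!/bin/sh
# Sketch (not from the ticket): extract SlurmdSpoolDir from a slurm.conf,
# since that directory is where slurmstepd creates its socket files.
spool_dir_from_conf() {
    # Drop comments, match the key case-insensitively, strip whitespace.
    sed 's/#.*//' "$1" |
        awk -F= 'tolower($1) ~ /slurmdspooldir/ { gsub(/[ \t]/, "", $2); print $2 }'
}
```

With the slurm.conf shown above, this prints /var/spool/slurm/slurmd.spool.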
Yes, I do see that permission denied on the slurmstepd socket file is the current issue.
First, what user is sshd running as? I expect it to be running as root.
Could you run a new job, and then do the following?
ls -ld /path/to/slurmd/spool/dir (to see the spool directory's own permissions)
ls -l /path/to/slurmd/spool/dir (its listing should contain the slurmstepd socket files)
Assuming you have netcat installed, you can connect to a stepd socket like so:
nc -U <socket_name>
<socket_name> will be in the form <nodename>_<jobid>.<stepid>
The extern step ID is 4294967295 and the batch step ID is 4294967294; other step IDs are simply the IDs of the srun steps inside your job.
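Those two magic numbers are just the top two values of an unsigned 32-bit integer, so they can be derived rather than memorized. A small sketch (the helper name is mine, not a Slurm utility) that also builds the socket name in the form described above:

```shell
#!/bin/sh
# The special step IDs are the two largest 32-bit unsigned values.
extern_step=$(( (1 << 32) - 1 ))   # 4294967295 -> the .extern step
batch_step=$(( (1 << 32) - 2 ))    # 4294967294 -> the .batch step

# Hypothetical helper: build a socket name as <nodename>_<jobid>.<stepid>.
sock_name() {
    printf '%s_%s.%s\n' "$1" "$2" "$3"
}
```

For instance, `sock_name v1 5 "$extern_step"` produces v1_5.4294967295, matching the listing in the example that follows.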
For example,
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ ls -l
total 24
drwx------ 2 root root 4096 May 7 2018 cpu
-rw------- 1 root root 112 Dec 3 17:02 cred_state
-rw------- 1 root root 64 Dec 3 16:54 cred_state.old
-rw-r--r-- 1 root root 4148 Nov 30 15:45 hwloc_topo_whole.xml
drwxr-x--- 2 root marshall 4096 Dec 3 17:02 job00005
srwxrwxrwx 1 root root 0 Dec 3 17:02 v1_5.0
srwxrwxrwx 1 root root 0 Dec 3 17:02 v1_5.4294967294
srwxrwxrwx 1 root root 0 Dec 3 17:02 v1_5.4294967295
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ nc -U v1_5.4294967295
asdf
In my slurmd log file, I see the following message indicating that I successfully connected to the socket, but the message I sent ("asdf") was invalid.
[2018-12-03T17:05:21.492] [5.extern] error: First message must be REQUEST_CONNECT
[2018-12-03T17:05:21.492] [5.extern] debug: Leaving _handle_accept on an error
The user sshd is running as needs write permission on the socket file and on the path to it. Without write permission on either the socket file or the path, you'll get the permission denied error with nc:
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ sudo chmod 775 v1_5.4294967295
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ ls -l
total 24
drwx------ 2 root root 4096 May 7 2018 cpu
-rw------- 1 root root 112 Dec 3 17:02 cred_state
-rw------- 1 root root 64 Dec 3 16:54 cred_state.old
-rw-r--r-- 1 root root 4148 Nov 30 15:45 hwloc_topo_whole.xml
drwxr-x--- 2 root marshall 4096 Dec 3 17:02 job00005
srwxrwxrwx 1 root root 0 Dec 3 17:02 v1_5.0
srwxrwxrwx 1 root root 0 Dec 3 17:02 v1_5.4294967294
srwxrwxr-x 1 root root 0 Dec 3 17:02 v1_5.4294967295
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ nc -U v1_5.4294967295
nc: unix connect failed: Permission denied
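A denial like the one above can come from any component of the path, not just the socket inode: connect() on a unix socket needs write permission on the socket and search (x) permission on every leading directory. This sketch (not from the ticket; GNU stat assumed) prints the mode and owner of each path component so you can spot where a non-root user loses access:

```shell
#!/bin/sh
# Walk from a path up to the root, printing mode, owner:group, and name
# of each component, to locate the one blocking a non-root user.
walk_perms() {
    p=$1
    while [ -n "$p" ] && [ "$p" != "/" ]; do
        stat -c '%A %U:%G %n' "$p"
        p=$(dirname "$p")
    done
    stat -c '%A %U:%G %n' /
}
```

Running it against a stepd socket path would show, line by line, whether the spool directory or one of its parents lacks the needed permission bits.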
(In reply to Marshall Garey from comment #6)

Comment #7 (Yuxing Peng):
Hi Marshall,

Thanks again for your reply. I ran all the tests you suggested; the results are copied here.

[root@rcc-aws-t2-micro-001 ~]# ps aux | grep slurmd
root 1430 0.0 0.2 134692 2896 ? S 06:14 0:00 /usr/sbin/slurmd
root 1611 0.0 0.0 112704 964 pts/1 S+ 06:18 0:00 grep --color=auto slurmd
[root@rcc-aws-t2-micro-001 ~]# ps aux | grep slurmstepd
root 1437 0.0 0.2 128208 2556 ? Sl 06:14 0:00 slurmstepd: [207.extern]
root 1445 0.0 0.2 263392 2784 ? Sl 06:14 0:00 slurmstepd: [207.0]
root 1613 0.0 0.0 112704 972 pts/1 S+ 06:18 0:00 grep --color=auto slurmstepd
[ec2-user@rcc-aws-t2-micro-001 ~]$ ps aux | grep sshd
root 1423 0.0 0.4 112792 4268 ? Ss 06:14 0:00 /usr/sbin/sshd -D
root 1624 0.0 0.6 168896 6504 ? Ss 06:20 0:00 sshd: ec2-user [priv]
ec2-user 1627 0.0 0.2 168896 2784 ? S 06:20 0:00 sshd: ec2-user@pts/1
root 1769 0.7 0.6 168896 6500 ? Ss 07:03 0:00 sshd: ec2-user [priv]
ec2-user 1772 0.0 0.2 168896 2680 ? S 07:03 0:00 sshd: ec2-user@pts/2

[root@rcc-aws-t2-micro-001 slurmd.spool]# ls -ld
drwxr-xr-x. 2 slurm slurm 101 Dec 4 06:14 .
[root@rcc-aws-t2-micro-001 slurmd.spool]# ls -l
total 4
-rw-------. 1 root root 84 Dec 4 06:14 cred_state
srwxrwxrwx. 1 root root 0 Dec 4 06:14 rcc-aws-t2-micro-001_207.0
srwxrwxrwx. 1 root root 0 Dec 4 06:14 rcc-aws-t2-micro-001_207.4294967295

[root@rcc-aws-t2-micro-001 slurmd.spool]# nc -U rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.
[root@rcc-aws-t2-micro-001 slurmd.spool]# sudo -u yuxing nc -U rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.
[root@rcc-aws-t2-micro-001 slurmd.spool]# su - yuxing
[yuxing@rcc-aws-t2-micro-001 ~]$ cd /var/spool/slurm/slurmd.spool/
[yuxing@rcc-aws-t2-micro-001 slurmd.spool]$ nc -U rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.

As you can see from the output above, sshd is launched by root, and both root and my user (yuxing) can connect to the socket file that slurmstepd created. However, when I try to ssh to this node from another terminal session, I still get permission denied, and the node records the following in its secure log:

Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug: Reading slurm.conf file: /etc/slurm/slurm.conf
Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug: Munge authentication plugin loaded
Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug3: Success.
Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug: _step_connect: connect() failed dir /var/spool/slurm/slurmd.spool node rcc-aws-t2-micro-001 step 207.4294967295 Permission denied
Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug3: unable to connect to step 207.4294967295 on rcc-aws-t2-micro-001: Permission denied
Dec 4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
Dec 4 07:12:19 ip-172-31-36-153 sshd[1982]: pam_access(sshd:account): access denied for user `yuxing' from `skyway.rcc.uchicago.edu'

Please let me know if there's anything else I can test or provide.

Best regards,
Yuxing

(In reply to Yuxing Peng from comment #7)

Comment #8 (Yuxing Peng):
I also found something interesting here, which may provide some clues about this issue.

[root@rcc-aws-t2-micro-001 ~]# sudo -u yuxing nc -U /var/spool/slurm/slurmd.spool/rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.
[root@rcc-aws-t2-micro-001 ~]# sudo -u yuxing nc -U rcc-aws-t2-micro-001_207.4294967295
Ncat: Permission denied.
[root@rcc-aws-t2-micro-001 ~]# nc -U rcc-aws-t2-micro-001_207.4294967295
Ncat: No such file or directory.

When I test connecting to the socket as my local user (yuxing), it succeeds with the full path but fails with "permission denied" without the path. This is the first time I have seen "permission denied" from the nc command, and I was in my home folder, so the socket file was definitely not there; I don't know how Linux still finds the socket file yet reports permission denied. When I tried the same command as root (again without the path), I got a different error: "No such file or directory". I hope this information helps.

Best,
Yuxing

Comment #9 (Marshall Garey):
RE comment 8 - that's interesting and confusing. I wouldn't expect that. Trying it out myself, I see "no such file or directory" whether I'm user root or marshall.

> [root@rcc-aws-t2-micro-001 slurmd.spool]# ls -ld
> drwxr-xr-x. 2 slurm slurm 101 Dec 4 06:14 .

The slurmd spool directory ownership is slurm:slurm. It should be root:root. I'm guessing the reason you get permission denied is that root doesn't have write permissions on the directory, but I'm not totally sure. Can you try changing the ownership of that directory to root:root, leaving the permissions the same?

Also, we take our severity levels seriously, and severity 1 and 2 tickets disrupt the normal workflow of the support team, so please use the most accurate severity level for tickets. From our support page (https://www.schedmd.com/support.php):

"Severity 2 — High Impact: A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system."

This does not seem like a severity 2 bug. I've changed the ticket to sev-3 again, but please correct me if I'm wrong.

If changing the ownership of the slurmd spool directory to root:root doesn't work, you can always temporarily disable pam_slurm_adopt to allow ssh'ing to the compute nodes until you do get pam_slurm_adopt working.

(In reply to Marshall Garey from comment #9)

Hi Marshall,

Thanks very much for the help. We finally figured out the issue. The Amazon cloud images turn on SELinux in enforcing mode, which restricts access requests from the login procedure; on most other OS image sources this defaults to permissive mode. I hope this case provides useful information the next time someone reports a permission denied issue. This ticket can be closed now.

Best regards,
Yuxing

Marshall Garey:
Thanks very much for the information. That will definitely help us in the future.

FYI, we don't have you listed as a supported contact for Chicago. We only have Mengxing and Stephen listed. In the future, we ask that they be the ones that submit tickets. Or, if they plan on having you submit tickets in the future, please contact Jacob Jenson at jacob@schedmd.com to see about adding you to the supported contacts.