Ticket 6104

Summary: pam_slurm_adopt gets "permission denied" connecting to the slurmstepd spool socket file
Product: Slurm
Reporter: Yuxing Peng <yuxing>
Component: slurmstepd
Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Version: 18.08.3
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10551
Site: University of Chicago
Version Fixed: 18.08.3

Description Yuxing Peng 2018-11-27 22:14:04 MST
When trying to SSH to a node where I have an allocated job, PAM denies access with "Access denied by pam_slurm_adopt: you have no active jobs on this node". From the log it appears that pam_slurm_adopt is denied permission on the slurmstepd socket file. All logs (level debug5) are provided below.

1. The log information from /var/log/secure

Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug4: found jobid = 205, stepid = 4294967295
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug4: found jobid = 205, stepid = 0
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug:  Munge authentication plugin loaded
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug3: Success.
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug:  _step_connect: connect() failed dir /var/spool/slurm/slurmd.spool node rcc-aws-t2-micro-002 step 2
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: debug3: unable to connect to step 205.4294967295 on rcc-aws-t2-micro-002: Permission denied
Nov 28 04:55:59 ip-172-31-27-106 pam_slurm_adopt[21449]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
Nov 28 04:55:59 ip-172-31-27-106 sshd[21449]: pam_access(sshd:account): access denied for user `yuxing' from `skyway.rcc.uchicago.edu'
Nov 28 04:55:59 ip-172-31-27-106 sshd[21449]: fatal: Access denied for user yuxing by PAM account configuration [preauth]

2. /var/log/slurmd.log

[2018-11-28T04:55:55.102] debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
[2018-11-28T04:55:55.102] [205.0] debug3: Entered job_manager for 205.0 pid=21441
[2018-11-28T04:55:55.102] [205.0] debug3: Trying to load plugin /usr/lib64/slurm/core_spec_none.so
[2018-11-28T04:55:55.102] [205.0] debug3: Success.
[2018-11-28T04:55:55.102] [205.0] debug3: Trying to load plugin /usr/lib64/slurm/proctrack_cgroup.so
[2018-11-28T04:55:55.102] [205.0] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2018-11-28T04:55:55.102] [205.0] debug3: Success.
[2018-11-28T04:55:55.102] [205.0] debug3: Trying to load plugin /usr/lib64/slurm/task_cgroup.so
[2018-11-28T04:55:55.102] [205.0] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2018-11-28T04:55:55.102] [205.0] debug:  task/cgroup: now constraining jobs allocated cores
[2018-11-28T04:55:55.102] [205.0] debug3: xcgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory'
[2018-11-28T04:55:55.102] [205.0] debug:  task/cgroup/memory: total:990M allowed:90%(enforced), swap:0%(enforced), max:90%(891M) max+swap:0%(891M) min:30M kmem:100%(990M permissive) min:30M swappiness:0(unset)
[2018-11-28T04:55:55.102] [205.0] debug:  task/cgroup: now constraining jobs allocated memory
[2018-11-28T04:55:55.102] [205.0] debug:  task/cgroup: loaded
[2018-11-28T04:55:55.102] [205.0] debug3: Success.
[2018-11-28T04:55:55.102] [205.0] debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
[2018-11-28T04:55:55.102] [205.0] debug3: Success.
[2018-11-28T04:55:55.102] [205.0] debug:  Checkpoint plugin loaded: checkpoint/none
[2018-11-28T04:55:55.103] [205.0] debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
[2018-11-28T04:55:55.103] [205.0] Munge cryptographic signature plugin loaded
[2018-11-28T04:55:55.103] [205.0] debug3: Success.
[2018-11-28T04:55:55.103] [205.0] debug3: Trying to load plugin /usr/lib64/slurm/job_container_none.so
[2018-11-28T04:55:55.103] [205.0] debug:  job_container none plugin loaded
[2018-11-28T04:55:55.103] [205.0] debug3: Success.
[2018-11-28T04:55:55.103] [205.0] debug:  mpi type = none
[2018-11-28T04:55:55.103] [205.0] debug3: Trying to load plugin /usr/lib64/slurm/mpi_none.so
[2018-11-28T04:55:55.103] [205.0] debug3: Success.
[2018-11-28T04:55:55.103] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/freezer/slurm'
[2018-11-28T04:55:55.103] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm/uid_24642' already exists
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/freezer/slurm/uid_24642'
[2018-11-28T04:55:55.103] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm/uid_24642/job_205' already exists
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/freezer/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/freezer/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/freezer/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.103] [205.0] debug2: Before call to spank_init()
[2018-11-28T04:55:55.103] [205.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2018-11-28T04:55:55.103] [205.0] debug2: After call to spank_init()
[2018-11-28T04:55:55.103] [205.0] debug:  mpi type = (null)
[2018-11-28T04:55:55.103] [205.0] debug:  mpi/none: slurmstepd prefork
[2018-11-28T04:55:55.103] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm'
[2018-11-28T04:55:55.103] [205.0] debug3: slurm cgroup /slurm successfully created for ns cpuset: File exists
[2018-11-28T04:55:55.103] [205.0] debug2: _file_read_content: unable to open '/sys/fs/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
[2018-11-28T04:55:55.103] [205.0] debug2: xcgroup_get_param: unable to get parameter 'cpus' for '/sys/fs/cgroup/cpuset/slurm'
[2018-11-28T04:55:55.103] [205.0] debug:  task/cgroup: job abstract cores are '0'
[2018-11-28T04:55:55.103] [205.0] debug:  task/cgroup: step abstract cores are '0'
[2018-11-28T04:55:55.103] [205.0] debug:  task/cgroup: job physical cores are '0'
[2018-11-28T04:55:55.103] [205.0] debug:  task/cgroup: step physical cores are '0'
[2018-11-28T04:55:55.103] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_24642' already exists
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642'
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642'
[2018-11-28T04:55:55.103] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205' already exists
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.103] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/memory/slurm'
[2018-11-28T04:55:55.104] [205.0] debug3: slurm cgroup /slurm successfully created for ns memory: File exists
[2018-11-28T04:55:55.104] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_24642' already exists
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/memory/slurm/uid_24642'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_24642'
[2018-11-28T04:55:55.104] [205.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_24642/job_205' already exists
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_uint64_param: parameter 'memory.limit_in_bytes' set to '934281216' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_uint64_param: parameter 'memory.soft_limit_in_bytes' set to '934281216' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_uint64_param: parameter 'memory.memsw.limit_in_bytes' set to '934281216' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205'
[2018-11-28T04:55:55.104] [205.0] task/cgroup: /slurm/uid_24642/job_205: alloc=0MB mem.limit=891MB memsw.limit=891MB
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_uint64_param: parameter 'memory.limit_in_bytes' set to '934281216' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_uint64_param: parameter 'memory.soft_limit_in_bytes' set to '934281216' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] debug3: xcgroup_set_uint64_param: parameter 'memory.memsw.limit_in_bytes' set to '934281216' for '/sys/fs/cgroup/memory/slurm/uid_24642/job_205/step_0'
[2018-11-28T04:55:55.104] [205.0] task/cgroup: /slurm/uid_24642/job_205/step_0: alloc=0MB mem.limit=891MB memsw.limit=891MB
[2018-11-28T04:55:55.104] [205.0] debug:  _oom_event_monitor: started.
[2018-11-28T04:55:55.104] [205.0] debug2: profile signaling type Task
[2018-11-28T04:55:55.105] [205.0] debug:  Message thread started pid = 21441
[2018-11-28T04:55:55.105] [205.0] debug4: eio: handling events for 1 objects
[2018-11-28T04:55:55.105] [205.0] debug3: Called _msg_socket_readable
[2018-11-28T04:55:55.105] [205.0] debug4: eio: handling events for 1 objects
[2018-11-28T04:55:55.105] [205.0] debug3: Called _msg_socket_readable
[2018-11-28T04:55:55.105] [205.0] debug4: eio: handling events for 1 objects
[2018-11-28T04:55:55.105] [205.0] debug3: Called _msg_socket_readable
[2018-11-28T04:55:55.105] [205.0] debug2: Entering _setup_normal_io
[2018-11-28T04:55:55.105] [205.0] debug4: eio: handling events for 1 objects
[2018-11-28T04:55:55.105] [205.0] debug3: Called _msg_socket_readable
[2018-11-28T04:55:55.105] [205.0] debug4: eio: handling events for 1 objects
[2018-11-28T04:55:55.105] [205.0] debug3: Called _msg_socket_readable
[2018-11-28T04:55:55.105] [205.0] debug4: eio: handling events for 1 objects
[2018-11-28T04:55:55.105] [205.0] debug3: Called _msg_socket_readable
Comment 2 Marshall Garey 2018-11-28 14:13:25 MST
I'm looking into this. Could you upload a current slurm.conf file?

Is this a one-off occurrence? Or does pam_slurm_adopt never work?
Comment 4 Yuxing Peng 2018-11-29 09:38:34 MST
The contents of slurm.conf and cgroup.conf are attached at the end of this message. Please note that the error was "permission denied" on the socket file that slurmstepd created. Also, the MUNGE socket file gave "permission denied" as well while I was using an old version of MUNGE (0.5.9); after I upgraded MUNGE to 0.5.11 that error disappeared, and the slurmstepd error appeared instead.

My testing environment is AWS EC2 instances, and all servers use public IPs (for testing, I opened all ports on the firewall). The OS is the Red Hat Enterprise Linux 7 AMI provided by AWS. The socket files do exist (both the MUNGE socket and those in the slurmd spool).

/etc/slurm/slurm.conf

ClusterName=skyway
ControlMachine=skyway.rcc.uchicago.edu
ControlAddr=skyway.rcc.uchicago.edu
SlurmUser=slurm

FastSchedule=1
TreeWidth=50
GresTypes=gpu

AuthType=auth/munge
SwitchType=switch/none
JobCompType=jobcomp/filetxt
JobSubmitPlugins=lua
PrologFlags=contain
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
CryptoType=crypto/munge

SlurmctldPort=6817
SlurmctldTimeout=300
SlurmdPort=6818
SlurmdTimeout=300
SrunPortRange=60001-63000

SlurmctldLogFile=/var/log/slurmctld.log
SlurmctldDebug=info
SlurmdLogFile=/var/log/slurmd.log
SlurmdDebug=info
SlurmdSpoolDir=/var/spool/slurm/slurmd.spool
StateSaveLocation=/var/spool/slurm/slurm.state

AccountingStorageHost=skyway.rcc.uchicago.edu
AccountingStorageEnforce=associations,limits,qos,safe
AccountingStorageTRES=gres/gpu
AccountingStorageType=accounting_storage/slurmdbd



/etc/slurm/cgroup.conf

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxRAMPercent=90
AllowedRAMSpace=90
AllowedSwapSpace=0
MaxSwapPercent=0
Comment 5 Yuxing Peng 2018-11-29 09:43:09 MST
(In reply to Marshall Garey from comment #2)
> I'm looking into this. Could you upload a current slurm.conf file?
> 
> Is this a one-off occurrence? Or does pam_slurm_adopt never work?

Sorry, I should have replied directly to this comment instead of adding a new one; my answers, including the contents of slurm.conf and cgroup.conf, are in comment 4 above. This is not a one-off occurrence: it never works. Also, when I switched to pam_slurm.so, it complained "cannot contact controller", even though connecting to the controller from that node works (ping, squeue, and sinfo all succeed). Is there anything in the OS that could cause a permission denied on a socket file (unix socket)?
Comment 6 Marshall Garey 2018-12-03 17:10:43 MST
Thanks. I just wanted to verify your slurm configuration was correct to eliminate any obvious possibilities. Don't use pam_slurm.so. It's old and not actively maintained.

Yes, I do see that permission denied on the slurmstepd socket file is the current issue.

First, what user is sshd running as? I expect it to be running as root.

Could you run a new job, and then do the following?

ls -ld /path/to/slurmd/spool/dir (it should contain the slurmstepd socket files)
ls -l /path/to/slurmd/spool/dir (it should contain the slurmstepd socket files)

Assuming you have netcat installed, you can connect to a stepd socket like so:

nc -U <socket_name>

<socket_name> will be in the form <nodename>_<jobid>.<stepid>

The extern step ID is 4294967295 and the batch step ID is 4294967294; other step IDs are just the IDs of the srun steps inside your job.
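That mapping can be decoded mechanically; a minimal shell sketch (the helper name is invented for illustration):

```shell
# Map the numeric step ID from a stepd socket name to a readable label.
# 4294967295 is the extern step and 4294967294 the batch step, per above.
step_label() {
    case "$1" in
        4294967295) echo "extern" ;;
        4294967294) echo "batch" ;;
        *)          echo "step $1" ;;
    esac
}

step_label 4294967295   # -> extern
step_label 0            # -> step 0
```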

For example,
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ ls -l
total 24
drwx------ 2 root root     4096 May  7  2018 cpu
-rw------- 1 root root      112 Dec  3 17:02 cred_state
-rw------- 1 root root       64 Dec  3 16:54 cred_state.old
-rw-r--r-- 1 root root     4148 Nov 30 15:45 hwloc_topo_whole.xml
drwxr-x--- 2 root marshall 4096 Dec  3 17:02 job00005
srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.0
srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.4294967294
srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.4294967295

marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ nc -U v1_5.4294967295
asdf

In my slurmd log file, I see the following message indicating that I successfully connected to the socket, but the message I sent ("asdf") was invalid.

[2018-12-03T17:05:21.492] [5.extern] error: First message must be REQUEST_CONNECT
[2018-12-03T17:05:21.492] [5.extern] debug:  Leaving  _handle_accept on an error

The user the sshd session runs as needs write permissions on the socket file and on the path to it. If I don't have write permissions on the socket file or the path to it, I get the permission denied error with nc:

marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ sudo chmod 775 v1_5.4294967295 
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ ls -l
total 24
drwx------ 2 root root     4096 May  7  2018 cpu
-rw------- 1 root root      112 Dec  3 17:02 cred_state
-rw------- 1 root root       64 Dec  3 16:54 cred_state.old
-rw-r--r-- 1 root root     4148 Nov 30 15:45 hwloc_topo_whole.xml
drwxr-x--- 2 root marshall 4096 Dec  3 17:02 job00005
srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.0
srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.4294967294
srwxrwxr-x 1 root root        0 Dec  3 17:02 v1_5.4294967295
marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ nc -U v1_5.4294967295
nc: unix connect failed: Permission denied
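Connecting to a unix socket needs search (x) permission on every directory leading to it and write (w) permission on the socket itself. A rough self-contained check, assuming a POSIX shell (the function name is invented for illustration):

```shell
# Report the first path component the current user cannot search,
# then check write permission on the socket (or any file) itself.
check_sock_access() {
    sock=$1
    dir=$(dirname "$sock")
    p=""
    oldifs=$IFS; IFS=/
    for c in $dir; do
        [ -n "$c" ] || continue      # skip the empty leading component
        p="$p/$c"
        if [ ! -x "$p" ]; then
            IFS=$oldifs
            echo "no search permission: $p"
            return 1
        fi
    done
    IFS=$oldifs
    if [ ! -w "$sock" ]; then
        echo "not writable: $sock"
        return 1
    fi
    echo "ok: $sock"
}
```

Run it as the SSH user against the full socket path to see which component, if any, blocks access (note that root bypasses these permission checks entirely).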
Comment 7 Yuxing Peng 2018-12-04 00:17:56 MST
(In reply to Marshall Garey from comment #6)
> Thanks. I just wanted to verify your slurm configuration was correct to
> eliminate any obvious possibilities. Don't use pam_slurm.so. It's old and
> not actively maintained.
> 
> Yes, I do see that permission denied on the slurmstepd socket file is the
> current issue.
> 
> First, what user is sshd running as? I expect it to be running as root.
> 
> Could you run a new job, and then do the following?
> 
> ls -ld /path/to/slurmd/spool/dir (it should contain the slurmstepd socket
> files)
> ls -l /path/to/slurmd/spool/dir (it should contain the slurmstepd socket
> files)
> 
> Assuming you have netcat installed, you can connect to a stepd socket like
> so:
> 
> nc -U <socket_name>
> 
> <socket_name> will be in the form <nodename>_<jobid>.<stepid>
> 
> The extern step id is 4294967295 while the batch step id is 4294967294, and
> other step id's are just the step id's of your srun steps inside your job.
> 
> For example,
> marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ ls -l
> total 24
> drwx------ 2 root root     4096 May  7  2018 cpu
> -rw------- 1 root root      112 Dec  3 17:02 cred_state
> -rw------- 1 root root       64 Dec  3 16:54 cred_state.old
> -rw-r--r-- 1 root root     4148 Nov 30 15:45 hwloc_topo_whole.xml
> drwxr-x--- 2 root marshall 4096 Dec  3 17:02 job00005
> srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.0
> srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.4294967294
> srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.4294967295
> 
> marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ nc -U v1_5.4294967295
> asdf
> 
> In my slurmd log file, I see the following message indicating that I
> successfully connected to the socket, but the message I sent ("asdf") was
> invalid.
> 
> [2018-12-03T17:05:21.492] [5.extern] error: First message must be
> REQUEST_CONNECT
> [2018-12-03T17:05:21.492] [5.extern] debug:  Leaving  _handle_accept on an
> error
> 
> The user running as sshd needs to have write permissions on the socket file
> and the path to it. If I don't have write permissions on the socket file or
> the path to it, then I'll get the permission denied error with nc:
> 
> marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ sudo chmod 775
> v1_5.4294967295 
> marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ ls -l
> total 24
> drwx------ 2 root root     4096 May  7  2018 cpu
> -rw------- 1 root root      112 Dec  3 17:02 cred_state
> -rw------- 1 root root       64 Dec  3 16:54 cred_state.old
> -rw-r--r-- 1 root root     4148 Nov 30 15:45 hwloc_topo_whole.xml
> drwxr-x--- 2 root marshall 4096 Dec  3 17:02 job00005
> srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.0
> srwxrwxrwx 1 root root        0 Dec  3 17:02 v1_5.4294967294
> srwxrwxr-x 1 root root        0 Dec  3 17:02 v1_5.4294967295
> marshall@voyager:~/slurm/17.11/byu/spool/slurmd-v1$ nc -U v1_5.4294967295
> nc: unix connect failed: Permission denied


Hi Marshall,

Thanks again for your reply. I ran all the tests you suggested and copied the results here:

[root@rcc-aws-t2-micro-001 ~]# ps aux | grep slurmd
root      1430  0.0  0.2 134692  2896 ?        S    06:14   0:00 /usr/sbin/slurmd
root      1611  0.0  0.0 112704   964 pts/1    S+   06:18   0:00 grep --color=auto slurmd

[root@rcc-aws-t2-micro-001 ~]# ps aux | grep slurmstepd
root      1437  0.0  0.2 128208  2556 ?        Sl   06:14   0:00 slurmstepd: [207.extern]
root      1445  0.0  0.2 263392  2784 ?        Sl   06:14   0:00 slurmstepd: [207.0]
root      1613  0.0  0.0 112704   972 pts/1    S+   06:18   0:00 grep --color=auto slurmstepd

[ec2-user@rcc-aws-t2-micro-001 ~]$ ps aux | grep sshd
root      1423  0.0  0.4 112792  4268 ?        Ss   06:14   0:00 /usr/sbin/sshd -D
root      1624  0.0  0.6 168896  6504 ?        Ss   06:20   0:00 sshd: ec2-user [priv]
ec2-user  1627  0.0  0.2 168896  2784 ?        S    06:20   0:00 sshd: ec2-user@pts/1
root      1769  0.7  0.6 168896  6500 ?        Ss   07:03   0:00 sshd: ec2-user [priv]
ec2-user  1772  0.0  0.2 168896  2680 ?        S    07:03   0:00 sshd: ec2-user@pts/2

[root@rcc-aws-t2-micro-001 slurmd.spool]# ls -ld
drwxr-xr-x. 2 slurm slurm 101 Dec  4 06:14 .

[root@rcc-aws-t2-micro-001 slurmd.spool]# ls -l
total 4
-rw-------. 1 root root 84 Dec  4 06:14 cred_state
srwxrwxrwx. 1 root root  0 Dec  4 06:14 rcc-aws-t2-micro-001_207.0
srwxrwxrwx. 1 root root  0 Dec  4 06:14 rcc-aws-t2-micro-001_207.4294967295

[root@rcc-aws-t2-micro-001 slurmd.spool]# nc -U rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.


[root@rcc-aws-t2-micro-001 slurmd.spool]# sudo -u yuxing nc -U rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.


[root@rcc-aws-t2-micro-001 slurmd.spool]# su - yuxing
[yuxing@rcc-aws-t2-micro-001 ~]$ cd /var/spool/slurm/slurmd.spool/
[yuxing@rcc-aws-t2-micro-001 slurmd.spool]$ nc -U rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.


As you can see from the output above, sshd is launched by root, and both root and my user (yuxing) can connect to the socket file that was created. However, when I tried to SSH to this node from another terminal session, I still got permission denied, and the node logged the following in /var/log/secure:

Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug:  Munge authentication plugin loaded
Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug3: Success.
Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug:  _step_connect: connect() failed dir /var/spool/slurm/slurmd.spool node rcc-aws-t2-micro-001 step 207.4294967295 Permission denied
Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: debug3: unable to connect to step 207.4294967295 on rcc-aws-t2-micro-001: Permission denied
Dec  4 07:12:19 ip-172-31-36-153 pam_slurm_adopt[1982]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
Dec  4 07:12:19 ip-172-31-36-153 sshd[1982]: pam_access(sshd:account): access denied for user `yuxing' from `skyway.rcc.uchicago.edu'

Please let me know if there's anything else I can test or provide.

Best regards,

Yuxing
Comment 8 Yuxing Peng 2018-12-04 00:29:02 MST
(In reply to Yuxing Peng from comment #7)

I also found something interesting that may provide some clues about this issue.

[root@rcc-aws-t2-micro-001 ~]# sudo -u yuxing nc -U /var/spool/slurm/slurmd.spool/rcc-aws-t2-micro-001_207.4294967295
asdf
Ncat: Connection reset by peer.

[root@rcc-aws-t2-micro-001 ~]# sudo -u yuxing nc -U rcc-aws-t2-micro-001_207.4294967295
Ncat: Permission denied.

[root@rcc-aws-t2-micro-001 ~]# nc -U rcc-aws-t2-micro-001_207.4294967295
Ncat: No such file or directory.

When I test connecting to the socket as a local user (yuxing), it succeeds with the full path but gets "permission denied" without the path. This is the first time I have seen "permission denied" from the nc command, and I was in my home folder, where the socket file definitely is not. I don't know how Linux still finds the socket file yet returns permission denied.

When I tried the same command as root (again without the file path), I got a different error: No such file or directory.

Hope this information can help.

Best,

Yuxing
Comment 9 Marshall Garey 2018-12-04 15:11:29 MST
RE comment 8 - that's interesting and confusing; I wouldn't expect that. Trying it out myself, I see "No such file or directory" whether I'm user root or marshall.
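A plausible mechanism for the asymmetry, not confirmed in the ticket: a relative socket name is resolved against the caller's current working directory (here root's home, preserved by sudo -u), so the errno depends on whether that user can search the cwd, not on where the socket actually lives. A small reproduction sketch (scratch directory invented for illustration; run as non-root to see the EACCES case):

```shell
# A relative lookup in a searchable cwd that lacks the file gives ENOENT:
d=$(mktemp -d)
cd "$d"
ls no-such-sock 2>&1 | tail -1   # "... No such file or directory"
# chmod 000 on the cwd would turn the same relative lookup into
# "Permission denied" (EACCES) for a non-root user, matching the
# sudo -u behavior quoted above; root bypasses the check and still
# sees ENOENT, also matching.
```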


> [root@rcc-aws-t2-micro-001 slurmd.spool]# ls -ld
> drwxr-xr-x. 2 slurm slurm 101 Dec  4 06:14 .

The slurmd spool directory ownership is slurm:slurm; it should be root:root. My guess is that you get permission denied because root doesn't have write permissions on the directory, but I'm not totally sure. Can you try changing the ownership of that directory to root:root, leaving the permissions the same?



Also, we take our severity levels seriously, and severity 1 and 2 tickets disrupt the normal workflow of the support team, so please use the most accurate severity level for tickets.

Taken from our support page here:
https://www.schedmd.com/support.php

"Severity 2 — High Impact

A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system."

This does not seem like a severity 2 bug. I've changed the ticket to sev-3 again, but please correct me if I'm wrong.

If changing the ownership of the slurmd spool directory to root:root doesn't work, you can always temporarily disable pam_slurm_adopt to allow SSHing to the compute nodes until you get pam_slurm_adopt working.
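Temporarily disabling it usually means commenting the module out of the sshd PAM stack. An illustrative fragment only; the exact file, control flags, and module arguments vary by distribution and by how the module was installed:

```
# /etc/pam.d/sshd (illustrative)
# account    required     pam_slurm_adopt.so
account      required     pam_access.so
```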
Comment 10 Yuxing Peng 2018-12-04 18:26:15 MST
(In reply to Marshall Garey from comment #9)
> RE comment 8 - that's interesting and confusing. I wouldn't expect that.
> Trying it out myself, I see "no such file or directory" whether I'm user
> root or marshall.
> 
> 
> > [root@rcc-aws-t2-micro-001 slurmd.spool]# ls -ld
> > drwxr-xr-x. 2 slurm slurm 101 Dec  4 06:14 .
> 
> The slurmd spool directory ownership is slurm:slurm. It should be root:root.
> I'm guessing that the reason you get permission denied is because root
> doesn't have write permissions on the directory, but I'm not totally sure.
> Can you try changing the ownership of that directory to root:root, leaving
> the permissions the same?
> 
> 
> 
> Also, we take our severity levels seriously, and severity 1 and 2 tickets
> disrupt the normal workflow of the support team. so please use the most
> accurate severity level for the tickets.
> 
> Taken from our support page here:
> https://www.schedmd.com/support.php
> 
> "Severity 2 — High Impact
> 
> A Severity 2 issue is a high-impact problem that is causing sporadic outages
> or is consistently encountered by end users with adverse impact to end user
> interaction with the system."
> 
> This does not seem like a severity 2 bug. I've changed the ticket to sev-3
> again, but please correct me if I'm wrong.
> 
> If the changing ownership of the slurmd spool directory to root:root doesn't
> work, you can always temporarily disable pam_slurm_adopt to allow ssh'ing to
> the compute nodes until you do get pam_slurm_adopt working.

Hi Marshall,

Thanks very much for the help. We finally figured out the issue: the Amazon cloud images turn on SELinux in enforcing mode, which restricts access requests from the login procedure. On most other OS image sources, SELinux defaults to "permissive".
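For future readers hitting the same symptom, a quick way to check whether SELinux is involved; this assumes the libselinux-utils and audit packages and degrades gracefully where they are absent:

```shell
# Print the current SELinux mode; "Enforcing" means denials are fatal.
mode=$(getenforce 2>/dev/null || echo "unavailable")
echo "SELinux mode: $mode"

if [ "$mode" = "Enforcing" ]; then
    # Temporarily confirm SELinux is the cause (revert with: setenforce 1):
    setenforce 0
    # Look at recent AVC denials mentioning slurm:
    ausearch -m avc -ts recent 2>/dev/null | grep -i slurm || true
fi
```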

I hope this case gives you some useful information the next time someone reports a permission denied issue.

This ticket can be closed now.

Best regards,

Yuxing
Comment 11 Marshall Garey 2018-12-04 21:10:54 MST
Thanks very much for the information. That will definitely help us in the future.


FYI, we don't have you listed as a supported contact for Chicago; we only have Mengxing and Stephen listed. In the future, we ask that they be the ones to submit tickets. Or, if they plan on having you submit tickets in the future, please contact Jacob Jenson at jacob@schedmd.com about adding you to the supported contacts.