| Summary: | NYU pam_slurm_adopt problem | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Tim Wickberg <tim> |
| Component: | Other | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | hpc-staff |
| Version: | 17.02.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NYU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmd log | ||
Description
Tim Wickberg
2017-04-04 10:33:50 MDT
Hey folks - Please file separate issues in cases like this in the future; bug 3550 is still marked closed, and we might overlook responding to it. (We don't monitor the resolved issues as closely, and they don't show up as outstanding in our workflows.) It's also unrelated to the original NERSC problem as best I can tell.

On to my response:

-1, when viewed as an unsigned 32-bit integer, is 4294967295. That special value shows up in a few different ways in Slurm. The most common use is as a special case of NO_VAL, used to indicate that a field is unset. It is also used as a "step" number to indicate the resources assigned to the "external" step. That "external" step is used to manage the cgroups that pam_slurm_adopt needs to have access to, and is set up when PrologFlags=contain is set in slurm.conf.

In your particular case, it looks like the job that pam_slurm_adopt was attempting to query was not responding or had just exited. Are you seeing any failures directly related to users not being able to connect? Logs from slurmd on the node from that time may also shed some light on what is happening.

- Tim

Created attachment 4289
slurmd log
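The point above about -1 becoming 4294967295 can be checked directly. The following is a minimal Python sketch of that reinterpretation only; it is not Slurm code, and the variable names are illustrative.

```python
import ctypes
import struct

# Reinterpret the signed value -1 as an unsigned 32-bit integer.
as_unsigned = ctypes.c_uint32(-1).value

# The same result via bit masking, and via a pack/unpack round trip
# (pack as signed 32-bit, unpack the same bytes as unsigned 32-bit).
masked = -1 & 0xFFFFFFFF
(round_trip,) = struct.unpack("<I", struct.pack("<i", -1))

print(as_unsigned, masked, round_trip)
```

All three expressions yield the same number, which is why a step ID or uid of -1 is printed as 4294967295 when logged as an unsigned value.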
Thank you, Tim. The slurmd log is attached. Yes, there is an error, but srun still places us on the compute node:
$ srun -w c25-03 --x11 --pty /bin/bash
srun: error: x11: unable to connect node c25-03
$ hostname
c25-03
The related rsyslog messages are below:
Apr 4 13:09:33 c25-03 pam_slurm_adopt[149594]: Unable to dlopen libslurm.so.31.0.0: libslurm.so.31.0.0: cannot open shared object file: No such file or directory
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: Reading cgroup.conf file /opt/slurm/etc/cgroup.conf
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: Reading slurm.conf file: /opt/slurm/etc/slurm.conf
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967294
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243995, stepid = 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: Trying to load plugin /opt/slurm/lib64/slurm/auth_munge.so
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: Munge authentication plugin loaded
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: Success.
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: ME=== uid 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: unable to determine uid of step 243995.4294967295 on c25-03
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: ME=== uid 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: unable to determine uid of step 243977.4294967295 on c25-03
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
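To make the step IDs in the log excerpt above easier to read, here is a small Python sketch that extracts the `found jobid = ..., stepid = ...` entries and labels the special values. It is a hypothetical helper, not part of Slurm or pam_slurm_adopt. The "extern" label for 4294967295 follows the explanation earlier in this thread; the "batch" label for 4294967294 is an assumption about this Slurm release and is marked as such in the code.

```python
import re

# Step ID that this thread identifies as the "external" step (-1 as uint32).
EXTERN_STEP = 0xFFFFFFFF
# Assumption: 0xFFFFFFFE marks the batch script step in this Slurm release.
BATCH_STEP_ASSUMED = 0xFFFFFFFE

# Matches the debug4 lines emitted by pam_slurm_adopt in the excerpt above.
FOUND_RE = re.compile(r"found jobid = (\d+), stepid = (\d+)")


def label_step(step_id):
    """Return a readable label for a numeric step ID."""
    if step_id == EXTERN_STEP:
        return "extern"
    if step_id == BATCH_STEP_ASSUMED:
        return "batch (assumed)"
    return str(step_id)


def found_steps(log_text):
    """Return (job_id, step_label) pairs in the order they appear in the log."""
    return [(int(job), label_step(int(step)))
            for job, step in FOUND_RE.findall(log_text)]


# Three lines copied from the rsyslog excerpt, without the syslog prefix.
SAMPLE = """\
debug4: found jobid = 243977, stepid = 4294967294
debug4: found jobid = 243995, stepid = 4294967295
debug4: found jobid = 243977, stepid = 4294967295
"""

for job_id, label in found_steps(SAMPLE):
    print(f"{job_id}.{label}")
```

Read this way, the two "unable to determine uid" messages both refer to external steps (jobs 243995 and 243977), which matches the steps pam_slurm_adopt would need to query before adopting the connection.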
In the /etc/pam.d configuration, when pam_slurm_adopt.so is replaced with pam_slurm.so, that srun command runs without throwing an error, and there is also no error such as "error: x11: unable to read DISPLAY value" in the slurmd log.

(In reply to NYU HPC Team from comment #3)
> In the /etc/pam.d configuration, when pam_slurm_adopt.so is replaced with
> pam_slurm.so, that srun command runs without throwing an error, and there
> is also no error such as "error: x11: unable to read DISPLAY value" in the
> slurmd log.

Tim, after adding the second patch as suggested in https://bugs.schedmd.com/show_bug.cgi?id=3550, the problem is now fixed. Thank you for your help!