From bug 3550 comment 18: pam_slurm_adopt failed for us too. I also saw a message like the above. I added a debug statement in the pam_slurm_adopt function _get_job_uid(...), right after the call to stepd_get_uid(...), and saw that the uid is that number too: uid 4294967295. Is it somehow -1 interpreted as 4294967295? 2^32 = 4294967296. We are now running 17.02.1-2. Thanks!
Hey folks - Please file separate issues in cases like this in the future; bug 3550 is still marked closed, and we might overlook responding to it. (We don't monitor the resolved issues as closely, and they don't show up as outstanding in our workflows.) It's also unrelated to the original NERSC problem as best I can tell.

On to my response: -1, when viewed as an unsigned 32-bit integer, is 4294967295. That special value shows up in a few different ways in Slurm. The most common use is as a special case of NO_VAL, used to indicate that a field is unset. It is also used as a "step" number to indicate the resources assigned to the "external" step. That "external" step is used to manage the cgroups that pam_slurm_adopt needs to have access to, and is set up when PrologFlags=contain is set in slurm.conf.

In your particular case, it looks like the job that pam_slurm_adopt was attempting to query was not responding or had just exited. Are you seeing any failures directly related to users not being able to connect? Logs from slurmd on the node from that time may also shed some light on what is happening.

- Tim
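To make the arithmetic above concrete, here is a minimal sketch of how -1 turns into 4294967295 when stored in an unsigned 32-bit field, and back again. The names as_uint32, as_int32, EXTERN_STEP_ID, and describe_step are hypothetical helpers written for this illustration; they are not Slurm identifiers.

```python
import struct

UINT32_MODULUS = 2 ** 32  # 4294967296


def as_uint32(value: int) -> int:
    """Reinterpret an integer as an unsigned 32-bit value (wraps modulo 2^32)."""
    return value % UINT32_MODULUS


def as_int32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    # Pack as unsigned, unpack the same four bytes as signed.
    return struct.unpack("<i", struct.pack("<I", value))[0]


# Hypothetical name for this sketch: the step number that the comment above
# says marks the "external" step, i.e. -1 seen as an unsigned 32-bit integer.
EXTERN_STEP_ID = as_uint32(-1)


def describe_step(job_id: int, step_id: int) -> str:
    """Render job.step, labelling the external step instead of printing 4294967295."""
    if step_id == EXTERN_STEP_ID:
        return f"{job_id}.extern"
    return f"{job_id}.{step_id}"


print(as_uint32(-1))          # 4294967295
print(as_int32(4294967295))   # -1
print(describe_step(243995, 4294967295))
```

The same wrap-around explains why a uid printed as 4294967295 usually means "unset or lookup failed" rather than a real account.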
Created attachment 4289: slurmd log

Thank you Tim, the slurmd log is attached. Yes, there is an error, but srun still places us on the compute node:

$ srun -w c25-03 --x11 --pty /bin/bash
srun: error: x11: unable to connect node c25-03
$ hostname
c25-03

The related rsyslog messages are below:

Apr 4 13:09:33 c25-03 pam_slurm_adopt[149594]: Unable to dlopen libslurm.so.31.0.0: libslurm.so.31.0.0: cannot open shared object file: No such file or directory
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: Reading cgroup.conf file /opt/slurm/etc/cgroup.conf
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: Reading slurm.conf file: /opt/slurm/etc/slurm.conf
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967294
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243995, stepid = 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: Trying to load plugin /opt/slurm/lib64/slurm/auth_munge.so
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: Munge authentication plugin loaded
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: Success.
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: ME=== uid 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: unable to determine uid of step 243995.4294967295 on c25-03
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug: ME=== uid 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: unable to determine uid of step 243977.4294967295 on c25-03
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
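For anyone triaging similar logs, the following is a rough sketch of pulling the job and step IDs out of the "found jobid" debug lines and showing each step ID as a signed 32-bit value as well. The regular expression matches only the message format visible in the log above; found_steps and to_signed32 are hypothetical helper names, not part of Slurm.

```python
import re

# Matches the debug4 lines shown in the log excerpt above.
FOUND_STEP = re.compile(r"found jobid = (\d+), stepid = (\d+)")


def found_steps(log_text: str) -> list[tuple[int, int]]:
    """Return (jobid, stepid) pairs in the order they appear in the log."""
    return [(int(job), int(step)) for job, step in FOUND_STEP.findall(log_text)]


def to_signed32(value: int) -> int:
    """Show an unsigned 32-bit value as the signed integer with the same bits."""
    return value - 2 ** 32 if value >= 2 ** 31 else value


SAMPLE = """\
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967294
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243995, stepid = 4294967295
Apr 4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967295
"""

for job_id, step_id in found_steps(SAMPLE):
    print(job_id, step_id, to_signed32(step_id))
```

Seen this way, the large step IDs in the log are just small negative sentinel values, not real step counts.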
In the /etc/pam.d configuration, when pam_slurm_adopt.so is replaced with pam_slurm.so, that srun command runs without throwing an error, and there is also no error such as "error: x11: unable to read DISPLAY value" in the slurmd log.
(In reply to NYU HPC Team from comment #3)
> In /etc pam.d configuration, when replacing pam_slurm_adopt.so with
> pam_slurm.so, that srun command runs without throwing error, and also there
> is no error as "error: x11: unable to read DISPLAY value" in slurmd log.

Tim, after adding the second patch as suggested in https://bugs.schedmd.com/show_bug.cgi?id=3550, the problem is now fixed. Thank you for your help!