Ticket 6613 - control-c on srun causes pam to prevent login on the interactive job
Summary: control-c on srun causes pam to prevent login on the interactive job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 18.08.3
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-02-28 11:37 MST by david.whiteside
Modified: 2019-04-19 08:01 MDT

See Also:
Site: NREL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 18.08.8
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld.log, I triggered the bug at around 10:26 (14.08 MB, text/plain)
2019-03-13 10:37 MDT, david.whiteside
Details
fix uninitialized flags (2.02 KB, patch)
2019-03-21 09:16 MDT, Dominik Bartkiewicz
Details | Diff

Description david.whiteside 2019-02-28 11:37:47 MST
[USER@el1 ~]$ salloc -A ABC --nodes 1  -t 04:00:00  --partition=standard
salloc: Pending job allocation 604998
salloc: job 604998 queued and waiting for resources
salloc: job 604998 has been allocated resources
salloc: Granted job allocation 604998
srun: Step created for job 604998

[USER@r2i7n24 ~]$ srun sleep 1000
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:604998.1 task 0: running
^Csrun: sending Ctrl-C to job 604998.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.


[USER@r2i7n24 ~]$ echo still here
still here

The interactive shell keeps going fine and the job is still OK, but now you can't ssh to the node.

[USER@el2 ~]$ ssh r2i7n24
Warning: Permanently added 'r2i7n24' (RSA) to the list of known hosts.
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
Comment 1 Dominik Bartkiewicz 2019-03-01 04:20:57 MST
Hi

Could you send me slurm.conf, and slurmd.log from r2i7n24?
Does this happen only if you cancel the extra job step?

Dominik
Comment 2 david.whiteside 2019-03-01 12:48:08 MST
This happens on all nodes.  It only happens when you control-c the srun command; if you let it run to completion you are still able to ssh to the node.

[root@r2i7n24 ~]# cat /nopt/slurm/etc/slurm.conf
# See the slurm.conf man page for more information.
#
ControlMachine=emgmt1
ControlAddr=10.148.7.234
#
#MailProg=/bin/mail
MpiDefault=pmi2
PropagateResourceLimits=NONE
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
#SallocDefaultCommand="srun -n1 -N1 --gres=gpu:0 --pty $SHELL"
SallocDefaultCommand="srun  --pty --preserve-env $SHELL"
#
#
# PROLOGUE & EPILOGUE
JobSubmitPlugins=require_timelimit,lua
PrologSlurmctld=/nopt/slurm/etc/prologslurmctld.sh
Prolog=/nopt/slurm/etc/prolog.d/*
PrologFlags=X11
X11Parameters=local_xauthority
Epilog=/nopt/slurm/etc/epilog.d/*
PrologEpilogTimeout=180
UnkillableStepTimeout=180
#
#
# PRIORITY
#
PriorityType=priority/multifactor
PriorityWeightFairshare=0

# 1000 points per minute for three days, after which you've hit the cap
PriorityMaxAge=3-0
PriorityWeightAge=4320000

PriorityWeightJobSize=184410000
# we've added a placeholder qos with priority 1000 to force the normalized range to stay consistent
# setting a priority of 10 on a qos is the same as a three day head start
PriorityWeightQOS=432000000
PriorityWeightPartition=144000000

# asking for both gpus in a node gives a 1-day head start
#PriorityFlags=MAX_TRES
#PriorityWeightTRES=node=1000000

#
# TIMERS
BatchStartTimeout=30
MessageTimeout=20
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#SchedulerParameters=defer
#TopologyPlugin=topology/tree
JobRequeue=0
#
#
#NHC
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
HealthCheckNodeState=CYCLE,ANY


# LOGGING AND ACCOUNTING
ClusterName=eagle
AccountingStorageHost=emgmt1
AccountingStorageEnforce=safe,qos,limits
AccountingStorageUser=slurm
JobAcctGatherFrequency=task=30,energy=30
AcctGatherNodeFreq=30
AccountingStorageType=accounting_storage/slurmdbd
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss
#AcctGatherEnergyType=acct_gather_energy/ipmi
AcctGatherFilesystemType=acct_gather_filesystem/lustre
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
GresTypes=gpu
AccountingStorageTRES=gres/gpu,fs/lustre
TmpFS=/tmp/scratch
EnforcePartLimits=ALL

#SSK testing
# bf_window should usually match the largest MaxTime on any partition
SchedulerParameters=max_rpc_cnt=256,MessageTimeout=30,bf_window=14400,bf_resolution=600

#
#
# COMPUTE NODES
NodeName=DEFAULT Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN
NodeName=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35] RealMemory=92000 TmpDisk=900000
NodeName=r7i[0-7]n[0-35] RealMemory=184000 TmpDisk=900000
NodeName=r102u[01-36],r103u[35-36] RealMemory=770000 TmpDisk=1500000
NodeName=r103u[25-34] RealMemory=770000 TmpDisk=24000000
NodeName=r103u[01,03,05,07,09,11,13,15,17,19] RealMemory=770000 Gres=gpu:v100:2 TmpDisk=24000000
NodeName=r103u[21,23],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31,33],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31,33] RealMemory=770000 Gres=gpu:v100:2 TmpDisk=1500000

PartitionName=DEFAULT OverSubscribe=EXCLUSIVE State=UP PriorityTier=1 PriorityJobFactor=0 MaxTime=2-0
PartitionName=short Nodes=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35],r7i[0-6]n[0-35],r7i7n[0-28],r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] MaxTime=0-4 QOS=p_short PriorityTier=2
PartitionName=standard Nodes=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35],r7i[0-6]n[0-35],r7i7n[0-28],r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] QOS=p_half
PartitionName=long Nodes=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35],r7i[0-6]n[0-35],r7i7n[0-28],r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] MaxTime=10-0 QOS=p_long
PartitionName=bigmem Nodes=r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] PriorityJobFactor=10 QOS=p_bigmem_half
PartitionName=gpu Nodes=r103u[01,03,05,07,09,11,13,15,17,19],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] PriorityJobFactor=10 QOS=p_gpu_half
PartitionName=bigscratch Nodes=r103u[01,03,05,07,09,11,13,15,17,19,25-34] PriorityJobFactor=10 QOS=p_bigscratch_half
PartitionName=debug Nodes=r7i7n[29-35],r102u[35-36],r103u21,r103u23,r104u33,r105u33 MaxTime=0-24 QOS=debug PriorityTier=3
PartitionName=placeholder Hidden=YES PriorityJobFactor=1000

PartitionName=ddn Nodes=r7i6n[5-8,13-17],r6i7n[5-8,13-17],r2i5n[5-8,13-17],r3i2n[18-22,27-30],r4i1n[5-8,13-17],r7i2n[23-26,31-35],r3i7n[5-8,13-17],r2i6n[5-8,13-17],r4i3n[5-8,13-17],r4i2n[5-8,13-17],r3i1n[5-8,13-17],r3i0n[5-8,13-17],r3i5n[5-8,13-17],r3i4n[5-8,13-17],r3i3n[5-8,13-17],r3i2n[5-8,13-17],r6i1n[5-8,13-17],r3i6n[5-8,13-17],r6i3n[5-8,13-17],r6i6n[18-22,27-30],r6i5n[5-8,13-17],r6i4n[5-8,13-17],r6i7n[23-26,31-35],r6i6n[5-8,13-17],r7i1n[18-22,27-30],r7i0n[5-8,13-17],r7i3n[5-8,13-17],r7i2n[18-22,27-30],r7i5n[18-22,27-30],r7i5n[5-8,13-17] Hidden=YES
Comment 3 david.whiteside 2019-03-01 12:51:18 MST
I am not sure whether I found the full slurmd log; below is what I have.

[root@r2lead HOSTS]# grep slurmd r2i7n24*
Feb 28 11:34:16 r2i7n24 slurmd[3419]: launch task 604998.0 request from UID:120005 GID:120005 HOST:10.148.7.240 PORT:38121
Feb 28 11:34:16 r2i7n24 slurmd[3419]: lllp_distribution jobid [604998] auto binding off: mask_cpu,one_thread
Feb 28 11:34:17 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1275274
Feb 28 11:34:17 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 604998 ran for 1 seconds
Feb 28 11:34:25 r2i7n24 slurmd[3419]: launch task 604998.1 request from UID:120005 GID:120005 HOST:10.148.2.10 PORT:29404
Feb 28 11:34:25 r2i7n24 slurmd[3419]: lllp_distribution jobid [604998] auto binding off: mask_cpu,one_thread
Feb 28 15:47:07 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1273840
Feb 28 15:47:07 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605308 ran for 1 seconds
Feb 28 15:47:08 r2i7n24 slurmd[3419]: launch task 605308.0 request from UID:121348 GID:121348 HOST:10.148.0.65 PORT:4814
Feb 28 15:47:08 r2i7n24 slurmd[3419]: lllp_distribution jobid [605308] auto binding off: mask_cpu,one_thread
Feb 28 17:31:00 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1275786
Feb 28 17:31:00 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605429 ran for 2 seconds
Feb 28 17:31:02 r2i7n24 slurmd[3419]: launch task 605429.0 request from UID:120122 GID:120122 HOST:10.148.0.65 PORT:9929
Feb 28 17:31:02 r2i7n24 slurmd[3419]: lllp_distribution jobid [605429] auto binding off: mask_cpu,one_thread
Mar  1 10:51:41 r2i7n24 slurmd[3419]: task_p_slurmd_batch_request: 605731
Mar  1 10:51:41 r2i7n24 slurmd[3419]: task/affinity: job 605731 CPU input mask for node: 0xFFFFFFFFF
Mar  1 10:51:41 r2i7n24 slurmd[3419]: task/affinity: job 605731 CPU final HW mask for node: 0xFFFFFFFFF
Mar  1 10:51:42 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1275539
Mar  1 10:51:42 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605731 ran for 1 seconds
Mar  1 10:51:42 r2i7n24 slurmd[3419]: Launching batch job 605731 for UID 122245
Mar  1 10:57:01 r2i7n24 slurmd[3419]: task_p_slurmd_batch_request: 605746
Mar  1 10:57:01 r2i7n24 slurmd[3419]: task/affinity: job 605746 CPU input mask for node: 0xFFFFFFFFF
Mar  1 10:57:01 r2i7n24 slurmd[3419]: task/affinity: job 605746 CPU final HW mask for node: 0xFFFFFFFFF
Mar  1 10:57:02 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1274792
Mar  1 10:57:02 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605746 ran for 1 seconds
Mar  1 10:57:02 r2i7n24 slurmd[3419]: Launching batch job 605746 for UID 122245
Mar  1 10:58:13 r2i7n24 slurmd[3419]: task_p_slurmd_batch_request: 605766
Mar  1 10:58:13 r2i7n24 slurmd[3419]: task/affinity: job 605766 CPU input mask for node: 0xFFFFFFFFF
Mar  1 10:58:13 r2i7n24 slurmd[3419]: task/affinity: job 605766 CPU final HW mask for node: 0xFFFFFFFFF
Mar  1 10:58:15 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1274360
Mar  1 10:58:15 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605766 ran for 2 seconds
Mar  1 10:58:15 r2i7n24 slurmd[3419]: Launching batch job 605766 for UID 130756
Comment 4 Dominik Bartkiewicz 2019-03-05 04:45:47 MST
Hi

I can't recreate this locally.
Could you send me your slurmctld.log and pam config?

Dominik
Comment 5 david.whiteside 2019-03-13 10:30:00 MDT
[dwhitesi@el1 ~]$ salloc -A csc000 --nodes 1  -t 04:00:00  --partition=standard
salloc: Granted job allocation 627471
srun: Step created for job 627471
[dwhitesi@r1i1n19 ~]$ srun sleep 6000
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:627471.1 task 0: running
^Csrun: sending Ctrl-C to job 627471.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

[dwhitesi@r1i1n19 ~]$ ^C
[dwhitesi@r1i1n19 ~]$ ^C
[dwhitesi@r1i1n19 ~]$ Shell still active

[dwhitesi@el1 ~]$ ssh r1i1n19
Warning: Permanently added 'r1i1n19' (RSA) to the list of known hosts.
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

[root@r1i1n19 ~]# cat /etc/pam.d/sshd
#%PAM-1.0
auth	   required	pam_sepermit.so
auth       substack     password-auth
auth       include      postlogin
# Used with polkit to reauthorize users in remote sessions
-auth      optional     pam_reauthorize.so prepare
account    required     pam_nologin.so
account    include      password-auth
account    requisite    pam_slurm_adopt.so
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    include      password-auth
session    include      postlogin
# Used with polkit to reauthorize users in remote sessions
-session   optional     pam_reauthorize.so prepare

Going to attach the slurmctld.log in a minute.....
Comment 6 david.whiteside 2019-03-13 10:37:17 MDT
Created attachment 9560 [details]
slurmctld.log, I triggered the bug at around 10:26
Comment 7 Dominik Bartkiewicz 2019-03-18 07:49:25 MDT
Hi

Could you send me /etc/pam.d/password-auth?
If you have a line like "-session     optional      pam_systemd.so" in this file, could you try stopping systemd-logind and check if this helps?
systemctl stop systemd-logind

Could you increase the debug level of slurmd and slurmstepd to debug3, and set log_level for pam_slurm_adopt.so in /etc/pam.d/sshd? Then repeat your test.

Dominik
Comment 8 david.whiteside 2019-03-20 15:56:30 MDT
The systemd line is commented out; I still tried stopping logind, but it didn’t help.


Can you provide more details on how I would set log_level for pam_slurm_adopt.so in /etc/pam.d/sshd?

[dwhitesi@r1i5n28 ~]$ cat  /etc/pam.d/password-auth
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        sufficient    pam_unix.so try_first_pass nullok
auth        required      pam_deny.so

account     required      pam_unix.so

password    requisite     pam_pwquality.so try_first_pass local_users_only retry=3 authtok_type=
password    sufficient    pam_unix.so try_first_pass use_authtok nullok sha512 shadow
password    required      pam_deny.so

session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
#-session     optional      pam_systemd.so
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      pam_unix.so

The test was:
[d@el1 ~]$ salloc -A XXXXX --nodes 1 -t 04:00:00
salloc: Granted job allocation 641839
srun: Step created for job 641839
[dwhitesi@r1i5n28 ~]$ srun sleep 600
^Csrun: interrupt (one more within 1 sec to abort)       <--  I pressed control-c twice in a row quickly
srun: step:641839.1 task 0: running
^Csrun: sending Ctrl-C to job 641839.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[dwhitesi@r1i5n28 ~]$

Then from another login shell:
[dwhitesi@el1 ~]$ ssh r1i5n28
Warning: Permanently added 'r1i5n28' (RSA) to the list of known hosts.
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

I did the test with the increased log level, this is what I see.
slurmd: debug2: _group_cache_lookup_internal: no entry found for dwhitesi
slurmd: debug:  task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd: debug3: task/affinity: slurmctld s 2 c 18; hw s 2 c 18 t 1
slurmd: debug3: task/affinity: job 641862.1 core mask from slurmctld: 0xFFFFFFFFF
slurmd: debug3: task/affinity: job 641862.1 CPU final mask for local node: 0xFFFFFFFFF
slurmd: debug3: _lllp_map_abstract_masks
slurmd: debug:  binding tasks:1 to nodes:1 sockets:2:0 cores:36:0 threads:36
slurmd: lllp_distribution jobid [641862] auto binding off: mask_cpu,one_thread
slurmd: debug:  task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0xFFFFFFFFF)
slurmd: debug:  Waiting for job 641862's prolog to complete
slurmd: debug:  Finished wait for job 641862's prolog to complete
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (r1i3n18), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug:  task_p_slurmd_reserve_resources: 641862
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug4: found jobid = 641862, stepid = 1
slurmd: debug4: found jobid = 641862, stepid = 0
slurmd: debug4: found jobid = 641862, stepid = 4294967295
slurmd: debug:  _rpc_signal_tasks: sending signal 2 to entire job 641862 flag 255
slurmd: debug4: found jobid = 641862, stepid = 1
slurmd: debug4: found jobid = 641862, stepid = 0
slurmd: debug4: found jobid = 641862, stepid = 4294967295
slurmd: debug2: container signal 2 to job 641862.1
slurmd: debug2: container signal 2 to job 641862.0
slurmd: debug2: container signal 2 to job 641862.4294967295
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug4: found jobid = 641862, stepid = 1
slurmd: debug4: found jobid = 641862, stepid = 0
slurmd: debug4: found jobid = 641862, stepid = 4294967295
slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 641862.1 flag 0
slurmd: debug2: container signal 9 to job 641862.1
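An annotation on the trace above (mine, not part of the original comment): stepid 4294967295 is the unsigned sentinel 0xFFFFFFFF, which Slurm uses to mark the job's extern step container — the very step pam_slurm_adopt needs to stay alive to admit an ssh session. The log shows the Ctrl-C SIGINT (signal 2, flag 255) being delivered to that step as well:

```shell
# 4294967295 in "container signal 2 to job 641862.4294967295" is just
# 0xFFFFFFFF printed as an unsigned 32-bit value, i.e. the sentinel
# step id Slurm assigns to the extern step container.
extern_step=$(printf '%u' 0xFFFFFFFF)
echo "$extern_step"    # prints 4294967295
```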

Comment 9 david.whiteside 2019-03-20 16:12:23 MDT
I looked at the source for pam_slurm_adopt.so and figured it out; below is what it logged with log_level=debug5.

Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: debug:  Reading cgroup.conf file /nopt/slurm/etc/cgroup.conf
Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: debug:  Reading slurm.conf file: /nopt/slurm/etc/slurm.conf
Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: debug4: found jobid = 641891, stepid = 0
Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
Mar 20 16:10:30 r5i3n2 sshd[282344]: fatal: Access denied for user dwhitesi by PAM account configuration [preauth]
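For readers hitting the same question as comment 8 (my reconstruction of the relevant line, not a verbatim copy of the site's file): log_level is passed as a module argument on the pam_slurm_adopt entry in /etc/pam.d/sshd, e.g.:

```
account    requisite    pam_slurm_adopt.so log_level=debug5
```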


 - David


Comment 10 Dominik Bartkiewicz 2019-03-21 07:57:28 MDT
Hi

I can't recreate exactly the same behavior, but with your config it is possible for step 0 to start before the extern step is created.

Could you try adding "salloc_wait_nodes" to your SchedulerParameters, or (easier for testing) adding "--wait-all-nodes=1" to your salloc options?

Could you try to find the slurmstepd log?
Could you check whether the extern step exists in the allocation before running srun?

e.g.:
[bart@centos01 ~]$ ps -ef |grep stepd
root       372     1  0 14:03 ?        00:00:00 slurmstepd: [719.extern]
root       378     1  0 14:03 ?        00:00:00 slurmstepd: [719.0]
bart       424   384  0 14:03 pts/1    00:00:00 grep --color=auto stepd

Dominik
Comment 12 Dominik Bartkiewicz 2019-03-21 09:22:49 MDT
I think I have found the cause of this problem.
We are working on the fix.

Dominik
Comment 13 Matt Bidwell 2019-03-22 06:50:36 MDT
FYI, I see this in batch jobs as well. I'm running pdsh at the beginning and end of my benchmark jobs to grab some processor info. I cleared out my .ssh/known_hosts, but am getting the following on some of the hosts (not all):
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
Comment 14 Dominik Bartkiewicz 2019-03-22 07:33:33 MDT
Hi

Similar to comment 10: could you check if this still occurs after adding "--wait-all-nodes=1" to sbatch?
Could you also check whether the extern step exists on those nodes?

Dominik
Comment 15 Matt Bidwell 2019-03-22 07:41:00 MDT
Sorry, that's the problem with doing tickets early in the morning. I have set 
#SBATCH --wait-all-nodes=1
in the submission script and still have the issue.
And if it isn't clear, this is the same system.
Comment 16 Dominik Bartkiewicz 2019-03-27 09:29:01 MDT
Hi

Let's try to fix the first issue first.
Could you try applying attachment 9654 [details] and check if it fixes the issue from the initial comment?

Dominik
Comment 20 david.whiteside 2019-04-05 11:15:58 MDT
What would need to be recompiled? slurmd, slurmctld, the pam module?

 - David
Comment 21 Dominik Bartkiewicz 2019-04-08 00:57:24 MDT
Hi

You need to recompile srun and the Slurm libraries.

Dominik
Comment 23 david.whiteside 2019-04-10 20:00:55 MDT
Because it's a production cluster I had to test carefully: I applied the patch to a locally compiled Slurm (in my home directory) and tested from there.  I did the below, and it did not resolve the issue.

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dwhitesi/slurm-slurm-18-08-3-1/src/api/.libs/

$ salloc -n1 --account=csc000 --time=30

[dwhitesi@r1i0n3 slurm-slurm-18-08-3-1]$ /home/dwhitesi/slurm-slurm-18-08-3-1/src/srun/.libs/srun sleep 1000
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:670011.1 task 0: running
^Csrun: sending Ctrl-C to job 670011.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

From another shell...
$ ssh r1i0n3
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
Comment 24 david.whiteside 2019-04-11 17:21:33 MDT
We finally got a test cluster set up and were able to reproduce the issue on it.  Recompiling all of Slurm with the patch (not just srun) seems to have fixed the issue.

Thanks!
Comment 25 Dominik Bartkiewicz 2019-04-12 04:48:59 MDT
(In reply to david.whiteside from comment #24)
> We finally got a test cluster setup and were able to reproduce the issue on
> it.  When I recompiled all of slurm using the patch (not just srun), it
> seems to have fixed the issue.
> 
> Thanks!
Hi

I'm glad to hear that, because I had run out of other ideas for how to fix it.
We have found other places in the API where similar issues can occur. The final patch committed to the repo will look slightly different; I will keep you informed of progress.

Dominik
Comment 31 Dominik Bartkiewicz 2019-04-19 07:10:58 MDT
Hi

We have finally committed patches that should solve this issue.
https://github.com/SchedMD/slurm/commit/7fa46453ccdc83
https://github.com/SchedMD/slurm/commit/4728d0e816f92d
Those patches will be included in 18.08.8.
Let me know if we can close this bug now.

Dominik
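A side note, not from the ticket itself: since the fix first ships in 18.08.8, a site script can gate on the reported version string. A minimal sketch of the comparison using sort -V (the example value "18.08.3" stands in for the output of srun --version):

```shell
have="18.08.3"    # e.g. "$(srun --version | awk '{print $2}')"
fixed="18.08.8"   # first release carrying the patches above
# sort -V orders version strings numerically; if $fixed sorts first,
# the installed version is at least 18.08.8 and carries the fix.
lowest=$(printf '%s\n' "$fixed" "$have" | sort -V | head -n1)
if [ "$lowest" = "$fixed" ]; then status=fixed; else status=affected; fi
echo "$status"    # 18.08.3 predates the fix -> prints "affected"
```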
Comment 32 david.whiteside 2019-04-19 07:57:28 MDT
Thanks.  You can close this bug.


 - David
