| Summary: | control-c on srun causes pam to prevent login on the interactive job | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | david.whiteside |
| Component: | Other | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | matt.bidwell |
| Version: | 18.08.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NREL | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 18.08.8 |
| Attachments: | slurmctld.log (I triggered the bug at around 10:26); fix uninitialized flags |
Description
david.whiteside 2019-02-28 11:37:47 MST
Hi

Could you send me slurm.conf and slurmd.log from r2i7n24? Does this happen only if you cancel the extra job step?

Dominik

This happens on all nodes. It only happens when you control-c the srun command; if you let it run until completion you are still able to ssh to the node.

[root@r2i7n24 ~]# cat /nopt/slurm/etc/slurm.conf
# See the slurm.conf man page for more information.
#
ControlMachine=emgmt1
ControlAddr=10.148.7.234
#
#MailProg=/bin/mail
MpiDefault=pmi2
PropagateResourceLimits=NONE
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
#SallocDefaultCommand="srun -n1 -N1 --gres=gpu:0 --pty $SHELL"
SallocDefaultCommand="srun --pty --preserve-env $SHELL"
#
#
# PROLOGUE & EPILOGUE
JobSubmitPlugins=require_timelimit,lua
PrologSlurmctld=/nopt/slurm/etc/prologslurmctld.sh
Prolog=/nopt/slurm/etc/prolog.d/*
PrologFlags=X11
X11Parameters=local_xauthority
Epilog=/nopt/slurm/etc/epilog.d/*
PrologEpilogTimeout=180
UnkillableStepTimeout=180
#
#
# PRIORITY
#
PriorityType=priority/multifactor
PriorityWeightFairshare=0
# 1000 points per minute for three days, after which you've hit the cap
PriorityMaxAge=3-0
PriorityWeightAge=4320000
PriorityWeightJobSize=184410000
# we've added a placeholder qos with priority 1000 to force the normalized range to stay consistent
# setting a priority of 10 on a qos is the same as a three day head start
PriorityWeightQOS=432000000
PriorityWeightPartition=144000000
# asking for both gpus in a node gives a 1-day head start
#PriorityFlags=MAX_TRES
#PriorityWeightTRES=node=1000000
#
# TIMERS
BatchStartTimeout=30
MessageTimeout=20
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#SchedulerParameters=defer
#TopologyPlugin=topology/tree
JobRequeue=0
#
#
#NHC
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
HealthCheckNodeState=CYCLE,ANY
#
# LOGGING AND ACCOUNTING
ClusterName=eagle
AccountingStorageHost=emgmt1
AccountingStorageEnforce=safe,qos,limits
AccountingStorageUser=slurm
JobAcctGatherFrequency=task=30,energy=30
AcctGatherNodeFreq=30
AccountingStorageType=accounting_storage/slurmdbd
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss
#AcctGatherEnergyType=acct_gather_energy/ipmi
AcctGatherFilesystemType=acct_gather_filesystem/lustre
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
GresTypes=gpu
AccountingStorageTRES=gres/gpu,fs/lustre
TmpFS=/tmp/scratch
EnforcePartLimits=ALL
#SSK testing
# bf_window should usually match the largest MaxTime on any partition
SchedulerParameters=max_rpc_cnt=256,MessageTimeout=30,bf_window=14400,bf_resolution=600
#
#
# COMPUTE NODES
NodeName=DEFAULT Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN
NodeName=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35] RealMemory=92000 TmpDisk=900000
NodeName=r7i[0-7]n[0-35] RealMemory=184000 TmpDisk=900000
NodeName=r102u[01-36],r103u[35-36] RealMemory=770000 TmpDisk=1500000
NodeName=r103u[25-34] RealMemory=770000 TmpDisk=24000000
NodeName=r103u[01,03,05,07,09,11,13,15,17,19] RealMemory=770000 Gres=gpu:v100:2 TmpDisk=24000000
NodeName=r103u[21,23],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31,33],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31,33] RealMemory=770000 Gres=gpu:v100:2 TmpDisk=1500000
PartitionName=DEFAULT OverSubscribe=EXCLUSIVE State=UP PriorityTier=1 PriorityJobFactor=0 MaxTime=2-0
PartitionName=short Nodes=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35],r7i[0-6]n[0-35],r7i7n[0-28],r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] MaxTime=0-4 QOS=p_short PriorityTier=2
PartitionName=standard Nodes=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35],r7i[0-6]n[0-35],r7i7n[0-28],r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] QOS=p_half
PartitionName=long Nodes=r1i[0-7]n[0-35],r2i[0-7]n[0-35],r3i[0-7]n[0-35],r4i[0-7]n[0-35],r5i[0-7]n[0-35],r6i[0-7]n[0-35],r7i[0-6]n[0-35],r7i7n[0-28],r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] MaxTime=10-0 QOS=p_long
PartitionName=bigmem Nodes=r102u[01-34],r103u[25-36],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] PriorityJobFactor=10 QOS=p_bigmem_half
PartitionName=gpu Nodes=r103u[01,03,05,07,09,11,13,15,17,19],r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31],r105u[01,03,05,07,09,11,13,15,17,19,21,23,25,27,29,31] PriorityJobFactor=10 QOS=p_gpu_half
PartitionName=bigscratch Nodes=r103u[01,03,05,07,09,11,13,15,17,19,25-34] PriorityJobFactor=10 QOS=p_bigscratch_half
PartitionName=debug Nodes=r7i7n[29-35],r102u[35-36],r103u21,r103u23,r104u33,r105u33 MaxTime=0-24 QOS=debug PriorityTier=3
PartitionName=placeholder Hidden=YES PriorityJobFactor=1000
PartitionName=ddn Nodes=r7i6n[5-8,13-17],r6i7n[5-8,13-17],r2i5n[5-8,13-17],r3i2n[18-22,27-30],r4i1n[5-8,13-17],r7i2n[23-26,31-35],r3i7n[5-8,13-17],r2i6n[5-8,13-17],r4i3n[5-8,13-17],r4i2n[5-8,13-17],r3i1n[5-8,13-17],r3i0n[5-8,13-17],r3i5n[5-8,13-17],r3i4n[5-8,13-17],r3i3n[5-8,13-17],r3i2n[5-8,13-17],r6i1n[5-8,13-17],r3i6n[5-8,13-17],r6i3n[5-8,13-17],r6i6n[18-22,27-30],r6i5n[5-8,13-17],r6i4n[5-8,13-17],r6i7n[23-26,31-35],r6i6n[5-8,13-17],r7i1n[18-22,27-30],r7i0n[5-8,13-17],r7i3n[5-8,13-17],r7i2n[18-22,27-30],r7i5n[18-22,27-30],r7i5n[5-8,13-17] Hidden=YES

I am not sure if I found the full slurmd log; below is what I have.

[root@r2lead HOSTS]# grep slurmd r2i7n24*
Feb 28 11:34:16 r2i7n24 slurmd[3419]: launch task 604998.0 request from UID:120005 GID:120005 HOST:10.148.7.240 PORT:38121
Feb 28 11:34:16 r2i7n24 slurmd[3419]: lllp_distribution jobid [604998] auto binding off: mask_cpu,one_thread
Feb 28 11:34:17 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1275274
Feb 28 11:34:17 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 604998 ran for 1 seconds
Feb 28 11:34:25 r2i7n24 slurmd[3419]: launch task 604998.1 request from UID:120005 GID:120005 HOST:10.148.2.10 PORT:29404
Feb 28 11:34:25 r2i7n24 slurmd[3419]: lllp_distribution jobid [604998] auto binding off: mask_cpu,one_thread
Feb 28 15:47:07 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1273840
Feb 28 15:47:07 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605308 ran for 1 seconds
Feb 28 15:47:08 r2i7n24 slurmd[3419]: launch task 605308.0 request from UID:121348 GID:121348 HOST:10.148.0.65 PORT:4814
Feb 28 15:47:08 r2i7n24 slurmd[3419]: lllp_distribution jobid [605308] auto binding off: mask_cpu,one_thread
Feb 28 17:31:00 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1275786
Feb 28 17:31:00 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605429 ran for 2 seconds
Feb 28 17:31:02 r2i7n24 slurmd[3419]: launch task 605429.0 request from UID:120122 GID:120122 HOST:10.148.0.65 PORT:9929
Feb 28 17:31:02 r2i7n24 slurmd[3419]: lllp_distribution jobid [605429] auto binding off: mask_cpu,one_thread
Mar 1 10:51:41 r2i7n24 slurmd[3419]: task_p_slurmd_batch_request: 605731
Mar 1 10:51:41 r2i7n24 slurmd[3419]: task/affinity: job 605731 CPU input mask for node: 0xFFFFFFFFF
Mar 1 10:51:41 r2i7n24 slurmd[3419]: task/affinity: job 605731 CPU final HW mask for node: 0xFFFFFFFFF
Mar 1 10:51:42 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1275539
Mar 1 10:51:42 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605731 ran for 1 seconds
Mar 1 10:51:42 r2i7n24 slurmd[3419]: Launching batch job 605731 for UID 122245
Mar 1 10:57:01 r2i7n24 slurmd[3419]: task_p_slurmd_batch_request: 605746
Mar 1 10:57:01 r2i7n24 slurmd[3419]: task/affinity: job 605746 CPU input mask for node: 0xFFFFFFFFF
Mar 1 10:57:01 r2i7n24 slurmd[3419]: task/affinity: job 605746 CPU final HW mask for node: 0xFFFFFFFFF
Mar 1 10:57:02 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1274792
Mar 1 10:57:02 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605746 ran for 1 seconds
Mar 1 10:57:02 r2i7n24 slurmd[3419]: Launching batch job 605746 for UID 122245
Mar 1 10:58:13 r2i7n24 slurmd[3419]: task_p_slurmd_batch_request: 605766
Mar 1 10:58:13 r2i7n24 slurmd[3419]: task/affinity: job 605766 CPU input mask for node: 0xFFFFFFFFF
Mar 1 10:58:13 r2i7n24 slurmd[3419]: task/affinity: job 605766 CPU final HW mask for node: 0xFFFFFFFFF
Mar 1 10:58:15 r2i7n24 slurmd[3419]: _run_prolog: run job script took usec=1274360
Mar 1 10:58:15 r2i7n24 slurmd[3419]: _run_prolog: prolog with lock for job 605766 ran for 2 seconds
Mar 1 10:58:15 r2i7n24 slurmd[3419]: Launching batch job 605766 for UID 130756

Hi

I can't recreate this locally. Could you send me slurmctld.log and pam config?
Dominik

[dwhitesi@el1 ~]$ salloc -A csc000 --nodes 1 -t 04:00:00 --partition=standard
salloc: Granted job allocation 627471
srun: Step created for job 627471
[dwhitesi@r1i1n19 ~]$ srun sleep 6000
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:627471.1 task 0: running
^Csrun: sending Ctrl-C to job 627471.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[dwhitesi@r1i1n19 ~]$ ^C
[dwhitesi@r1i1n19 ~]$ ^C
[dwhitesi@r1i1n19 ~]$

Shell still active.

[dwhitesi@el1 ~]$ ssh r1i1n19
Warning: Permanently added 'r1i1n19' (RSA) to the list of known hosts.
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

[root@r1i1n19 ~]# cat /etc/pam.d/sshd
#%PAM-1.0
auth       required     pam_sepermit.so
auth       substack     password-auth
auth       include      postlogin
# Used with polkit to reauthorize users in remote sessions
-auth      optional     pam_reauthorize.so prepare
account    required     pam_nologin.so
account    include      password-auth
account    requisite    pam_slurm_adopt.so
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    include      password-auth
session    include      postlogin
# Used with polkit to reauthorize users in remote sessions
-session   optional     pam_reauthorize.so prepare

Going to attach the slurmctld.log in a minute.....

Created attachment 9560 [details]
slurmctld.log, I triggered the bug at around 10:26
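Background worth keeping in mind for the rest of the thread (summarized from the pam_slurm_adopt and slurm.conf documentation, not stated explicitly in this ticket): pam_slurm_adopt works by adopting the incoming sshd process into the job's "extern" step, so login is denied whenever no extern step survives on the node. The extern step is only created when PrologFlags includes Contain, which the X11 flag in the configuration above implies, and the module's own options, such as its log level, go directly on its line in /etc/pam.d/sshd. A sketch of both:

```
# slurm.conf -- X11 implies Contain, which launches a [jobid].extern step per node
PrologFlags=X11

# /etc/pam.d/sshd -- raise pam_slurm_adopt verbosity (values match SlurmdDebug levels)
account    requisite    pam_slurm_adopt.so log_level=debug5
```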
Hi

Could you send me /etc/pam.d/password-auth? If you have a line like "-session optional pam_systemd.so" in this file, could you try to stop systemd-logind (systemctl stop systemd-logind) and check if this helps? Could you increase the debug level of slurmd and slurmstepd to debug3, set log_level for pam_slurm_adopt.so in /etc/pam.d/sshd, and then repeat your test?

Dominik

The systemd line is commented out; I still tried stopping logind but it didn't help. Can you provide more details of how I would set log_level for pam_slurm_adopt.so in /etc/pam.d/sshd?

[dwhitesi@r1i5n28 ~]$ cat /etc/pam.d/password-auth
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        sufficient    pam_unix.so try_first_pass nullok
auth        required      pam_deny.so
account     required      pam_unix.so
password    requisite     pam_pwquality.so try_first_pass local_users_only retry=3 authtok_type=
password    sufficient    pam_unix.so try_first_pass use_authtok nullok sha512 shadow
password    required      pam_deny.so
session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
#-session   optional      pam_systemd.so
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      pam_unix.so

The test was:

[d@el1 ~]$ salloc -A XXXXX --nodes 1 -t 04:00:00
salloc: Granted job allocation 641839
srun: Step created for job 641839
[dwhitesi@r1i5n28 ~]$ srun sleep 600
^Csrun: interrupt (one more within 1 sec to abort)   <-- I pressed control-c twice in a row quickly
srun: step:641839.1 task 0: running
^Csrun: sending Ctrl-C to job 641839.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[dwhitesi@r1i5n28 ~]$

Then from another login shell:

[dwhitesi@el1 ~]$ ssh r1i5n28
Warning: Permanently added 'r1i5n28' (RSA) to the list of known hosts.
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

I did the test with the increased log level; this is what I see.
slurmd: debug2: _group_cache_lookup_internal: no entry found for dwhitesi
slurmd: debug: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd: debug3: task/affinity: slurmctld s 2 c 18; hw s 2 c 18 t 1
slurmd: debug3: task/affinity: job 641862.1 core mask from slurmctld: 0xFFFFFFFFF
slurmd: debug3: task/affinity: job 641862.1 CPU final mask for local node: 0xFFFFFFFFF
slurmd: debug3: _lllp_map_abstract_masks
slurmd: debug: binding tasks:1 to nodes:1 sockets:2:0 cores:36:0 threads:36
slurmd: lllp_distribution jobid [641862] auto binding off: mask_cpu,one_thread
slurmd: debug: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0xFFFFFFFFF)
slurmd: debug: Waiting for job 641862's prolog to complete
slurmd: debug: Finished wait for job 641862's prolog to complete
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (r1i3n18), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_p_slurmd_reserve_resources: 641862
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug4: found jobid = 641862, stepid = 1
slurmd: debug4: found jobid = 641862, stepid = 0
slurmd: debug4: found jobid = 641862, stepid = 4294967295
slurmd: debug: _rpc_signal_tasks: sending signal 2 to entire job 641862 flag 255
slurmd: debug4: found jobid = 641862, stepid = 1
slurmd: debug4: found jobid = 641862, stepid = 0
slurmd: debug4: found jobid = 641862, stepid = 4294967295
slurmd: debug2: container signal 2 to job 641862.1
slurmd: debug2: container signal 2 to job 641862.0
slurmd: debug2: container signal 2 to job 641862.4294967295
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug4: found jobid = 641862, stepid = 1
slurmd: debug4: found jobid = 641862, stepid = 0
slurmd: debug4: found jobid = 641862, stepid = 4294967295
slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 641862.1 flag 0
slurmd: debug2: container signal 9 to job 641862.1

I looked at the source for pam_slurm_adopt.so and figured it out; below is what it logged with the increased log_level=debug5.
Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: debug: Reading cgroup.conf file /nopt/slurm/etc/cgroup.conf
Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: debug: Reading slurm.conf file: /nopt/slurm/etc/slurm.conf
Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: debug4: found jobid = 641891, stepid = 0
Mar 20 16:10:30 r5i3n2 pam_slurm_adopt[282344]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
Mar 20 16:10:30 r5i3n2 sshd[282344]: fatal: Access denied for user dwhitesi by PAM account configuration [preauth]

* David

Hi

I can't recreate exactly the same behavior, but with your config it is possible to start the first step 0 before creating the external step. Could you try to add "salloc_wait_nodes" to your SchedulerParameters or, easier for testing, add "--wait-all-nodes=1" to the salloc options? Could you try to find the slurmstepd log? Could you check whether the extern step exists in the allocation before submitting srun? e.g.:

[bart@centos01 ~]$ ps -ef | grep stepd
root      372    1  0 14:03 ?     00:00:00 slurmstepd: [719.extern]
root      378    1  0 14:03 ?     00:00:00 slurmstepd: [719.0]
bart      424  384  0 14:03 pts/1 00:00:00 grep --color=auto stepd

Dominik

I think I have found the cause of this problem. We are working on the fix.

Dominik

FYI, I see this in batch jobs as well. I'm running a pdsh at the beginning and end of my benchmark jobs to grab some processor info.
I cleared out my .ssh/known_hosts, but am getting this on some of the hosts (not all):

Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

Hi

Similar to comment 10, could you check if this still occurs after adding "--wait-all-nodes=1" to sbatch? Could you also check if the extern step exists on those nodes?

Dominik

Sorry, that's the problem with doing tickets early in the morning. I have set #SBATCH --wait-all-nodes=1 in the submission script and still have the issue. And if it isn't clear, this is the same system.

Hi
Let's try to fix the first issue first.
Could you try to apply attachment 9654 [details] and check if it fixes the issue from the initial comment?
Dominik
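As an aside, the slurmd traces earlier in the thread already point at the mechanism: on Ctrl-C the SIGINT is delivered "to entire job 641862 flag 255", i.e. to every container in the allocation rather than only the step being cancelled, which would also take down the extern step that pam_slurm_adopt depends on (and the fix attachment is named "fix uninitialized flags"). A small sketch for pulling that signal sequence out of a slurmd log; the sample lines below are copied from the excerpts quoted in this ticket, and on a real node you would grep the live slurmd log instead of a temp file:

```shell
#!/bin/sh
# Sketch: filter the signal-delivery sequence for one job out of a slurmd log.
# The sample log is reconstructed from the excerpts in this ticket.
log=$(mktemp)
cat > "$log" <<'EOF'
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: _rpc_signal_tasks: sending signal 2 to entire job 641862 flag 255
slurmd: debug2: container signal 2 to job 641862.1
slurmd: debug2: container signal 2 to job 641862.0
slurmd: debug2: container signal 2 to job 641862.4294967295
slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 641862.1 flag 0
slurmd: debug2: container signal 9 to job 641862.1
EOF

# "entire job ... flag 255" shows the signal reached every container, not just step 1
grep -E '_rpc_signal_tasks|container signal' "$log" | grep 641862
rm -f "$log"
```

The same two-pattern grep against /var/log (or wherever SlurmdLogFile points) on an affected node should show whether a Ctrl-C produced a whole-job signal or a single-step one.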
What would need to be recompiled? slurmd, slurmctld, the pam module?

- David

Hi

You need to recompile srun and the libs.

Dominik

Because it's a production cluster I had to test carefully; I applied the patch to a locally compiled Slurm (in my home directory) and tested from there. I did the below and it did not resolve the issue.

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dwhitesi/slurm-slurm-18-08-3-1/src/api/.libs/
$ salloc -n1 --account=csc000 --time=30
[dwhitesi@r1i0n3 slurm-slurm-18-08-3-1]$ /home/dwhitesi/slurm-slurm-18-08-3-1/src/srun/.libs/srun sleep 1000
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:670011.1 task 0: running
^Csrun: sending Ctrl-C to job 670011.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

From another shell...

$ ssh r1i0n3
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

We finally got a test cluster set up and were able to reproduce the issue on it. When I recompiled all of Slurm using the patch (not just srun), it seems to have fixed the issue.

Thanks!

(In reply to david.whiteside from comment #24)
> We finally got a test cluster setup and were able to reproduce the issue on
> it. When I recompiled all of slurm using the patch (not just srun), it
> seems to have fixed the issue.
>
> Thanks!

Hi

I'm glad to hear that, because I haven't got any more ideas about how to fix it. We have found other places in the API where we can have similar issues. The final patch that will be committed to the repo will look slightly different; I will keep you informed about progress.

Dominik

Hi

We have finally committed patches that should solve this issue.

https://github.com/SchedMD/slurm/commit/7fa46453ccdc83
https://github.com/SchedMD/slurm/commit/4728d0e816f92d

Those patches will be included in 18.08.8. Let me know if we can close this bug now.

Dominik

Thanks. You can close this bug.
* David