When using nss_slurm to pass group info, I am unable to use sudo on the node. I get:
sudo: PAM account management error: Authentication service cannot retrieve authentication info
This may be a case of "you can't do that".
Situation:
Nodes are domain-joined with realmd.
sssd is used for caching.
Our HPC-ADMINS group is part of the domain groups, so it is in sudoers as:
%FREMONT\\HPC-ADMINS ALL = (ALL) NOPASSWD: ALL
When I get on a node with srun, I can see all my uid/gids are correct. I can also do 'newgrp' to make hpc-admins my primary group; however, any attempt to use sudo results in the error message above.
From nsswitch.conf:
passwd: slurm files sss
shadow: slurm files sss
group: slurm files sss
If I remove slurm from the passwd section (or put it after sss), it will work.
Would you be able to elaborate on what situation you are calling sudo in? Is this from an ssh session to the node or is it an "srun sudo" kind of situation? In either case, would you be able to get an strace of sudo running something innocuous (eg sudo echo testing) as root on a node and attach the trace? Thanks! --Tim
This particular scenario is an interactive bash session: srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty bash After getting a console, I try something as simple as: $sudo hostname sudo: PAM account management error: Authentication service cannot retrieve authentication info I ran strace sudo hostname and have attached that output. It doesn't seem to provide anything of use. I get the same output when I run it on a system without nss_slurm, although the command is successful.
Created attachment 16209 [details] output from 'strace sudo hostname'
Hi! I've been doing some digging on this and been trying to reproduce, but sudo has been working for me so far (though my setup isn't currently dealing with SSSD). I do have a couple questions though that might help me track it down. What OS are you running? From the slurm controller as root do both "getent passwd $username" and "getent shadow $username" return something? And do you have selinux configured/running? Thanks! --Tim
Tim,
I am guessing you mean to use my username for $username; otherwise that is not set and it returns everyone in the /etc/passwd and /etc/shadow files (enumeration is disabled in Active Directory, so you can't get everyone). So for my account it shows:
07:00:41-root@slurmmaster01:~$ getent passwd andrubr
andrubr:*:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
07:00:45-root@slurmmaster01:~$ getent shadow andrubr
07:00:51-root@slurmmaster01:~$
(Active Directory does not do shadow.)
And selinux is disabled throughout the cluster.
Hi Brian, Would you be able to check in /var/log/auth.log after a failed sudo attempt for an error like "error retrieving Slurm step info"? Thanks! -Tim
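The log location differs between distributions (/var/log/auth.log on some, /var/log/secure on others), so a small helper that searches whichever candidate files exist can save a round trip. This is only a sketch: the function name `grep_logs` is hypothetical, and it assumes a grep that supports `-H` (GNU and BSD grep both do).

```shell
# Hypothetical helper: search every readable file in the argument list for a
# fixed pattern, prefixing each match with the file name. Missing files are
# skipped silently.
grep_logs() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -r "$f" ] && grep -H -- "$pattern" "$f"
    done
    return 0
}
```

For example: grep_logs 'error retrieving Slurm step info' /var/log/auth.log /var/log/secure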
I have no /var/log/auth.log, but I do see in /var/log/secure and the journal:
Oct 20 14:11:26 gen-test-01 sudo: andrubr : PAM account management error: Authentication service cannot retrieve authentication info ; TTY=pts/0 ; PWD=/opt/hpc-admin/scripts ; USER=root ; COMMAND=/bin/hostname
My entire slurmd.log for the session is:
[2020-10-20T14:09:55.119] Message aggregation disabled
[2020-10-20T14:09:55.133] topology NONE plugin loaded
[2020-10-20T14:09:55.141] route default plugin loaded
[2020-10-20T14:09:55.143] CPU frequency setting not configured for this node
[2020-10-20T14:09:55.184] Munge credential signature plugin loaded
[2020-10-20T14:09:55.186] slurmd version 20.02.5 started
[2020-10-20T14:09:55.281] slurmd started on Tue, 20 Oct 2020 14:09:55 +0000
[2020-10-20T14:09:55.282] CPUs=2 Boards=1 Sockets=1 Cores=2 Threads=1 Memory=3950 TmpDisk=29703 Uptime=61 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2020-10-20T14:10:50.577] _run_prolog: run job script took usec=350
[2020-10-20T14:10:50.577] _run_prolog: prolog with lock for job 587138 ran for 0 seconds
[2020-10-20T14:10:50.653] [587138.extern] Munge credential signature plugin loaded
[2020-10-20T14:10:52.869] launch task 587138.0 request from UID:10043871 GID:1644000513 HOST:10.49.32.26 PORT:20660
[2020-10-20T14:10:52.887] [587138.0] Munge credential signature plugin loaded
[2020-10-20T14:10:52.898] [587138.0] in _window_manager
[2020-10-20T14:10:52.900] [587138.0] debug level = 2
[2020-10-20T14:10:52.901] [587138.0] starting 1 tasks
[2020-10-20T14:10:52.902] [587138.0] task 0 (2392) started 2020-10-20T14:10:52
[2020-10-20T14:15:44.789] [587138.0] task 0 (2392) exited with exit code 1.
[2020-10-20T14:15:44.858] [587138.0] done with job
[2020-10-20T14:15:44.861] [587138.extern] Sent signal 18 to 587138.4294967295
[2020-10-20T14:15:44.861] [587138.extern] Sent signal 15 to 587138.4294967295
[2020-10-20T14:15:44.866] [587138.extern] done with job
Hi Brian, Unfortunately these logs and further attempts at reproducing haven't really tracked down a good reason for this yet. Just to confirm that nss_slurm and sss are both doing what we expect them to, would you be able to run the following: srun bash -c 'for i in slurm files sss; do echo "$i passwd: $(getent -s $i passwd $USER)"; echo "$i group: $(getent -s $i group $USER)"; done' The goal is to make sure that nss_slurm, files, and sss are all returning sane values under normal conditions. Would you also be able to tell me what OS you are running, and if you are experiencing this in srun, sbatch, or both? Thanks, --Tim
I added '-n 1' to the srun command since it was running on all the cores otherwise :)
[andrubr@gen-test-01 ~]$ srun -n1 bash -c 'for i in slurm files sss; do echo "$i passwd: $(getent -s $i passwd $USER)"; echo "$i group: $(getent -s $i group $USER)"; done'
slurm passwd: andrubr:x:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
slurm group:
files passwd:
files group:
sss passwd: andrubr:*:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
sss group:
This is as expected. My default group is not my name ("Domain Users") and my account is in Active Directory, so nothing in files. When I change the group to be Domain\ Users, I get:
[andrubr@gen-test-01 ~]$ srun -n1 bash -c 'for i in slurm files sss; do echo "$i passwd: $(getent -s $i passwd $USER)"; echo "$i group: $(getent -s $i group Domain\ Users)"; done'
slurm passwd: andrubr:x:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
slurm group:
files passwd:
files group:
sss passwd: andrubr:*:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
sss group: domain
users:*:1644000513:andrubr,bd4adm,cd4adm,gd4adm,sapserviceatp,sapserviceatq,huangxi1,sybsjp,sybbv4,sybsjq,c4padm,at2adm,atqadm,ad2adm,atdadm,atpadm,atxadm,svc_ibpadmin,migtest01,svc_ibpagent,svc_covbuilds,hylinca,bq5adm,monahde,sc4adm,chakaiv,vmartirosyan,sybsep,sybseq,sybsed,sepadm,seqadm,sedadm,qa5adm,asoussou,rhathwar,liuje2,sybids,idsadm,tdesrues,scalderon,jmcintyre,qdtadm,lcarver,jhuang,jlapidas,welham,gaosu,sxliu,lindas,ken,dfried,bvandyk,bvincent,yhan,yyan,vliberman,vhuber,vallampalli,tomk,srouvillois,rmiller,rramachandran,orenaud,makbulut,mlee,mlevin,matt,jlehto,jasselot,ifavorskiy,dfox,dsieger,dfaken,aparent,apap,akunwar,asinding,alevin,cq2adm,gq2adm,qa2adm,pisadm,walshpe,bwpadm,sybbjp,cexadm,cd2adm,clearcase_aldb,bq2adm,sapadm,edide2,qa3adm,sybbjd,bd2adm,bdvadm,gd2adm,axqadm,axpadm,axdadm,bjdadm,crdadm,cq4adm,bq4adm,sybbdv,wddadm,prdadm,slqadm,sybbqv,smdadm,cedadm,bwqadm,ediqa2,ceqadm,gq4adm,axsadm,sapservicewdd,codadm,bjpadm,de2adm,qa4adm,crpadm,bjqadm,crqadm,cepadm,slxadm,bqvadm,ediprd,lq4adm,bwdadm,gtdadm,gtqadm,ld2adm,bv4adm,gtpadm,sapserviceslx,sapserviceslq,qasadm,devadm,coqadm So, it may be group membership for the slurm_nss that is an issue. 
when I do an id on me, I have: [andrubr@gen-test-01 ~]$ id andrubr uid=10043871(andrubr) gid=1644000513(domain users) groups=1644000513(domain users),1644179569(distlist-all lam - fremont (contractors temps only)),10097910(nobody),1644183490(vpn_it_contractors_and_consultants),1644068025(nobody),1644067776(gis all (na and international)),1644065583(nobody),10054405(ad-confluence-external),1644065584(nobody),10059363(semulator3d-mupm-users),10054720(ad-bitbucket-external),1644043879(all lam fremont users),1644063358(gis all north america),10041742(semu_internal_users),1644195010(endar major events),10054404(ad-jira-external),10063511(nobody),10041696(confluence-users),1644068121(nobody),10041493(nobody),10079848(nobody),10084119(lamrc_csc_read),10049453(hpc-admins),10041693(jira-users),10030561(semu_users),10037519(hpc-mapd),1644068028(nobody),1644075644(nobody),1644179870(recordedmeeting_change),1644077329(mb - gis calendar),1644043888(nobody),1644043833(nobody),10008343(nobody),1644075673(messagestats web),1644015988(cond_rd read),1644075642(citrix cellfusion access),1644079866(citrix sapgui access),1644149504(fs_30010r_mfg_photoarchive),1644165789(fs_51010r),1644079871(citrix word access),1644071400(#intranet - global read),1644051065(fs_public),1644164176(fs_30010r_tool_data_logging),1644071402(#intranet - phonebook),1644079868(citrix ncr access),1644079869(citrix ie8 apps access),10004486(gis_test_01_read),1644021511(10061c),1644079867(citrix be access),1644079872(citrix excel access) Hope that helps
Thanks Brian! The group entries are case sensitive, so it's possible that "getent -s $i group Domain\ Users" just isn't matching, since from SSSD it seems to be lowercase "domain users". Can you try just:
srun -n1 getent -s slurm group
and see what that returns? It *should* be a list of all the groups you are a member of. I'm working on a debug patch that might help track this down better, particularly if getent -s slurm group doesn't return reasonable results. Something else to note: you don't need to change shadow in nsswitch.conf; we don't actually implement anything to ship shadow around. I've seen some weird bugs with sudo and shadow being odd, so it might be worth changing that back to just "files sss". Thanks, and sorry about all the back and forth! --Tim
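One way to check whether case alone explains a failed group lookup is to compare names case-insensitively against the group-format lines that getent prints. The following is a minimal sketch, not part of nss_slurm: the helper name `find_group_ci` is hypothetical, and it assumes group(5)-style lines (name:passwd:gid:members) on stdin.

```shell
# Hypothetical helper: print every group(5)-style line from stdin whose name
# (the first colon-separated field) equals $1 when case is ignored.
# The name is passed through the environment so that backslashes in it
# (e.g. DOMAIN\group) are not treated as awk escape sequences.
find_group_ci() {
    want="$1" awk -F: 'tolower($1) == tolower(ENVIRON["want"])'
}
```

For example: srun -n1 getent -s slurm group | find_group_ci 'Domain Users'
If this prints a line while the exact-case getent lookup prints nothing, the mismatch is down to case.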
Out put as requested: [andrubr@gen-test-01 ~]$ srun -n1 getent -s slurm group domain users:x:1644000513:andrubr distlist-all lam - fremont (contractors temps only):x:1644179569:andrubr vpn_it_contractors_and_consultants:x:1644183490:andrubr nobody:x:1644068025:andrubr gis all (na and international):x:1644067776:andrubr nobody:x:1644065583:andrubr ad-confluence-external:x:10054405:andrubr nobody:x:1644065584:andrubr semulator3d-mupm-users:x:10059363:andrubr ad-bitbucket-external:x:10054720:andrubr all lam fremont users:x:1644043879:andrubr gis all north america:x:1644063358:andrubr semu_internal_users:x:10041742:andrubr endar major events:x:1644195010:andrubr ad-jira-external:x:10054404:andrubr nobody:x:10063511:andrubr confluence-users:x:10041696:andrubr nobody:x:1644068121:andrubr nobody:x:10041493:andrubr nobody:x:10079848:andrubr lamrc_csc_read:x:10084119:andrubr hpc-admins:x:10049453:andrubr jira-users:x:10041693:andrubr semu_users:x:10030561:andrubr hpc-mapd:x:10037519:andrubr nobody:x:1644068028:andrubr nobody:x:1644075644:andrubr recordedmeeting_change:x:1644179870:andrubr mb - gis calendar:x:1644077329:andrubr nobody:x:1644043888:andrubr nobody:x:1644043833:andrubr nobody:x:10008343:andrubr messagestats web:x:1644075673:andrubr cond_rd read:x:1644015988:andrubr citrix cellfusion access:x:1644075642:andrubr citrix sapgui access:x:1644079866:andrubr fs_30010r_mfg_photoarchive:x:1644149504:andrubr fs_51010r:x:1644165789:andrubr citrix word access:x:1644079871:andrubr #intranet - global read:x:1644071400:andrubr fs_public:x:1644051065:andrubr fs_30010r_tool_data_logging:x:1644164176:andrubr #intranet - phonebook:x:1644071402:andrubr citrix ncr access:x:1644079868:andrubr citrix ie8 apps access:x:1644079869:andrubr gis_test_01_read:x:10004486:andrubr 10061c:x:1644021511:andrubr citrix be access:x:1644079867:andrubr citrix excel access:x:1644079872:andrubr Looks good, I think. Only thing to note is that everything is lowercase. 
I removed slurm from the shadow line in nsswitch.conf as well. Of note: I also tried changing sudoers to just have my name with the same errors. However, if I ssh to the node as root and then sudo to myself and run ‘sudo hostname’ it does work. I imagine that is because that session is not attached to the job.
Created attachment 16472 [details] debug patch Hi Brian, That all does look good, and I am very much wondering if case is an issue here... though I found your note at the end particularly interesting. I've attached a debug patch that will be a little more verbose on the command line about what user nss_slurm found and how the internal functions got called. This is just a patch to nss_slurm. To get the output we need to see what it is doing, I needed to wrap the sudo command a bit with "srun -n1 bash -c 'sudo hostname'" Let me know if you get the chance to try this! --Tim
Ok, I got it built/installed and ran both with “bash -c ‘sudo hostname’” and in a “--pty bash” interactive session. Results for both below: 18:18:50-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test bash -c 'sudo hostname' _pw_internal: uid:10043871 name:(null) _pw_internal: user (null)(10043871) not found in Job:616232 Step:-1 _pw_internal: user andrubr(10043871) found in Job:616232 Step:0 _pw_internal: uid:-2 name:andrubr _pw_internal: user andrubr(-2) not found in Job:616232 Step:-1 _pw_internal: user andrubr(10043871) found in Job:616232 Step:0 _gr_internal: gid:-2 name:(null) _gr_internal: no groups found in Job:616232 Step:-1 _gr_internal: groups found in Job:616232 Step:0 _gr_internal: gid:-2 name:FREMONT\HPC-ADMINS _gr_internal: no groups found in Job:616232 Step:-1 _gr_internal: no groups found in Job:616232 Step:0 _gr_internal: could not find groups in any step _pw_internal: uid:-2 name:andrubr _pw_internal: user andrubr(-2) not found in Job:616232 Step:-1 _pw_internal: user andrubr(10043871) found in Job:616232 Step:0 _pw_internal: uid:-2 name:andrubr _pw_internal: user andrubr(-2) not found in Job:616232 Step:-1 _pw_internal: user andrubr(10043871) found in Job:616232 Step:0 sudo: PAM account management error: Authentication service cannot retrieve authentication info srun: error: gen-test-01: task 0: Exited with exit code 1 18:19:18-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty bash _pw_internal: uid:10043871 name:(null) _pw_internal: user andrubr(10043871) found in Job:616233 Step:0 _gr_internal: gid:1644000513 name:(null) _gr_internal: groups found in Job:616233 Step:0 _pw_internal: uid:10043871 name:(null) _pw_internal: user andrubr(10043871) found in Job:616233 Step:0 [andrubr@gen-test-01 ~]$ sudo hostname _pw_internal: uid:10043871 name:(null) _pw_internal: user andrubr(10043871) found in Job:616233 Step:0 _pw_internal: uid:-2 
name:andrubr _pw_internal: user andrubr(10043871) found in Job:616233 Step:0 _gr_internal: gid:-2 name:(null) _gr_internal: groups found in Job:616233 Step:0 _gr_internal: gid:-2 name:FREMONT\HPC-ADMINS _gr_internal: no groups found in Job:616233 Step:0 _gr_internal: no groups found in Job:616233 Step:-1 _gr_internal: could not find groups in any step _pw_internal: uid:-2 name:andrubr _pw_internal: user andrubr(10043871) found in Job:616233 Step:0 _pw_internal: uid:-2 name:andrubr _pw_internal: user andrubr(10043871) found in Job:616233 Step:0 sudo: PAM account management error: Authentication service cannot retrieve authentication info [andrubr@gen-test-01 ~]$ exit srun: error: gen-test-01: task 0: Exited with exit code 1
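The debug patch output is easier to scan when reduced to the requests that were not satisfied in any step. The sketch below is an illustration only: the helper name `failed_lookups` is hypothetical, and it assumes the patch prints one message per line in the order shown in this comment (a request line such as "_gr_internal: gid:-2 name:...", followed by per-step results).

```shell
# Hypothetical helper: read nss_slurm debug-patch output on stdin and print
# the request line of every group lookup that ended with
# "could not find groups in any step".
failed_lookups() {
    awk '
        # Remember the most recent request line (uid:/gid: plus name:).
        /^_(pw|gr)_internal: (uid|gid):/ { request = $0; next }
        # When the lookup failed everywhere, report the remembered request.
        /could not find groups in any step/ { print request }
    '
}
```

For example: srun -n1 bash -c 'sudo hostname' 2>&1 | failed_lookups
On the output in this comment, the only request reported would be the lookup by name of FREMONT\HPC-ADMINS.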
Thank you for the output! I'm looking it over along with the nss_slurm and sudo source to see if I can determine what is going on here, but it looks like nss_slurm is finding everything it should be. Looking up FREMONT\HPC-ADMINS should fail in nss_slurm, but it should fall back to sss to find it. Right now, it seems a lot like all the information is there and retrievable, so it's not very clear what is causing PAM to error out like that. The only other thing I think I might need is /etc/pam.d/sudo and any file it may include, to rule out something going on in PAM. Thanks for your patience with this; it doesn't seem clear what is preventing this from working. --Tim
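The fallback described here can be checked from the command line by asking each NSS source in turn, in the same order as the nsswitch.conf line, and reporting the first one that answers. This is a rough sketch under stated assumptions: the function name `first_source` and the `GETENT` override are hypothetical (the override exists only so the function can be exercised without a real directory), and the loop does not model nsswitch.conf action items such as [NOTFOUND=return].

```shell
# Hypothetical helper: print the first NSS source that answers a lookup.
# Usage: first_source DATABASE KEY SOURCE...
# Relies on the real "getent -s SOURCE DATABASE KEY" form; set GETENT to
# substitute another command for testing.
first_source() {
    db=$1 key=$2
    shift 2
    for src in "$@"; do
        if "${GETENT:-getent}" -s "$src" "$db" "$key" >/dev/null 2>&1; then
            echo "$src"
            return 0
        fi
    done
    return 1
}
```

For example: srun -n1 bash -c 'first_source group "FREMONT\\HPC-ADMINS" slurm files sss' (with the function defined in the job's shell) would be expected to print sss if the fallback behaves as described.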
I don’t see much there, but here they are:
[root@gen-f16-39 resource]# cat /etc/pam.d/sudo
#%PAM-1.0
auth include system-auth
account include system-auth
password include system-auth
session optional pam_keyinit.so revoke
session include system-auth
[root@gen-f16-39 resource]# cat /etc/pam.d/system-auth
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth required pam_env.so
auth required pam_faildelay.so delay=2000000
auth sufficient pam_fprintd.so
auth [default=1 ignore=ignore success=ok] pam_succeed_if.so uid >= 1000 quiet
auth [default=1 ignore=ignore success=ok] pam_localuser.so
auth sufficient pam_unix.so nullok try_first_pass
auth requisite pam_succeed_if.so uid >= 1000 quiet_success
auth sufficient pam_sss.so forward_pass
auth required pam_deny.so
account required pam_unix.so
account sufficient pam_localuser.so
account sufficient pam_succeed_if.so uid < 1000 quiet
account [default=bad success=ok user_unknown=ignore] pam_sss.so
account required pam_permit.so
password requisite pam_pwquality.so try_first_pass local_users_only retry=3 authtok_type= ucredit=-1 lcredit=-1 dcredit=-1 ocredit=-1
password sufficient pam_unix.so sha512 shadow nullok try_first_pass use_authtok
password sufficient pam_sss.so use_authtok
password required pam_deny.so
session optional pam_keyinit.so revoke
session required pam_limits.so
#-session optional pam_systemd.so
session optional pam_oddjob_mkhomedir.so umask=0077
session [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session required pam_unix.so
session optional pam_sss.so
So I just upgraded to slurm 20.11.0 Now the nss_slurm does not work at all: $ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test bash -c 'sudo hostname' sudo: PAM account management error: Authentication service cannot retrieve authentication info srun: error: gen-test-01: task 0: Exited with exit code 1 If I remove 'slurm' from the passwd line in nsswitch.conf, it works.
(In reply to Brian Andrus from comment #21) > So I just upgraded to slurm 20.11.0 > Now the nss_slurm does not work at all: > > $ srun --nodes=1 --exclusive --account=root --time=24:00:00 > --partition=gen-test bash -c 'sudo hostname' > sudo: PAM account management error: Authentication service cannot retrieve > authentication info > srun: error: gen-test-01: task 0: Exited with exit code 1 > > > If I remove 'slurm' from the passwd line in nsswitch.conf, it works. I just tested if nss_slurm works in general with 20.11 and it seems to. I noticed in your command that you ran "sudo hostname" which I think was the initial problem. Does something like "getent -s slurm passwd" return correctly from srun or sbatch?
Yep, that was me just cutting and pasting. So the same errors still occur under the same circumstances.
Thanks Brian, I just wanted to make sure we were in basically the same place!
I was looking through all this again, and I was wondering if you could try just using or adding "%hpc-admins ALL = (ALL) NOPASSWD: ALL" in the sudoers file (+ instead of % may be required, depending on the exact versions of sudo/sssd at play). It looks like the group is being properly displayed as one you are a member of, and it looks like nss_slurm knows that, or is at least able to get that information. In a recent enough version of SSSD, it should actually do a bunch of the import work for you and it might be the only line required. Let me know what you think! --Tim
Hmm. I cannot get a pty anymore with the updated slurm 20.11.0:
11:20:53-andrubr@slurmmaster01:/$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
srun: error: gen-test-01: task 0: Segmentation fault
But I do still get the authentication error if I try to sudo directly:
11:20:31-andrubr@slurmmaster01:/$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test sudo hostname
sudo: PAM account management error: Authentication service cannot retrieve authentication info
srun: error: gen-test-01: task 0: Exited with exit code 1
If I remove ‘slurm’ from nsswitch.conf, I can get the pty:
11:23:32-andrubr@slurmmaster01:/$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
[andrubr@gen-test-01 /]$
FWIW, I am still able to use sudo with that change to the sudoers; I just can’t be under the nss_slurm environment.
(In reply to Brian Andrus from comment #26)
> Hmnm. I cannot get a pty anymore with the updated slurm 20.11.0 update:
> 11:20:53-andrubr@slurmmaster01:/$ srun --nodes=1 --exclusive --account=root
> --time=24:00:00 --partition=gen-test --pty /bin/bash
> srun: error: gen-test-01: task 0: Segmentation fault
I certainly wouldn't expect that to segfault! Would you be able to open a new bug with the slurmd logs (at debug4) and a backtrace of the slurmstepd (if possible)? This is largely just to not muddy this ticket with what appears to be a new/different problem.
> But I do still get the authentication error if I try to sudo directly:
> 11:20:31-andrubr@slurmmaster01:/$ srun --nodes=1 --exclusive --account=root
> --time=24:00:00 --partition=gen-test sudo hostname
> sudo: PAM account management error: Authentication service cannot retrieve
> authentication info
> srun: error: gen-test-01: task 0: Exited with exit code 1
Can we do a quick sanity check on this version that "getent -s slurm passwd" and "getent -s slurm group" still return the same as they did before? If they differ please attach the new one, but if they are effectively the same just let me know. I'm mostly interested in making sure nss_slurm still ships the same "hpc-admins" line in groups.
> FWIW, I am able to use sudo still with that change to the sudoers, I just
> can’t be under the nss_slurm environment
Thank you, that's useful information on this too!
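The before/after sanity check requested above could be scripted roughly as follows. This is only a sketch: `lookup_changed` and the `/tmp/group.*` file names are hypothetical, and it assumes the earlier "getent -s slurm group" output was saved to a file from inside a job step.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): compare two saved lookup dumps,
# ignoring line order. Prints "same" or "differs"; returns 0 or 1.
lookup_changed() {
    before_sorted=$(sort "$1")
    after_sorted=$(sort "$2")
    if [ "$before_sorted" = "$after_sorted" ]; then
        echo "same"
        return 0
    fi
    echo "differs"
    return 1
}

# Example, run inside a job step so that nss_slurm answers the lookup
# (file names are assumptions):
#   getent -s slurm group > /tmp/group.after
#   lookup_changed /tmp/group.before /tmp/group.after
```

Sorting first means a change in the order of returned entries alone is not reported as a difference.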
Created attachment 16774 [details] bug9978 debug (20.11)
I've attached a new version of the debug patch that should work with 20.11. Would you be able to try this, with the debug patch and the sudoers change I suggested before? I'd like to see if being able to look that group up with nss_slurm changes the result in any significant way. Particularly, while looking things over I noticed that we didn't log any attempts to look up the root user, and I'd like to see if those attempts will appear with a (likely) successful group lookup. Thanks! --Tim
Ok, I built and tried the patch. If I add 'slurm' to nsswitch.conf, I get segfaults:
12:02:20-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
srun: error: gen-test-01: task 0: Segmentation fault
_pw_internal: uid:10043871 name:(null)
It errors even if I try to ssh to the box as root:
12:02:42-root@slurmmaster01:~$ ssh gen-test-01
Last login: Wed Dec 2 19:59:17 2020 from 10.49.32.26
_gr_internal: gid:-2 name:(null)
_gr_internal: could not find groups in any step
[root@gen-test-01 ~]#
Hey Brian, We've been working on the segfault issue and think we have the problem tracked down. We are working on an acceptable solution for it! Just wanted to update you on progress! Thanks, --Tim
A fix for the segfault issue should be included in 20.11.1. Please let me know if this resolves the issue for you and we can keep looking at why sudo doesn't work :) Thanks! --Tim
Looks happier as far as the segfault. Back to the same old issue:
13:42:06-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test bash -c 'sudo hostname'
sudo: PAM account management error: Authentication service cannot retrieve authentication info
srun: error: gen-test-01: task 0: Exited with exit code 1
13:42:09-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
[andrubr@gen-test-01 ~]$ sudo hostname
sudo: PAM account management error: Authentication service cannot retrieve authentication info
[andrubr@gen-test-01 ~]$ rpm -q slurm
slurm-20.11.1-1.el7.x86_64
[andrubr@gen-test-01 ~]$ exit
Brian
Hey Brian, I just wanted to let you know I'm still looking into this. I've been looking for different possibilities and I'm really wondering if there is something funny going on in sssd/nss land. Would you be able to share your sssd config as well? Thanks, --Tim
So I think I am on to something that may affect this as well.
It seems you cannot use pam_slurm_adopt AND nss_slurm.
pam_slurm_adopt is unable to identify any user that has a job running under credentials from nss_slurm.
I suspect that is the same thing that is happening with sudo.
(In reply to Brian Andrus from comment #34)
> So I think I am on to something that may affect this as well.
>
> It seems you cannot use pam_slurm_adopt AND nss_slurm
>
> pam_slurm_adopt is unable to identify any user that has a job running under
> credentials from nss_slurm.
>
> I suspect that is the same thing that is happening with sudo.
That's an interesting find, I didn't realize that was part of the equation! I'll see what adding that into the mix does. Thanks! --Tim
> pam_slurm_adopt is unable to identify any user that has a job running under credentials from nss_slurm.
When you say this, do you mean that nss_slurm is the only source of that user's details, or is sssd on that system as well?
Created attachment 18083 [details] 034A7F654D8D43CAAF99D26BA2F3C24B.png
In testing, slurm and files were the only sources of details configured in nsswitch.conf. If I add sss back in, things can work, but that defeats the need for nss_slurm.
nss_slurm isn't meant to be a full replacement of the central authentication. It will only respond with user/group information if the process is considered part of the "job". In the case of pam_slurm_adopt, the session would already need to be adopted. pam_slurm_adopt will have to check once against the central authentication, then find the job it can be adopted into, after which any requests will be served by nss_slurm. The only way this would impact sudo is if the sudo process was somehow escaping the job cgroup so it wouldn't be able to get the group information. We can confirm that the sudo process is in the cgroup (I think we accomplished this with the debug patch for nss_slurm, but we can check this way too):
On a test node:
  Make sudo require a password
  via srun --pty on the node, run sudo hostname
  In another shell, log in to the node as root.
  Get the pid of the sudo process
  confirm that pid is in the cgroup for the job, eg:
    grep 12354 /sys/fs/cgroup/memory/slurm/uid_1000/job_185/step_0/task_0/tasks
    12354
If the pid is there, sudo isn't escaping so it should get the group info correctly. I've been doing my recent tests on a system with pam_slurm_adopt, nss_slurm, and auth done with ldapd (sssd is not cooperating yet, so that is still in progress).
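The cgroup check described above can be sketched as a small search, which may help when the exact controller directory (memory, freezer, and so on) is not known in advance. This is an illustration only: `find_pid_cgroup` is a hypothetical helper, and it assumes a cgroup v1 layout with `tasks` files under a `slurm` directory, as in the example path above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print every cgroup v1 "tasks"
# file below ROOT, inside a path containing "slurm", that lists PID.
# Usage: find_pid_cgroup PID [ROOT]
find_pid_cgroup() {
    pid=$1
    root=${2:-/sys/fs/cgroup}
    find "$root" -type f -path '*slurm*' -name tasks 2>/dev/null |
    while read -r tasks_file; do
        # -x matches the whole line, so 1235 does not match 12354
        if grep -qx -- "$pid" "$tasks_file" 2>/dev/null; then
            echo "$tasks_file"
        fi
    done
}

# Example, as root on the node while "sudo hostname" is waiting for a
# password (the pid is an assumption):
#   find_pid_cgroup 12354
```

If nothing is printed, the pid is not in any Slurm-managed cgroup that the search could see, which would be consistent with the process having escaped the job.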
Ok, that path doesn't have a 'slurm' directory, but there is one in /sys/fs/cgroup/freezer/slurm/
Inside there, I have /sys/fs/cgroup/freezer/slurm/uid_10043871/job_782724/step_0/tasks
which does have the pid of sudo hostname.
There is also /sys/fs/cgroup/freezer/slurm/uid_10043871/job_782724/tasks
which does not have that pid.
A new datapoint: So my nsswitch.conf has:
passwd: slurm files sss
group: slurm files sss
I started an interactive job on the node and then tried to ssh to the node. I get an error:
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe
If I remove slurm from the nsswitch.conf, OR remove pam_slurm_adopt, it works as expected (so either one alone, or neither, works, but not both together). This is using slurm 20.11.4.
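Since the order of sources in nsswitch.conf keeps coming up in this thread, a quick way to print it for one database might look like the sketch below. `nss_sources` is a hypothetical helper name, and the default path is the usual location of the file.

```shell
#!/bin/sh
# Hypothetical helper: print the source list for one NSS database
# (passwd, group, shadow, ...) as configured in an nsswitch.conf file.
# Usage: nss_sources DATABASE [FILE]
nss_sources() {
    awk -v db="$1:" '
        $1 == db {
            $1 = ""            # drop the "passwd:" label
            sub(/^ +/, "")     # trim the leading separator left behind
            print
            exit
        }' "${2:-/etc/nsswitch.conf}"
}

# Example:
#   nss_sources passwd
#   nss_sources group
```

Commented-out entries are skipped because their first field does not equal the database label.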
Can you enable debug logging for pam_slurm_adopt and attach the logs? You should be able to find them in the authentication logs with the other pam logs. A line like this in the pam config should do it:
-account required pam_slurm_adopt.so log_level=debug5
Thanks, -Tim
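Once the debug level is raised, the pam_slurm_adopt entries can be pulled out of the authentication log with a simple filter. A minimal sketch follows; `adopt_log_lines` is a hypothetical name, and the default of /var/log/secure is an assumption that fits RHEL-like systems (other distributions log PAM messages elsewhere).

```shell
#!/bin/sh
# Hypothetical helper: print only the pam_slurm_adopt lines from an
# authentication log. The default path is an assumption.
# Usage: adopt_log_lines [LOGFILE]
adopt_log_lines() {
    grep -F 'pam_slurm_adopt[' "${1:-/var/log/secure}"
}

# Example, as root on the compute node:
#   adopt_log_lines > /tmp/pam_slurm_adopt.log
```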
Enabled debug logging. With pam_slurm_adopt and nss_slurm enabled here is what I get in the log: Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default) Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: slurm_conf_init: using config_file=/etc/slurm/slurm.conf Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Reading slurm.conf file: /etc/slurm/slurm.conf Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: Connection by user andrubr: user has only one job 783074 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: _adopt_process: trying to get StepId=783074.extern to adopt 5357 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_add_extern_pid Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_get_x11_display Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: Process 5357 adopted into job 783074 Feb 24 21:18:01 gen-test-01 sshd[5357]: Accepted publickey for andrubr from 10.49.32.26 port 50768 ssh2: RSA SHA256:R/vELFJwo4D4RWcmMH/KdvXjsoR3oAGl4t0NoUKYHsk Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_getpw Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: 
found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_getpw Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_getpw Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_getpw Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_getpw Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0 Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug: Leaving stepd_getpw
And here is where I tried to log in:
13:17:24-andrubr@slurmmaster01:~$ ssh gen-test-01
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe
That looks like pam_slurm_adopt got fairly far in the process. Would you be able to repeat the experiment, but also attach the slurmd log at debug2 or higher? Would you also be able to attach your current slurm.conf?
Here is the slurmd debug log for the tasks:
1. Start an interactive session
07:25:44-andrubr@slurmmaster01:~$ srun -N1 --exclusive --account root --partition gen-test --time 01:00:00 --pty bash
srun: job 785369 queued and waiting for resources
srun: job 785369 has been allocated resources
[andrubr@gen-test-01 ~]$
2. Try to ssh to that node as that user
07:26:00-andrubr@slurmmaster01:~$ ssh gen-test-01
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe
07:26:38-andrubr@slurmmaster01:~$
3. End interactive session
[andrubr@gen-test-01 ~]$ exit
exit
07:26:52-andrubr@slurmmaster01:~$
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: in the service_connection Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Processing RPC: REQUEST_LAUNCH_PROLOG Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785366: ctime:1614266368 revoked:1614266432 expires:1614266552 Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: destroying job 785366 state Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785367: ctime:1614266453 revoked:1614266454 expires:1614266575 Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: destroying job 785367 state Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785368: ctime:1614266466 revoked:1614266511 expires:1614266632 Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: destroying job 785368 state Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: Checking credential with 1984 bytes of sig data Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: _insert_job_state: we already have a job state for job 785369. No big deal, just an FYI.
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: [job 785369] attempting to run prolog [/opt/hpc-admin/slurm/bin/prologue.sh] Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _spawn_prolog_stepd: call to _forkexec_slurmstepd Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: slurmstepd rank 0 (gen-test-01), parent rank -1 (NONE), children 0, depth 0, max_depth 0 Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _spawn_prolog_stepd: return from _forkexec_slurmstepd 0 Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: in the service_connection Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Processing RPC: REQUEST_LAUNCH_TASKS Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: launch task StepId=785369.0 request from UID:10043871 GID:1644000513 HOST:10.49.32.26 PORT:53790 Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: Checking credential with 1984 bytes of sig data Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: Waiting for job 785369's prolog to complete Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: Finished wait for job 785369's prolog to complete Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: slurmstepd rank 0 (gen-test-01), parent rank -1 (NONE), children 0, depth 0, max_depth 0 Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS Feb 25 15:26:52 
gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: in the service_connection Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Start processing RPC: REQUEST_TERMINATE_JOB Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Processing RPC: REQUEST_TERMINATE_JOB Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: _rpc_terminate_job, uid = 912 JobId=785369 Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785369: ctime:1614266756 revoked:0 expires:2147483647 Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: credential for job 785369 revoked Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: found StepId=785369.extern Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: container signal 18 to StepId=785369.extern Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: found StepId=785369.extern Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: container signal 15 to StepId=785369.extern Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: sent SUCCESS Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: found StepId=785369.extern Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: set revoke expiration for jobid 785369 to 1614266932 UTS Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: Waiting for job 785369's prolog to complete Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: Finished wait for job 785369's prolog to complete Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: [job 785369] attempting to run epilog [/opt/hpc-admin/slurm/bin/epilogue.sh] Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: completed epilog for jobid 785369 Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug: JobId=785369: sent epilog complete msg: rc = 0 Feb 25 15:26:52 gen-test-01.fremont.lamrc.net 
slurmd[1140]: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
I'm not seeing the step logs in the included logs. Are you using systemd to spawn the slurmd? If so, you may want to set a "WorkingDirectory" in the unit file (usually just the log directory) and see if the step logs come back. I'm still having no luck reproducing this. Would you please attach your slurm.conf file? Running ssh with "-vv" may also help pin down where it's getting stuck. Thanks, --Tim
Created attachment 18306 [details] slurmd.log

Tim, I ran ssh -vvv and here is the relevant part at the end:

debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 60
debug1: Server accepts key: pkalg rsa-sha2-512 blen 279
debug2: input_userauth_pk_ok: fp SHA256:R/vELFJwo4D4RWcmMH/KdvXjsoR3oAGl4t0NoUKYHsk
debug3: sign_and_send_pubkey: RSA SHA256:R/vELFJwo4D4RWcmMH/KdvXjsoR3oAGl4t0NoUKYHsk
debug3: send packet: type 50
debug3: receive packet: type 52
debug1: Authentication succeeded (publickey).
Authenticated to gen-test-01 ([10.49.38.112]:22).
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug3: send packet: type 90
debug1: Requesting no-more-sessions@openssh.com
debug3: send packet: type 80
debug1: Entering interactive session.
debug1: pledge: network
debug3: send packet: type 1
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe

So it authenticated and then tried to enter an interactive session. I am using systemd and have that logging in debug always.
I captured everything going there from anything slurm when I tried to ssh in: Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default) Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: slurm_conf_init: using config_file=/etc/slurm/slurm.conf Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Reading slurm.conf file: /etc/slurm/slurm.conf Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: Connection by user andrubr: user has only one job 802261 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: _adopt_process: trying to get StepId=802261.extern to adopt 6346 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_add_extern_pid Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_get_x11_display Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: Process 6346 adopted into job 802261 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_getpw Mar 09 16:04:38 
gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_getpw Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_getpw Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_getpw Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_getpw Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0 Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug: Leaving stepd_getpw I also started slurmd with "-vv" as options and have attached that logfile. 
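The journal excerpt above repeats the same few pam_slurm_adopt messages many times. Below is a minimal sketch of how such output could be condensed into per-message counts, assuming the syslog-style layout seen in this thread (timestamp, host, `process[pid]:`, message). The function name `summarize` and the regular expression are illustrative only and are not part of Slurm.

```python
import re
from collections import Counter

# Assumed layout, based on the excerpts in this thread:
#   "Mar 09 16:04:38 host.example proc[1234]: message text"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: "
    r"(?P<msg>.*)$"
)


def summarize(lines):
    """Count how often each (process, message) pair occurs."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that does not fit the assumed layout
        counts[(m.group("proc"), m.group("msg"))] += 1
    return counts
```

Passing an open log file (one entry per line) to `summarize` and printing `most_common()` on the result shows at a glance which messages dominate, for example how many `stepd_getpw` calls were made during one ssh attempt.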
Thank you for the additional logs! So far, it looks like the process at least gets adopted. I'll see if I can get it to reproduce the ssh connection closing before the job ends in a similar place. Thanks again, --Tim
Hi Brian - I wanted to give you an update here, since this bug has been a long-outstanding issue for you and for us. We have devoted substantial resources to tracking this down; however, we have not been able to find or duplicate this issue, which leaves you and us in a difficult situation. There is some suspicion that the setuid()'d sudo command escapes from the cgroup freezer, so when it calls getpwuid() none of the stepd processes claims responsibility for it, and the info is not reported back. Since the upcoming 21.08 release completely revamps cgroup v1, we have a couple of suggestions.

1. Try 21.08 out in a test environment where you can duplicate this and see if you can reproduce. 21.08 also comes with improved logging around cgroups, which may help pinpoint the issue if you can indeed reproduce this on 21.08.

2. If this is not a possibility, or if 21.08 has the same issue, then we can of course look at the logs from 21.08; however, it would be more beneficial to us if you could construct a reproducer with clear steps. What I mean by this is that we have tried on different OSes, with different identity management systems. If you can outline the exact steps and config that lead to this situation, then perhaps we could duplicate it.

Longer term, if we cannot make progress here then I cannot justify allocating more resources to tackle this issue, since it does seem to be unique to your site. Please give this some thought and let us know how you would like to proceed.

Jason Booth
Director of Support
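The suspicion described above involves two things that can be inspected from inside an affected session: which cgroups the process belongs to, and whether a passwd lookup for the user is answered by any NSS source. The following is a minimal sketch, assuming a Linux node with /proc mounted. The helper names are illustrative, and the output does not by itself confirm or rule out the freezer theory.

```python
import os
import pwd


def parse_cgroup_text(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of
    (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path";
        # on cgroup v2 the controller list is empty ("0::/path").
        hier, controllers, path = line.split(":", 2)
        entries.append((hier, controllers, path))
    return entries


def cgroups_of(pid="self"):
    """Read cgroup membership for a pid (default: the calling process)."""
    with open(f"/proc/{pid}/cgroup") as fh:
        return parse_cgroup_text(fh.read())


def lookup_user(uid):
    """Return the login name for uid via the C library's getpwuid(),
    or None if no configured NSS source answers."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def report():
    """Print cgroup membership and the passwd lookup result for this process."""
    for hier, controllers, path in cgroups_of():
        print(f"{hier:>3} {controllers or '(v2)':<20} {path}")
    uid = os.getuid()
    print(f"uid {uid} -> {lookup_user(uid)}")
```

Calling `report()` once in the srun shell and once in an ssh session adopted by pam_slurm_adopt allows the two results to be compared side by side.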
Tim, So I just updated to slurm-21.08 and it looks like things work fine. I was able to start a job on a node, then ssh to that node (so nss_slurm allowed access), then from that session I was able to 'sudo -i' which allowed sudo access for my account because it is part of an AD group which is allowed sudo access. So, whatever updates/changes were made, it made this work as expected. You may close this ticket. Brian Andrus - HPC Systems
Resolving as cannot reproduce.