Ticket 9978 - Cannot use sudo when using nss_slurm
Summary: Cannot use sudo when using nss_slurm
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: nss_slurm
Version: 20.02.5
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Chad Vizino
Reported: 2020-10-12 16:18 MDT by Brian Andrus
Modified: 2023-09-15 13:01 MDT

Site: Lam


Attachments
output from 'strace sudo hostname' (12.52 KB, text/plain)
2020-10-13 12:54 MDT, Brian Andrus
debug patch (2.11 KB, patch)
2020-11-03 07:32 MST, Tim McMullan
bug9978 debug (20.11) (2.19 KB, patch)
2020-11-23 07:20 MST, Tim McMullan
034A7F654D8D43CAAF99D26BA2F3C24B.png (132 bytes, image/png)
2021-02-24 07:50 MST, Brian Andrus
slurmd.log (57.65 KB, application/octet-stream)
2021-03-09 09:09 MST, Brian Andrus

Description Brian Andrus 2020-10-12 16:18:02 MDT
When using nss_slurm to pass group info, I am unable to use sudo on the node. I get:
sudo: PAM account management error: Authentication service cannot retrieve authentication info

This may be a case of "you can't do that"
Situation:
Nodes are domain-joined with realmd
sssd is used for caching

Our HPC-ADMINS group is part of the domain groups, so it is in sudoers as:
%FREMONT\\HPC-ADMINS ALL = (ALL) NOPASSWD: ALL

When I get on a node with srun, I can see all my uid/gids are correct.
I can also do 'newgrp' to make hpc-admins my primary group, however any attempt to use sudo results in the error message above.

from nsswitch.conf:
passwd:     slurm files sss
shadow:     slurm files sss
group:      slurm files sss

If I remove slurm from the passwd section (or put it after sss), it will work.
Comment 1 Tim McMullan 2020-10-13 12:10:04 MDT
Would you be able to elaborate on what situation you are calling sudo in?  Is this from an ssh session to the node or is it an "srun sudo" kind of situation?

In either case, would you be able to get an strace of sudo running something innocuous (eg sudo echo testing) as root on a node and attach the trace?

Thanks!
--Tim
Comment 2 Brian Andrus 2020-10-13 12:53:26 MDT
This particular scenario is an interactive bash session:

srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty bash

After getting a console, I try something as simple as:

$ sudo hostname
sudo: PAM account management error: Authentication service cannot retrieve authentication info

I ran strace sudo hostname and have attached that output.
It doesn't seem to provide anything of use. I get the same output when I run it on a system without nss_slurm, although the command is successful.
Comment 3 Brian Andrus 2020-10-13 12:54:13 MDT
Created attachment 16209
output from 'strace sudo hostname'
Comment 8 Tim McMullan 2020-10-16 06:43:21 MDT
Hi!

I've been doing some digging on this and been trying to reproduce, but sudo has been working for me so far (though my setup isn't currently dealing with SSSD).  I do have a couple questions though that might help me track it down.  What OS are you running?  From the slurm controller as root do both "getent passwd $username" and "getent shadow $username" return something? And do you have selinux configured/running?

Thanks!
--Tim
Comment 10 Brian Andrus 2020-10-16 08:02:25 MDT
Tim,
I am guessing you mean to use my username for $username; otherwise that is not set and it returns everyone in the /etc/passwd and /etc/shadow files (enumeration is disabled in Active Directory, so you can't get everyone).
So for my account it shows:

07:00:41-root@slurmmaster01:~$ getent passwd andrubr
andrubr:*:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
07:00:45-root@slurmmaster01:~$ getent shadow  andrubr
07:00:51-root@slurmmaster01:~$

(Active Directory does not do shadow)

And selinux is disabled throughout the cluster.

Comment 11 Tim McMullan 2020-10-20 06:20:27 MDT
Hi Brian,

Would you be able to check in /var/log/auth.log after a failed sudo attempt for an error like "error retrieving Slurm step info"?

Thanks!
-Tim
Comment 12 Brian Andrus 2020-10-20 08:16:29 MDT
I have no /var/log/auth.log, but I do see this in /var/log/secure and the journal:
Oct 20 14:11:26 gen-test-01 sudo: andrubr : PAM account management error: Authentication service cannot retrieve authentication info ; TTY=pts/0 ; PWD=/opt/hpc-admin/scripts ; USER=root ; COMMAND=/bin/hostname

My entire slurmd.log for the session is:

[2020-10-20T14:09:55.119] Message aggregation disabled
[2020-10-20T14:09:55.133] topology NONE plugin loaded
[2020-10-20T14:09:55.141] route default plugin loaded
[2020-10-20T14:09:55.143] CPU frequency setting not configured for this node
[2020-10-20T14:09:55.184] Munge credential signature plugin loaded
[2020-10-20T14:09:55.186] slurmd version 20.02.5 started
[2020-10-20T14:09:55.281] slurmd started on Tue, 20 Oct 2020 14:09:55 +0000
[2020-10-20T14:09:55.282] CPUs=2 Boards=1 Sockets=1 Cores=2 Threads=1 Memory=3950 TmpDisk=29703 Uptime=61 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2020-10-20T14:10:50.577] _run_prolog: run job script took usec=350
[2020-10-20T14:10:50.577] _run_prolog: prolog with lock for job 587138 ran for 0 seconds
[2020-10-20T14:10:50.653] [587138.extern] Munge credential signature plugin loaded
[2020-10-20T14:10:52.869] launch task 587138.0 request from UID:10043871 GID:1644000513 HOST:10.49.32.26 PORT:20660
[2020-10-20T14:10:52.887] [587138.0] Munge credential signature plugin loaded
[2020-10-20T14:10:52.898] [587138.0] in _window_manager
[2020-10-20T14:10:52.900] [587138.0] debug level = 2
[2020-10-20T14:10:52.901] [587138.0] starting 1 tasks
[2020-10-20T14:10:52.902] [587138.0] task 0 (2392) started 2020-10-20T14:10:52
[2020-10-20T14:15:44.789] [587138.0] task 0 (2392) exited with exit code 1.
[2020-10-20T14:15:44.858] [587138.0] done with job
[2020-10-20T14:15:44.861] [587138.extern] Sent signal 18 to 587138.4294967295
[2020-10-20T14:15:44.861] [587138.extern] Sent signal 15 to 587138.4294967295
[2020-10-20T14:15:44.866] [587138.extern] done with job


Comment 13 Tim McMullan 2020-10-28 11:52:01 MDT
Hi Brian,

Unfortunately these logs and further attempts at reproducing haven't really tracked down a good reason for this yet.

Just to confirm that nss_slurm and sss are both doing what we expect them to, would you be able to run the following:

srun bash -c 'for i in slurm files sss; do echo "$i passwd: $(getent -s $i passwd $USER)"; echo "$i group: $(getent -s $i group $USER)"; done'

The goal is to make sure that nss_slurm, files, and sss are all returning sane values under normal conditions.

Would you also be able to tell me what OS you are running, and if you are experiencing this in srun, sbatch, or both?

Thanks,
--Tim
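
For readers who want to reuse the per-source check above, here is a minimal sketch of the same idea as a shell function. The names check_nss_sources and LOOKUP_CMD are hypothetical and not part of Slurm; the only real tool assumed is getent with its -s option, which restricts a lookup to one NSS service.

```shell
# Sketch only: query each NSS source separately for one key.
# check_nss_sources and LOOKUP_CMD are illustrative names (assumptions).
LOOKUP_CMD="${LOOKUP_CMD:-getent}"

check_nss_sources() {
    # usage: check_nss_sources DATABASE KEY SOURCE...
    local database="$1" key="$2" source result
    shift 2
    for source in "$@"; do
        # getent -s SOURCE limits the lookup to a single NSS service;
        # a missing entry is not treated as a fatal error here.
        result="$("$LOOKUP_CMD" -s "$source" "$database" "$key" 2>/dev/null || true)"
        printf '%s %s: %s\n' "$source" "$database" "${result:-<no entry>}"
    done
}
```

Inside a job step this could be called as: check_nss_sources passwd "$USER" slurm files sss. An empty result for a source is printed as "<no entry>" so it is easier to tell apart from a blank line.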
Comment 14 Brian Andrus 2020-10-29 10:05:19 MDT
I added '-n 1' to the srun command since it was running on all the cores otherwise :)

[andrubr@gen-test-01 ~]$ srun -n1 bash -c 'for i in slurm files sss; do echo "$i passwd: $(getent -s $i passwd $USER)"; echo "$i group: $(getent -s $i group $USER)"; done'
slurm passwd: andrubr:x:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
slurm group: 
files passwd: 
files group: 
sss passwd: andrubr:*:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
sss group: 

This is as expected. My default group is not my name ("Domain Users") and my account is in Active Directory, so nothing in files.

When I change the group to be Domain\ Users, I get:
[andrubr@gen-test-01 ~]$ srun -n1 bash -c 'for i in slurm files sss; do echo "$i passwd: $(getent -s $i passwd $USER)"; echo "$i group: $(getent -s $i group Domain\ Users)"; done'
slurm passwd: andrubr:x:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
slurm group: 
files passwd: 
files group: 
sss passwd: andrubr:*:10043871:1644000513:Andrus, Brian:/home/andrubr:/bin/bash
sss group: domain users:*:1644000513:andrubr,bd4adm,cd4adm,gd4adm,sapserviceatp,sapserviceatq,huangxi1,sybsjp,sybbv4,sybsjq,c4padm,at2adm,atqadm,ad2adm,atdadm,atpadm,atxadm,svc_ibpadmin,migtest01,svc_ibpagent,svc_covbuilds,hylinca,bq5adm,monahde,sc4adm,chakaiv,vmartirosyan,sybsep,sybseq,sybsed,sepadm,seqadm,sedadm,qa5adm,asoussou,rhathwar,liuje2,sybids,idsadm,tdesrues,scalderon,jmcintyre,qdtadm,lcarver,jhuang,jlapidas,welham,gaosu,sxliu,lindas,ken,dfried,bvandyk,bvincent,yhan,yyan,vliberman,vhuber,vallampalli,tomk,srouvillois,rmiller,rramachandran,orenaud,makbulut,mlee,mlevin,matt,jlehto,jasselot,ifavorskiy,dfox,dsieger,dfaken,aparent,apap,akunwar,asinding,alevin,cq2adm,gq2adm,qa2adm,pisadm,walshpe,bwpadm,sybbjp,cexadm,cd2adm,clearcase_aldb,bq2adm,sapadm,edide2,qa3adm,sybbjd,bd2adm,bdvadm,gd2adm,axqadm,axpadm,axdadm,bjdadm,crdadm,cq4adm,bq4adm,sybbdv,wddadm,prdadm,slqadm,sybbqv,smdadm,cedadm,bwqadm,ediqa2,ceqadm,gq4adm,axsadm,sapservicewdd,codadm,bjpadm,de2adm,qa4adm,crpadm,bjqadm,crqadm,cepadm,slxadm,bqvadm,ediprd,lq4adm,bwdadm,gtdadm,gtqadm,ld2adm,bv4adm,gtpadm,sapserviceslx,sapserviceslq,qasadm,devadm,coqadm

So, it may be group membership for nss_slurm that is the issue.
When I do an id on myself, I have:
[andrubr@gen-test-01 ~]$ id andrubr
uid=10043871(andrubr) gid=1644000513(domain users) groups=1644000513(domain users),1644179569(distlist-all lam - fremont (contractors temps only)),10097910(nobody),1644183490(vpn_it_contractors_and_consultants),1644068025(nobody),1644067776(gis all (na and international)),1644065583(nobody),10054405(ad-confluence-external),1644065584(nobody),10059363(semulator3d-mupm-users),10054720(ad-bitbucket-external),1644043879(all lam fremont users),1644063358(gis all north america),10041742(semu_internal_users),1644195010(endar major events),10054404(ad-jira-external),10063511(nobody),10041696(confluence-users),1644068121(nobody),10041493(nobody),10079848(nobody),10084119(lamrc_csc_read),10049453(hpc-admins),10041693(jira-users),10030561(semu_users),10037519(hpc-mapd),1644068028(nobody),1644075644(nobody),1644179870(recordedmeeting_change),1644077329(mb - gis calendar),1644043888(nobody),1644043833(nobody),10008343(nobody),1644075673(messagestats web),1644015988(cond_rd read),1644075642(citrix cellfusion access),1644079866(citrix sapgui access),1644149504(fs_30010r_mfg_photoarchive),1644165789(fs_51010r),1644079871(citrix word access),1644071400(#intranet - global read),1644051065(fs_public),1644164176(fs_30010r_tool_data_logging),1644071402(#intranet - phonebook),1644079868(citrix ncr access),1644079869(citrix ie8 apps access),10004486(gis_test_01_read),1644021511(10061c),1644079867(citrix be access),1644079872(citrix excel access)

Hope that helps
Comment 15 Tim McMullan 2020-10-30 09:37:25 MDT
Thanks Brian!  The group entries are case sensitive, so it's possible that "getent -s $i group Domain\ Users" just isn't matching, since SSSD seems to return it as lowercase "domain users".

Can you try just:

srun -n1 getent -s slurm group

and see what that returns?  It *should* be a list of all the groups you are a member of.

I'm working on a debug patch that might help track this down better, particularly if getent -s slurm group doesn't return reasonable results.

Something else to note: you don't need to change shadow in nsswitch.conf; we don't actually implement anything to ship shadow around.  I've seen some weird bugs with sudo and shadow being odd, so it might be worth changing that back to just "files sss".

Thanks, and sorry about all the back and forth!
--Tim
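
To check whether a group name differs from the returned entries only by case, a small filter over group-format output can help. This is a sketch under assumptions: find_group_ci is a hypothetical helper name, and it only compares the first colon-separated field of each input line, ignoring case.

```shell
# Sketch only: case-insensitive search for a group name in group(5)-style
# lines read from stdin. find_group_ci is an illustrative name (assumption).
find_group_ci() {
    # The wanted name is passed through the environment so that awk does
    # not interpret backslashes in names such as DOMAIN\group.
    WANT="$1" awk -F: '
        tolower($1) == tolower(ENVIRON["WANT"]) { print; found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

For example: srun -n1 getent -s slurm group | find_group_ci 'Domain Users'. A matching line is printed as returned by the source, so any difference in case from the name that was asked for is visible directly.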
Comment 16 Brian Andrus 2020-10-30 10:38:36 MDT
Output as requested:

[andrubr@gen-test-01 ~]$ srun -n1 getent -s slurm group
domain users:x:1644000513:andrubr
distlist-all lam - fremont (contractors temps only):x:1644179569:andrubr
vpn_it_contractors_and_consultants:x:1644183490:andrubr
nobody:x:1644068025:andrubr
gis all (na and international):x:1644067776:andrubr
nobody:x:1644065583:andrubr
ad-confluence-external:x:10054405:andrubr
nobody:x:1644065584:andrubr
semulator3d-mupm-users:x:10059363:andrubr
ad-bitbucket-external:x:10054720:andrubr
all lam fremont users:x:1644043879:andrubr
gis all north america:x:1644063358:andrubr
semu_internal_users:x:10041742:andrubr
endar major events:x:1644195010:andrubr
ad-jira-external:x:10054404:andrubr
nobody:x:10063511:andrubr
confluence-users:x:10041696:andrubr
nobody:x:1644068121:andrubr
nobody:x:10041493:andrubr
nobody:x:10079848:andrubr
lamrc_csc_read:x:10084119:andrubr
hpc-admins:x:10049453:andrubr
jira-users:x:10041693:andrubr
semu_users:x:10030561:andrubr
hpc-mapd:x:10037519:andrubr
nobody:x:1644068028:andrubr
nobody:x:1644075644:andrubr
recordedmeeting_change:x:1644179870:andrubr
mb - gis calendar:x:1644077329:andrubr
nobody:x:1644043888:andrubr
nobody:x:1644043833:andrubr
nobody:x:10008343:andrubr
messagestats web:x:1644075673:andrubr
cond_rd read:x:1644015988:andrubr
citrix cellfusion access:x:1644075642:andrubr
citrix sapgui access:x:1644079866:andrubr
fs_30010r_mfg_photoarchive:x:1644149504:andrubr
fs_51010r:x:1644165789:andrubr
citrix word access:x:1644079871:andrubr
#intranet - global read:x:1644071400:andrubr
fs_public:x:1644051065:andrubr
fs_30010r_tool_data_logging:x:1644164176:andrubr
#intranet - phonebook:x:1644071402:andrubr
citrix ncr access:x:1644079868:andrubr
citrix ie8 apps access:x:1644079869:andrubr
gis_test_01_read:x:10004486:andrubr
10061c:x:1644021511:andrubr
citrix be access:x:1644079867:andrubr
citrix excel access:x:1644079872:andrubr

Looks good, I think. Only thing to note is that everything is lowercase.
I removed slurm from the shadow line in nsswitch.conf as well.

Of note: I also tried changing sudoers to just have my name with the same errors.
However, if I ssh to the node as root and then sudo to myself and run 'sudo hostname' it does work. I imagine that is because that session is not attached to the job.


Comment 17 Tim McMullan 2020-11-03 07:32:00 MST
Created attachment 16472
debug patch

Hi Brian,

That all does look good, and I am very much wondering if case is an issue here... though I found your note at the end particularly interesting.

I've attached a debug patch that will be a little more verbose on the command line about what user nss_slurm found and how the internal functions got called.  This is just a patch to nss_slurm.

To get the output we need to see what it is doing, I needed to wrap the sudo command a bit with "srun -n1 bash -c 'sudo hostname'"

Let me know if you get the chance to try this!

--Tim
Comment 18 Brian Andrus 2020-11-05 19:23:22 MST
Ok, I got it built/installed and ran both with "bash -c 'sudo hostname'" and in a "--pty bash" interactive session.
Results for both below:

18:18:50-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test bash -c 'sudo hostname'
_pw_internal: uid:10043871 name:(null)
_pw_internal: user (null)(10043871) not found in Job:616232 Step:-1
_pw_internal: user andrubr(10043871) found in Job:616232 Step:0
_pw_internal: uid:-2 name:andrubr
_pw_internal: user andrubr(-2) not found in Job:616232 Step:-1
_pw_internal: user andrubr(10043871) found in Job:616232 Step:0
_gr_internal: gid:-2 name:(null)
_gr_internal: no groups found in Job:616232 Step:-1
_gr_internal: groups found in Job:616232 Step:0
_gr_internal: gid:-2 name:FREMONT\HPC-ADMINS
_gr_internal: no groups found in Job:616232 Step:-1
_gr_internal: no groups found in Job:616232 Step:0
_gr_internal: could not find groups in any step
_pw_internal: uid:-2 name:andrubr
_pw_internal: user andrubr(-2) not found in Job:616232 Step:-1
_pw_internal: user andrubr(10043871) found in Job:616232 Step:0
_pw_internal: uid:-2 name:andrubr
_pw_internal: user andrubr(-2) not found in Job:616232 Step:-1
_pw_internal: user andrubr(10043871) found in Job:616232 Step:0
sudo: PAM account management error: Authentication service cannot retrieve authentication info
srun: error: gen-test-01: task 0: Exited with exit code 1


18:19:18-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty bash
_pw_internal: uid:10043871 name:(null)
_pw_internal: user andrubr(10043871) found in Job:616233 Step:0
_gr_internal: gid:1644000513 name:(null)
_gr_internal: groups found in Job:616233 Step:0
_pw_internal: uid:10043871 name:(null)
_pw_internal: user andrubr(10043871) found in Job:616233 Step:0
[andrubr@gen-test-01 ~]$ sudo hostname
_pw_internal: uid:10043871 name:(null)
_pw_internal: user andrubr(10043871) found in Job:616233 Step:0
_pw_internal: uid:-2 name:andrubr
_pw_internal: user andrubr(10043871) found in Job:616233 Step:0
_gr_internal: gid:-2 name:(null)
_gr_internal: groups found in Job:616233 Step:0
_gr_internal: gid:-2 name:FREMONT\HPC-ADMINS
_gr_internal: no groups found in Job:616233 Step:0
_gr_internal: no groups found in Job:616233 Step:-1
_gr_internal: could not find groups in any step
_pw_internal: uid:-2 name:andrubr
_pw_internal: user andrubr(10043871) found in Job:616233 Step:0
_pw_internal: uid:-2 name:andrubr
_pw_internal: user andrubr(10043871) found in Job:616233 Step:0
sudo: PAM account management error: Authentication service cannot retrieve authentication info
[andrubr@gen-test-01 ~]$ exit
srun: error: gen-test-01: task 0: Exited with exit code 1


Comment 19 Tim McMullan 2020-11-12 12:21:12 MST
Thank you for the output!

I'm looking it over along with the nss_slurm and sudo source to see if I can determine what is going on here, but it looks like nss_slurm is finding everything it should be.  Looking up FREMONT\HPC-ADMINS should fail in nss_slurm, but it should fall back to sss to find it.

Right now, it seems a lot like all the information is there and retrievable, so it's not very clear what is causing PAM to error out like that.

The only other thing I think I might need is /etc/pam.d/sudo and any file it may include to rule out something in PAM going on.

Thanks for your patience with this, it doesn't seem clear what is preventing this from working.
--Tim
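
The fallback described above (a miss in one source moves on to the next, the first hit wins) can be approximated from the shell by walking the sources in nsswitch.conf order. This is only a sketch: first_nss_match and LOOKUP_CMD are hypothetical names, and it imitates the default NSS behavior with getent -s rather than calling the resolver the way sudo or PAM would.

```shell
# Sketch only: report which NSS source answers first for a key.
# first_nss_match and LOOKUP_CMD are illustrative names (assumptions).
LOOKUP_CMD="${LOOKUP_CMD:-getent}"

first_nss_match() {
    # usage: first_nss_match DATABASE KEY SOURCE...
    local database="$1" key="$2" source entry
    shift 2
    for source in "$@"; do
        entry="$("$LOOKUP_CMD" -s "$source" "$database" "$key" 2>/dev/null || true)"
        if [ -n "$entry" ]; then
            # First source with an entry wins, as with the default
            # "continue on not found, return on success" NSS actions.
            printf '%s: %s\n' "$source" "$entry"
            return 0
        fi
    done
    return 1
}
```

For example, inside a job step: first_nss_match group 'FREMONT\HPC-ADMINS' slurm files sss. If the function returns nonzero, none of the listed sources had an entry for that exact key.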
Comment 20 Brian Andrus 2020-11-12 12:31:27 MST
I don’t see much there, but here they are:

[root@gen-f16-39 resource]# cat /etc/pam.d/sudo
#%PAM-1.0
auth       include      system-auth
account    include      system-auth
password   include      system-auth
session    optional     pam_keyinit.so revoke
session    include      system-auth
[root@gen-f16-39 resource]# cat /etc/pam.d/system-auth
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        required      pam_faildelay.so delay=2000000
auth        sufficient    pam_fprintd.so
auth        [default=1 ignore=ignore success=ok] pam_succeed_if.so uid >= 1000 quiet
auth        [default=1 ignore=ignore success=ok] pam_localuser.so
auth        sufficient    pam_unix.so nullok try_first_pass
auth        requisite     pam_succeed_if.so uid >= 1000 quiet_success
auth        sufficient    pam_sss.so forward_pass
auth        required      pam_deny.so

account     required      pam_unix.so
account     sufficient    pam_localuser.so
account     sufficient    pam_succeed_if.so uid < 1000 quiet
account     [default=bad success=ok user_unknown=ignore] pam_sss.so
account     required      pam_permit.so

password    requisite     pam_pwquality.so try_first_pass local_users_only retry=3 authtok_type= ucredit=-1 lcredit=-1 dcredit=-1 ocredit=-1
password    sufficient    pam_unix.so sha512 shadow nullok try_first_pass use_authtok
password    sufficient    pam_sss.so use_authtok
password    required      pam_deny.so

session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
#-session     optional      pam_systemd.so
session     optional      pam_oddjob_mkhomedir.so umask=0077
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      pam_unix.so
session     optional      pam_sss.so

Comment 21 Brian Andrus 2020-11-17 19:30:23 MST
So I just upgraded to slurm 20.11.0
Now the nss_slurm does not work at all:

$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test bash -c 'sudo hostname'
sudo: PAM account management error: Authentication service cannot retrieve authentication info
srun: error: gen-test-01: task 0: Exited with exit code 1


If I remove 'slurm' from the passwd line in nsswitch.conf, it works.
Comment 22 Tim McMullan 2020-11-19 07:19:08 MST
(In reply to Brian Andrus from comment #21)
> So I just upgraded to slurm 20.11.0
> Now the nss_slurm does not work at all:
> 
> $ srun --nodes=1 --exclusive --account=root --time=24:00:00
> --partition=gen-test bash -c 'sudo hostname'
> sudo: PAM account management error: Authentication service cannot retrieve
> authentication info
> srun: error: gen-test-01: task 0: Exited with exit code 1
> 
> 
> If I remove 'slurm' from the passwd line in nsswitch.conf, it works.

I just tested if nss_slurm works in general with 20.11 and it seems to.  I noticed in your command that you ran "sudo hostname" which I think was the initial problem.  Does something like "getent -s slurm passwd" return correctly from srun or sbatch?
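A minimal sketch of that check, assuming a glibc `getent` that supports `-s` to restrict a lookup to one NSS source. The helper name `nss_dump` is illustrative and not part of Slurm; it only prints what a single source returns for the passwd and group databases, so the output of the `slurm` source can be compared with `files` or `sss`.

```shell
#!/bin/sh
# nss_dump SOURCE
# Print the passwd and group entries returned by one NSS source only,
# bypassing the lookup order configured in nsswitch.conf.
nss_dump() {
    src="$1"
    for db in passwd group; do
        out=$(getent -s "$src" "$db")
        # Count non-empty lines; an empty result yields 0.
        count=$(printf '%s\n' "$out" | grep -c .)
        echo "== $db via $src: $count entries =="
        if [ -n "$out" ]; then
            printf '%s\n' "$out"
        fi
    done
}
```

From a submit host the same lookups can also be issued directly inside a job step, for example `srun --nodes=1 --partition=gen-test getent -s slurm passwd` (partition name taken from this ticket).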
Comment 23 Brian Andrus 2020-11-19 15:42:09 MST
Yep, that was me just cutting and pasting.
So the same errors still occur under the same circumstances.


Comment 24 Tim McMullan 2020-11-20 08:49:33 MST
Thanks Brian, I just wanted to make sure we were in basically the same place!
Comment 25 Tim McMullan 2020-11-20 11:52:57 MST
I was looking through all this again, and I was wondering if you could try just using or adding "%hpc-admins ALL = (ALL) NOPASSWD: ALL" in the sudoers file. (+ instead of % may be required, depending on the exact versions of sudo/sssd at play).

It looks like the group is being properly displayed as one you are a member of, and it looks like nss_slurm knows that, or is at least able to get that information.  In a recent enough version of SSSD, it should actually do a bunch of the import work for you and it might be the only line required.

Let me know what you think!
--Tim
Comment 26 Brian Andrus 2020-11-20 12:25:28 MST
Hmm. I cannot get a pty anymore after the update to slurm 20.11.0:
11:20:53-andrubr@slurmmaster01:/$  srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
srun: error: gen-test-01: task 0: Segmentation fault

But I do still get the authentication error if I try to sudo directly:
11:20:31-andrubr@slurmmaster01:/$  srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test sudo hostname
sudo: PAM account management error: Authentication service cannot retrieve authentication info
srun: error: gen-test-01: task 0: Exited with exit code 1

If I remove ‘slurm’ from nsswitch.conf, I can get the pty:
11:23:32-andrubr@slurmmaster01:/$  srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
[andrubr@gen-test-01 /]$


FWIW, I am still able to use sudo with that change to the sudoers file, just not under the nss_slurm environment.


Comment 27 Tim McMullan 2020-11-20 13:03:03 MST
(In reply to Brian Andrus from comment #26)
> Hmnm. I cannot get a pty anymore with the updated slurm 20.11.0 update:
> 11:20:53-andrubr@slurmmaster01:/$  srun --nodes=1 --exclusive --account=root
> --time=24:00:00 --partition=gen-test --pty /bin/bash
> srun: error: gen-test-01: task 0: Segmentation fault

I certainly wouldn't expect that to segfault!  Would you be able to open a new bug with the slurmd logs (at debug4) and a backtrace of the slurmstepd (if possible)?  This is largely just to not muddy this ticket with what appears to be a new/different problem.
 
> But I do still get the authentication error if I try to sudo directly:
> 11:20:31-andrubr@slurmmaster01:/$  srun --nodes=1 --exclusive --account=root
> --time=24:00:00 --partition=gen-test sudo hostname
> sudo: PAM account management error: Authentication service cannot retrieve
> authentication info
> srun: error: gen-test-01: task 0: Exited with exit code 1

Can we do a quick sanity check on this version that "getent -s slurm passwd" and "getent -s slurm group" still return the same as they did before?  If they differ please attach the new one, but if they are effectively the same just let me know.  I'm mostly interested in making sure nss_slurm still ships the same "hpc-admins" line in groups.

> FWIW, I am able to use sudo still with that change to the sudoers, I just
> can’t be under the nss_slurm environment

Thank you, that's useful information on this too!
Comment 28 Tim McMullan 2020-11-23 07:20:26 MST
Created attachment 16774 [details]
bug9978 debug (20.11)

I've attached a new version of the debug patch that should work with 20.11

Would you be able to try this, with the debug patch and the sudoers change I suggested before?

I'd like to see if being able to look that group up with nss_slurm changes the result in any significant way.  In particular, while looking things over I noticed that we didn't log any attempts to look up the root user, and I'd like to see if those attempts will appear with a (likely) successful group lookup.

Thanks!
--Tim
Comment 29 Brian Andrus 2020-12-02 13:05:35 MST
Ok, I built and tried the patch.
If I add 'slurm' to nsswitch.conf, I get segfaults:

12:02:20-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
srun: error: gen-test-01: task 0: Segmentation fault
_pw_internal: uid:10043871 name:(null)


It errors even if I try to ssh to the box as root:
12:02:42-root@slurmmaster01:~$ ssh gen-test-01
Last login: Wed Dec  2 19:59:17 2020 from 10.49.32.26
_gr_internal: gid:-2 name:(null)
_gr_internal: could not find groups in any step
[root@gen-test-01 ~]#
Comment 30 Tim McMullan 2020-12-07 13:24:01 MST
Hey Brian,

We've been working on the segfault issue and think we have the problem tracked down.  We are working on an acceptable solution for it!

Just wanted to update you on progress!
Thanks,
--Tim
Comment 31 Tim McMullan 2020-12-14 05:47:39 MST
A fix for the segfault issue should be included in 20.11.1.  Please let me know if this resolves the issue for you and we can keep looking at why sudo doesn't work :)

Thanks!
--Tim
Comment 32 Brian Andrus 2020-12-14 14:43:01 MST
Looks happier as far as the segfault.
Back to the same old issue:

13:42:06-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test bash -c 'sudo hostname'
sudo: PAM account management error: Authentication service cannot retrieve authentication info
srun: error: gen-test-01: task 0: Exited with exit code 1
13:42:09-andrubr@slurmmaster01:~$ srun --nodes=1 --exclusive --account=root --time=24:00:00 --partition=gen-test --pty /bin/bash
[andrubr@gen-test-01 ~]$ sudo hostname
sudo: PAM account management error: Authentication service cannot retrieve authentication info
[andrubr@gen-test-01 ~]$ rpm -q slurm
slurm-20.11.1-1.el7.x86_64
[andrubr@gen-test-01 ~]$ exit

Brian

Comment 33 Tim McMullan 2021-01-27 09:01:54 MST
Hey Brian,

I just wanted to let you know I'm still looking into this.  I've been looking for different possibilities and I'm really wondering if there is something funny going on in sssd/nss land.

Would you be able to share your sssd config as well?

Thanks,
--Tim
Comment 34 Brian Andrus 2021-02-12 15:38:01 MST
So I think I am on to something that may affect this as well.

It seems you cannot use pam_slurm_adopt AND nss_slurm

pam_slurm_adopt is unable to identify any user that has a job running under credentials from nss_slurm.

I suspect that is the same thing that is happening with sudo.
Comment 35 Tim McMullan 2021-02-16 07:05:50 MST
(In reply to Brian Andrus from comment #34)
> So I think I am on to something that may affect this as well.
> 
> It seems you cannot use pam_slurm_adopt AND nss_slurm
> 
> pam_slurm_adopt is unable to identify any user that has a job running under
> credentials from nss_slurm.
> 
> I suspect that is the same thing that is happening with sudo.

That's an interesting find, I didn't realize that was part of the equation! I'll see what adding that into the mix does.

Thanks!
--Tim
Comment 36 Tim McMullan 2021-02-24 07:33:44 MST
> pam_slurm_adopt is unable to identify any user that has a job running under credentials from nss_slurm.

When you say this, do you mean that nss_slurm is the only source of that user's details, or is sssd on that system as well?
Comment 37 Brian Andrus 2021-02-24 07:50:37 MST
Created attachment 18083 [details]
034A7F654D8D43CAAF99D26BA2F3C24B.png

In testing, slurm and files were the only source of details configured in nsswitch.conf.
If I add sss back in, things can work, but that defeats the need for nss_slurm.

Comment 38 Tim McMullan 2021-02-24 08:47:06 MST
nss_slurm isn't meant to be a full replacement of the central authentication.  It will only respond with user/group information if the process is considered part of the "job".  In the case of pam_slurm_adopt, the session would already need to be adopted.  pam_slurm_adopt will have to check once against the central authentication, then find the job it can be adopted into, after which any requests will be served by nss_slurm.

The only way this would impact sudo is if the sudo process was somehow escaping the job cgroup so it wouldn't be able to get the group information.

We can confirm that the sudo process is in the cgroup (I think we accomplished this with the debug patch for nss_slurm, but we can check this way too):
On a test node:

  1.  Make sudo require a password.
  2.  Via srun --pty on the node, run sudo hostname.
  3.  In another shell, log in to the node as root.
  4.  Get the pid of the sudo process.
  5.  Confirm that pid is in the cgroup for the job, e.g.:

grep 12354 /sys/fs/cgroup/memory/slurm/uid_1000/job_185/step_0/task_0/tasks
12354

If the pid is there, sudo isn't escaping so it should get the group info correctly.
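
The check above can be sketched as a small helper, assuming a cgroup v1 layout where Slurm keeps per-step `tasks` files under a `slurm` directory. The function name `find_pid_in_cgroups` is hypothetical. Because the `slurm` hierarchy may sit under a different controller than `memory` (for example `freezer`), the sketch searches every controller below the given root rather than one fixed path.

```shell
#!/bin/sh
# find_pid_in_cgroups PID [ROOT]
# Print every cgroup v1 "tasks" file under ROOT (default /sys/fs/cgroup)
# whose path contains "slurm" and which lists PID on a line by itself.
find_pid_in_cgroups() {
    pid="$1"
    root="${2:-/sys/fs/cgroup}"
    find "$root" -type f -name tasks -path '*slurm*' 2>/dev/null |
    while IFS= read -r f; do
        # -x matches the whole line, so 1235 does not match 12354.
        if grep -qx -- "$pid" "$f" 2>/dev/null; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run as root on the node while sudo is waiting at the password prompt, e.g. `find_pid_in_cgroups 12354`. A match in a `step_*` directory suggests the process is still inside the job step; no output at all would suggest it has escaped.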

I've been doing my recent tests on a system with pam_slurm_adopt, nss_slurm, and auth done with ldapd (sssd is not cooperating yet, so that is still in progress).
Comment 39 Brian Andrus 2021-02-24 09:23:36 MST
Ok,
That path doesn't have a 'slurm' directory, but there is one in /sys/fs/cgroup/freezer/slurm/
Inside there, I have /sys/fs/cgroup/freezer/slurm/uid_10043871/job_782724/step_0/tasks, which does have the pid of sudo hostname.
There is also /sys/fs/cgroup/freezer/slurm/uid_10043871/job_782724/tasks, which does not have that pid.


Comment 40 Brian Andrus 2021-02-24 09:44:30 MST
A new datapoint:

So my nsswitch.conf has:
passwd:     slurm files sss
group:      slurm files sss

I started an interactive job on the node and then tried to ssh to the node. I get an error:
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe

If I remove slurm from nsswitch.conf, OR remove pam_slurm_adopt, it works as expected (so either one alone, or neither, works, but not both together).
This is using slurm 20.11.4
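
When comparing configurations like the one above, it can help to print the effective source order for a database. glibc tries the sources left to right and, by default, moves on to the next one when a source does not find the entry. This is a minimal sketch; the helper name `nss_sources` is hypothetical, and it ignores nothing but comments, so bracketed action items such as `[NOTFOUND=return]` are printed as-is.

```shell
#!/bin/sh
# nss_sources DATABASE [FILE]
# Print the sources configured for DATABASE (e.g. passwd, group)
# in lookup order, read from FILE (default /etc/nsswitch.conf).
nss_sources() {
    db="$1"
    file="${2:-/etc/nsswitch.conf}"
    awk -v db="$db:" '
        { sub(/#.*/, "") }                 # strip trailing comments
        $1 == db {
            out = ""
            for (i = 2; i <= NF; i++)
                out = out (i > 2 ? " " : "") $i
            print out
        }
    ' "$file"
}
```

With the nsswitch.conf shown in this comment, `nss_sources passwd` would print `slurm files sss`.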

Comment 41 Tim McMullan 2021-02-24 10:09:40 MST
Can you enable debug logging for pam_slurm_adopt and attach the logs?

You should be able to find them in the authentication logs with the other pam logs.

A line like this in the pam config should do it:
-account		required	pam_slurm_adopt.so log_level=debug5

Thanks,
-Tim
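
At debug5 the module logs many repetitive "found StepId" lines, so a small filter can make the adoption decision easier to spot. This is a sketch only: the helper name `adopt_log` is hypothetical, and the log location is an assumption (on EL7-style systems PAM messages usually land in /var/log/secure), so the file is passed as an argument.

```shell
#!/bin/sh
# adopt_log FILE
# Print pam_slurm_adopt messages from an authentication log, hiding the
# repetitive debug4 "found StepId" lines.
adopt_log() {
    grep 'pam_slurm_adopt\[' "$1" | grep -v 'debug4: found StepId='
}
```

Example: `adopt_log /var/log/secure | tail -n 20` right after a failed ssh attempt.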
Comment 42 Brian Andrus 2021-02-24 14:19:32 MST
Enabled debug logging.
With pam_slurm_adopt and nss_slurm enabled here is what I get in the log:

Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default)
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  slurm_conf_init: using config_file=/etc/slurm/slurm.conf
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: Connection by user andrubr: user has only one job 783074
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  _adopt_process: trying to get StepId=783074.extern to adopt 5357
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_add_extern_pid
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_get_x11_display
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: Process 5357 adopted into job 783074
Feb 24 21:18:01 gen-test-01 sshd[5357]: Accepted publickey for andrubr from 10.49.32.26 port 50768 ssh2: RSA SHA256:R/vELFJwo4D4RWcmMH/KdvXjsoR3oAGl4t0NoUKYHsk
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_getpw
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_getpw
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_getpw
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_getpw
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_getpw
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.extern
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug4: found StepId=783074.0
Feb 24 21:18:01 gen-test-01 pam_slurm_adopt[5357]: debug:  Leaving stepd_getpw

And here is where I tried to log in:
13:17:24-andrubr@slurmmaster01:~$ ssh gen-test-01
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe


Comment 43 Tim McMullan 2021-02-25 06:41:00 MST
That looks like pam_slurm_adopt got fairly far in the process.

Would you be able to repeat the experiment, but also attach the slurmd log at debug2 or higher? Would you also be able to attach your current slurm.conf?
Comment 44 Brian Andrus 2021-02-25 08:27:38 MST
Here is the slurmd debug log for the tasks:

  1.  Start an interactive session

07:25:44-andrubr@slurmmaster01:~$ srun -N1 --exclusive --account root --partition gen-test --time 01:00:00 --pty bash
srun: job 785369 queued and waiting for resources
srun: job 785369 has been allocated resources
[andrubr@gen-test-01 ~]$

  2.  Try to ssh to that node as that user

07:26:00-andrubr@slurmmaster01:~$ ssh gen-test-01
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe
07:26:38-andrubr@slurmmaster01:~$

  3.  End interactive session

[andrubr@gen-test-01 ~]$ exit
exit
07:26:52-andrubr@slurmmaster01:~$

Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: in the service_connection
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785366: ctime:1614266368 revoked:1614266432 expires:1614266552
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: destroying job 785366 state
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785367: ctime:1614266453 revoked:1614266454 expires:1614266575
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: destroying job 785367 state
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785368: ctime:1614266466 revoked:1614266511 expires:1614266632
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: destroying job 785368 state
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  Checking credential with 1984 bytes of sig data
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: _insert_job_state: we already have a job state for job 785369.  No big deal, just an FYI.
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  [job 785369] attempting to run prolog [/opt/hpc-admin/slurm/bin/prologue.sh]
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _spawn_prolog_stepd: call to _forkexec_slurmstepd
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: slurmstepd rank 0 (gen-test-01), parent rank -1 (NONE), children 0, depth 0, max_depth 0
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _spawn_prolog_stepd: return from _forkexec_slurmstepd 0
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: in the service_connection
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: launch task StepId=785369.0 request from UID:10043871 GID:1644000513 HOST:10.49.32.26 PORT:53790
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  Checking credential with 1984 bytes of sig data
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  Waiting for job 785369's prolog to complete
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  Finished wait for job 785369's prolog to complete
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: slurmstepd rank 0 (gen-test-01), parent rank -1 (NONE), children 0, depth 0, max_depth 0
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
Feb 25 15:25:56 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: in the service_connection
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Processing RPC: REQUEST_TERMINATE_JOB
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  _rpc_terminate_job, uid = 912 JobId=785369
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug3: state for jobid 785369: ctime:1614266756 revoked:0 expires:2147483647
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  credential for job 785369 revoked
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: found StepId=785369.extern
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: container signal 18 to StepId=785369.extern
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: found StepId=785369.extern
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: container signal 15 to StepId=785369.extern
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: sent SUCCESS
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug4: found StepId=785369.extern
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: set revoke expiration for jobid 785369 to 1614266932 UTS
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  Waiting for job 785369's prolog to complete
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  Finished wait for job 785369's prolog to complete
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  [job 785369] attempting to run epilog [/opt/hpc-admin/slurm/bin/epilogue.sh]
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  completed epilog for jobid 785369
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug:  JobId=785369: sent epilog complete msg: rc = 0
Feb 25 15:26:52 gen-test-01.fremont.lamrc.net slurmd[1140]: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB



Comment 45 Tim McMullan 2021-03-08 06:26:07 MST
I'm not seeing the step logs in the included logs. Are you using systemd to spawn the slurmd?  If so, you may want to set a "WorkingDirectory" in the unit file (usually just the log directory) and see if the step logs come back.

I'm still having no luck reproducing this. Would you please attach your slurm.conf file?  Running ssh with "-vv" may also help pin down where it's getting stuck.

Thanks,
--Tim
Comment 46 Brian Andrus 2021-03-09 09:09:19 MST
Created attachment 18306 [details]
slurmd.log

Tim,
I ran ssh -vvv and here is the relevant part at the end:
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 60
debug1: Server accepts key: pkalg rsa-sha2-512 blen 279
debug2: input_userauth_pk_ok: fp SHA256:R/vELFJwo4D4RWcmMH/KdvXjsoR3oAGl4t0NoUKYHsk
debug3: sign_and_send_pubkey: RSA SHA256:R/vELFJwo4D4RWcmMH/KdvXjsoR3oAGl4t0NoUKYHsk
debug3: send packet: type 50
debug3: receive packet: type 52
debug1: Authentication succeeded (publickey).
Authenticated to gen-test-01 ([10.49.38.112]:22).
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug3: send packet: type 90
debug1: Requesting no-more-sessions@openssh.com
debug3: send packet: type 80
debug1: Entering interactive session.
debug1: pledge: network
debug3: send packet: type 1
packet_write_wait: Connection to 10.49.38.112 port 22: Broken pipe

So it authenticated and then tried to enter an interactive session.
I am using systemd and always have it logging at debug level. I captured everything Slurm-related that was logged there when I tried to ssh in:

Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default)
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  slurm_conf_init: using config_file=/etc/slurm/slurm.conf
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: Connection by user andrubr: user has only one job 802261
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  _adopt_process: trying to get StepId=802261.extern to adopt 6346
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_add_extern_pid
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_get_x11_display
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: Process 6346 adopted into job 802261
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_getpw
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_getpw
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_getpw
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_getpw
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_getpw
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.extern
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug4: found StepId=802261.0
Mar 09 16:04:38 gen-test-01.fremont.lamrc.net pam_slurm_adopt[6346]: debug:  Leaving stepd_getpw

I also started slurmd with the "-vv" option and have attached that logfile.


Comment 47 Tim McMullan 2021-03-12 06:57:58 MST
Thank you for the additional logs!  So far, it looks like the process at least gets adopted.  I'll see if I can reproduce the ssh connection closing at a similar point, before the job ends.

Thanks again,
--Tim
Comment 57 Jason Booth 2021-08-18 11:53:24 MDT
Hi Brian - I wanted to give you an update here, since this bug has been a long-outstanding issue for you and for us. We have devoted substantial resources to tracking this down. However, we have not been able to find or duplicate this issue, which leaves you and us in a difficult situation.

There is some suspicion that the setuid()'d sudo command escapes from the cgroup freezer, so when it calls getpwuid() none of the stepd processes claim responsibility for it and the info is not reported back.
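
One way to probe the freezer hypothesis above is to compare the freezer cgroup of the job shell with that of the process that fails its lookup. The following is a minimal sketch, not something taken from this ticket: it assumes a cgroup v1 node, the helper names freezer_cgroup and freezer_cgroup_of are hypothetical, and any path it prints depends on the local cgroup layout.

```python
from pathlib import Path
from typing import Optional


def freezer_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the cgroup v1 freezer path found in /proc/<pid>/cgroup content, or None."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) == 3 and "freezer" in parts[1].split(","):
            return parts[2]
    return None


def freezer_cgroup_of(pid: str = "self") -> Optional[str]:
    """Read /proc/<pid>/cgroup and return the freezer path, or None if it cannot be read."""
    try:
        return freezer_cgroup(Path(f"/proc/{pid}/cgroup").read_text())
    except OSError:
        return None


if __name__ == "__main__":
    # Prints the freezer cgroup of this process, or None when no v1 freezer entry exists.
    print(freezer_cgroup_of())
```

Calling freezer_cgroup_of with the PID of the process in question (for example from a root shell on the node) and comparing the result with the value for the job shell would show whether the two still share a freezer cgroup. On a node without a v1 freezer hierarchy the function returns None.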

With the upcoming release of 21.08 we have completely revamped cgroup v1, so we have a couple of suggestions.

1. Try 21.08 out in a test environment where you can duplicate this and see if you can reproduce. 21.08 also comes with improved logging around cgroups which may help pinpoint the issue if you can indeed reproduce this on 21.08.

2. If this is not a possibility, or 21.08 has the same issue, then we can of course look at the logs from 21.08. However, it would be more beneficial to us if you could construct a reproducer with clear steps. What I mean by this is that we have tried on different OSes, with different identity management systems. If you can try to outline the exact steps and config that lead to this situation, then perhaps we could duplicate it.

Longer term, if we cannot make progress here then I cannot justify allocating more resources to tackle this issue, since this does seem to be unique to your site.

Please give this some thought and let us know how you would like to proceed.


Jason Booth
Director of Support
Comment 58 Brian Andrus 2021-08-30 12:30:22 MDT
Tim,

So I just updated to slurm-21.08 and it looks like things work fine.
I was able to start a job on a node, then ssh to that node (so nss_slurm allowed access), then from that session I was able to 'sudo -i' which allowed sudo access for my account because it is part of an AD group which is allowed sudo access.
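
A quick way to confirm from inside such an ssh session that passwd and group lookups resolve is to ask NSS directly. This is a minimal sketch under stated assumptions: the helper names lookup_user and lookup_groups are hypothetical, and the script cannot tell which NSS source (nss_slurm, sssd, or local files) answered, since that depends on nsswitch.conf.

```python
import grp
import os
import pwd
from typing import List, Optional


def lookup_user(uid: int) -> Optional[str]:
    """Return the login name NSS reports for uid, or None if the lookup fails."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def lookup_groups(gids: List[int]) -> List[str]:
    """Map each gid to its group name; unresolved gids are shown as '?<gid>'."""
    names = []
    for gid in gids:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            names.append(f"?{gid}")
    return names


if __name__ == "__main__":
    # Show what the current process sees for its own uid and supplementary groups.
    print("user:", lookup_user(os.getuid()))
    print("groups:", lookup_groups(os.getgroups()))
```

A None user or any "?<gid>" entry would indicate that a lookup failed for this process, which is the kind of failure sudo's PAM account step would also hit.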

So, whatever updates/changes were made, it made this work as expected. You may close this ticket.


Brian Andrus - HPC Systems
brian.andrus@lamresearch.com

Comment 59 Jason Booth 2021-08-30 12:34:12 MDT
Resolving as cannot reproduce.