Ticket 17033 - EpilogSlurmctld sudo error
Summary: EpilogSlurmctld sudo error
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 22.05.6
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Jason Booth
 
Reported: 2023-06-22 14:52 MDT by Alex Mamach
Modified: 2023-06-23 13:32 MDT
Site: Memorial Sloan Kettering Cancer Center


Attachments
slurm.conf (1.65 KB, text/plain)
2023-06-23 10:13 MDT, Alex Mamach

Description Alex Mamach 2023-06-22 14:52:49 MDT
We're looking at integrating NVIDIA's DCGM GPU reporting into our Slurm setup per the instructions in this blog post (https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/) which boils down to a prolog and epilog script.

Prolog:


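# DCGM job statistics: create a GPU group, then enable and start per-job recording
group=$(sudo -u "$SLURM_JOB_USER" dcgmi group -c allgpus --default)
if [ $? -eq 0 ]; then
  # The group ID is the 10th whitespace-separated field of the dcgmi output
  groupid=$(echo $group | awk '{print $10}')
  sudo -u "$SLURM_JOB_USER" dcgmi stats -g "$groupid" -e
  sudo -u "$SLURM_JOB_USER" dcgmi stats -g "$groupid" -s "$SLURM_JOBID"
fi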

Epilog:

# DCGM job statistics: stop recording and write the report to the job's working directory
OUTPUTDIR=$(scontrol show job "$SLURM_JOBID" | grep WorkDir | cut -d = -f 2)
sudo -u "$SLURM_JOB_USER" dcgmi stats -x "$SLURM_JOBID"
sudo -u "$SLURM_JOB_USER" dcgmi stats -v -j "$SLURM_JOBID" | \
sudo -u "$SLURM_JOB_USER" tee "$OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOBID.out"

We implemented this using the PrologSlurmctld and EpilogSlurmctld parameters in slurm.conf. However, the epilog script seems to fail when the SlurmUser calls sudo. When we log the errors from the commands in the epilog, it looks like the SlurmUser cannot use sudo to run commands as the job user:

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

sudo: no tty present and no askpass program specified

Interestingly this doesn't seem to happen with the sudo commands in the prolog. Is this happening because the job has already ended by the time the epilog script runs, and the SlurmUser is thus unable to run commands as the job user anymore?

I would be incredibly grateful for any assistance on getting this working!
Comment 1 Jason Booth 2023-06-22 17:38:23 MDT
Please attach your slurm.conf

Slurm has SlurmUser and SlurmdUser.

> SlurmUser               = slurm(1000)
> SlurmdUser              = root(0)

The epilog should run as the SlurmdUser, which should be root on all systems except those with exotic configurations.

https://slurm.schedmd.com/prolog_epilog.html

You might want to put in some debugging in the Epilog that runs "whoami" just to see which user this is and then confirm if that user has sudo nopasswd access. By default, root should be able to do this without a password prompt from the command.
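
A minimal sketch of that debugging step, assuming a bash epilog. The log path, the EPILOG_DEBUG_LOG variable, and the function name are illustrative and not part of Slurm. The -n flag makes sudo fail instead of prompting for a password, so the check cannot hang the script.

```shell
#!/bin/bash
# Hypothetical debugging helper for a prolog or epilog script.
# EPILOG_DEBUG_LOG and the default log path are assumptions for illustration.
LOGFILE="${EPILOG_DEBUG_LOG:-/tmp/epilog-debug.log}"

log_epilog_identity() {
  {
    # Which user is the script actually running as?
    echo "user=$(id -un) uid=$(id -u)"
    # Record PATH too, since it can differ between users and daemons.
    echo "PATH=$PATH"
    # sudo -n never prompts: it succeeds only if no password is required.
    if sudo -n true 2>/dev/null; then
      echo "sudo_nopasswd=yes"
    else
      echo "sudo_nopasswd=no"
    fi
  } >> "$LOGFILE"
}

log_epilog_identity
```

Comparing the logged user and PATH between the prolog and the epilog should show whether the two scripts run in the same environment.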
Comment 2 Alex Mamach 2023-06-23 10:13:26 MDT
Created attachment 30922
slurm.conf
Comment 3 Alex Mamach 2023-06-23 10:14:11 MDT
Hi Jason,

Thanks for your response! Please find our slurm.conf attached. I tried logging "whoami" in the epilog, and it looks like it is running as user slurm.

I tried changing PrologSlurmctld and EpilogSlurmctld to simply "Prolog" and "Epilog," but I then see the Prolog script fail on start and the nodes drain.

If there are any Slurm-side config changes that would help here, please let me know.
Comment 4 Jason Booth 2023-06-23 10:37:13 MDT
Thank you for uploading that information.

> Interestingly this doesn't seem to happen with the sudo commands in the prolog.

Can you confirm whether the user slurm has sudo rights, either through group membership or through an entry in the sudoers file? For example:

> slurm   ALL=(ALL) NOPASSWD:/path/to/command1
Comment 5 Alex Mamach 2023-06-23 13:32:31 MDT
Hi Jason,

Thanks for your help. I was able to solve the issue. It turns out I just needed to change to Prolog/Epilog from PrologSlurmctld/EpilogSlurmctld; the errors were due to a PATH difference in binaries being called by slurm vs root.

Thanks!
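
For reference, one way to guard a prolog or epilog against this kind of PATH difference is to set PATH explicitly at the top of the script and check that each binary resolves before using it. This is only a sketch: the PATH value and the require_cmd helper are assumptions, not something taken from this ticket or from Slurm, and the directory list should match wherever sudo, scontrol, and dcgmi are actually installed.

```shell
#!/bin/bash
# Assumed PATH; adjust to the directories used on the compute nodes.
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Hypothetical helper: return 0 if the command resolves in PATH,
# otherwise report it on stderr and return 1.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing command: $1 (PATH=$PATH)" >&2
    return 1
  fi
}

# Check every binary the DCGM prolog/epilog relies on.
missing=0
for cmd in sudo scontrol dcgmi; do
  require_cmd "$cmd" || missing=1
done

if [ "$missing" -ne 0 ]; then
  echo "skipping DCGM job statistics: required commands not found" >&2
fi
```

The stderr message includes the PATH that was searched, which makes a mismatch between the slurm and root environments visible in the log.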