We're looking at integrating NVIDIA's DCGM GPU reporting into our Slurm setup per the instructions in this blog post (https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/), which boils down to a prolog and an epilog script.

Prolog:

    # DCGM job statistics
    group=$(sudo -u $SLURM_JOB_USER dcgmi group -c allgpus --default)
    if [ $? -eq 0 ]; then
      groupid=$(echo $group | awk '{print $10}')
      sudo -u $SLURM_JOB_USER dcgmi stats -g $groupid -e
      sudo -u $SLURM_JOB_USER dcgmi stats -g $groupid -s $SLURM_JOBID
    fi

Epilog:

    # DCGM job statistics
    OUTPUTDIR=$(scontrol show job $SLURM_JOBID | grep WorkDir | cut -d = -f 2)
    sudo -u $SLURM_JOB_USER dcgmi stats -x $SLURM_JOBID
    sudo -u $SLURM_JOB_USER dcgmi stats -v -j $SLURM_JOBID | \
      sudo -u $SLURM_JOB_USER tee $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOBID.out

We implemented this using the PrologSlurmctld and EpilogSlurmctld parameters in slurm.conf; however, the Epilog script seems to run into some issues with the Slurm user using sudo. When logging errors from the commands in the epilog, it looks like the SlurmUser is having issues using sudo to run commands as the job user:

    We trust you have received the usual lecture from the local System
    Administrator. It usually boils down to these three things:

        #1) Respect the privacy of others.
        #2) Think before you type.
        #3) With great power comes great responsibility.

    sudo: no tty present and no askpass program specified

Interestingly, this doesn't seem to happen with the sudo commands in the prolog. Is this happening because the job has already ended by the time the epilog script runs, and the SlurmUser is thus unable to run commands as the job user anymore?

I would be incredibly grateful for any assistance on getting this working!
Please attach your slurm.conf.

Slurm has SlurmUser and SlurmdUser.

> SlurmUser = slurm(1000)
> SlurmdUser = root(0)

The epilog should run as the SlurmdUser, which should be root on all systems except with some exotic configurations.
https://slurm.schedmd.com/prolog_epilog.html

You might want to put some debugging in the Epilog that runs "whoami" just to see which user this is, and then confirm whether that user has sudo NOPASSWD access. By default, root should be able to do this without a password prompt from the command.
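The "whoami" debugging suggested above could look something like the following. This is only a sketch: the LOGFILE location and the log_identity helper name are assumptions made for illustration, not something Slurm provides. It also records PATH, since prolog and epilog scripts tend to run with a different environment than an interactive shell.

```shell
#!/bin/bash
# Hypothetical debug helper for a Slurm prolog/epilog script.
# LOGFILE is an assumed location; pick a path the script's user can write to.
LOGFILE="${LOGFILE:-/tmp/slurm-epilog-debug.log}"

log_identity() {
    # Record which user the script runs as and which PATH it inherits.
    {
        echo "$(date '+%F %T') job=${SLURM_JOBID:-unknown} user=$(id -un) uid=$(id -u)"
        echo "PATH=$PATH"
    } >> "$LOGFILE"
}

log_identity
```

Calling log_identity at the top of both the prolog and the epilog makes it easy to compare which user and PATH each script actually runs with.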
Created attachment 30922: slurm.conf
Hi Jason,

Thanks for your response! Please find our slurm.conf attached.

I tried logging "whoami" in the epilog, and it looks like this is running as user slurm. I tried changing PrologSlurmctld and EpilogSlurmctld to simply "Prolog" and "Epilog," but I then see the Prolog script fail on start and the nodes drain.

If there are any Slurm-side config changes that would help here, please let me know.
Thank you for uploading that information.

> Interestingly this doesn't seem to happen with the sudo commands in the prolog.

Can you confirm whether the user slurm is part of the sudoers group or has an entry in the sudoers file?

> slurm ALL=(ALL) NOPASSWD:/path/to/command1
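One way to check this from inside the script itself is to call sudo in non-interactive mode (-n), which fails immediately instead of prompting when a password would be required. The following is a minimal sketch; the can_sudo_as helper name is hypothetical, and the check only runs when SLURM_JOB_USER is set, as it is inside a prolog or epilog.

```shell
#!/bin/bash
# Return 0 if the current user can run a command as user $1 without a
# password prompt. "sudo -n" never prompts, so it fails right away when
# no NOPASSWD rule applies (a prolog/epilog has no tty to prompt on).
can_sudo_as() {
    local target_user="$1"
    sudo -n -u "$target_user" true 2>/dev/null
}

# Example use inside a prolog/epilog, where Slurm sets SLURM_JOB_USER.
if [ -n "${SLURM_JOB_USER:-}" ]; then
    if can_sudo_as "$SLURM_JOB_USER"; then
        echo "sudo as $SLURM_JOB_USER works without a password"
    else
        echo "sudo as $SLURM_JOB_USER would prompt or is not permitted" >&2
    fi
fi
```

If the check fails for the slurm user but succeeds for root, that points at the sudoers configuration rather than at Slurm.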
Hi Jason,

Thanks for your help, I was able to solve the issue. It turns out I just needed to change from PrologSlurmctld/EpilogSlurmctld to Prolog/Epilog; the errors were due to a PATH difference in the binaries being called by slurm vs. root.

Thanks!
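For anyone hitting the same PATH difference, a guard along these lines at the top of the prolog and epilog may help. It is a sketch under assumptions: the directories in PATH are guesses about where dcgmi and scontrol are installed on the compute nodes, and require_cmd is a hypothetical helper, not part of Slurm or DCGM.

```shell
#!/bin/bash
# Prolog/epilog scripts may not inherit a login shell's PATH, so set it
# explicitly. These directories are assumptions; adjust them to wherever
# dcgmi and scontrol are installed on your nodes.
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Print the absolute path of a required command, or return 1 with a
# message on stderr if it cannot be found in PATH.
require_cmd() {
    local resolved
    resolved=$(command -v "$1") || {
        echo "required command not found in PATH: $1" >&2
        return 1
    }
    echo "$resolved"
}
```

In the scripts this could be used as `DCGMI=$(require_cmd dcgmi) || exit 0`, followed by calls to "$DCGMI" instead of a bare dcgmi. Exiting 0 when the binary is missing is a deliberate choice here: a failing Prolog drains the node, so skipping the statistics is usually preferable to taking the node out of service.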