Per https://slurm.schedmd.com/prolog_epilog.html, a wide range of information about the job is available in slurmd prolog/epilog scripts through environment variables. For example, SLURM_JOB_CONSTRAINTS is useful for us, as we sometimes need to do additional per-job setup when the node was rebooted with a particular dynamic node feature (thanks to the helper node_features plugin from https://bugs.schedmd.com/show_bug.cgi?id=9567).

For node tweaks that don't require a reboot, we rely on --comment. For instance, to disable THP a user can pass "--comment=transparent_hugepage=never" and the change is applied to the node in a Slurm epilog script.

However, today we have to rely on "scontrol" to query the "Comment" field of the job, which is not recommended per the documentation above:

> Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). Long running scripts can cause scheduling problems when jobs take a long time to start or finish. Slurm commands in these scripts can potentially lead to performance issues and should not be used.

There are multiple ways to work around this, but adding SLURM_JOB_COMMENT (behaving similarly to SLURM_JOB_CONSTRAINTS) is probably straightforward to do and would eliminate the call to scontrol.
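To illustrate the request, here is a minimal sketch of how a script could consume the proposed SLURM_JOB_COMMENT variable without calling scontrol. It is not our actual script: the helper names (parse_comment, requested_thp_mode) are hypothetical, the comma-separated "key=value" comment layout is an assumption, and the sysfs path is the standard kernel THP knob.

```python
#!/usr/bin/env python3
"""Sketch of a prolog/epilog helper that reads a node tweak from the job comment."""
import os

# Values the kernel accepts for the THP "enabled" knob.
THP_MODES = {"always", "madvise", "never"}
THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_comment(comment):
    """Split a comment such as "a=1,b=2" into a dict; items without '=' are ignored.

    The comma-separated layout is an assumption made for this sketch.
    """
    settings = {}
    for item in comment.split(","):
        key, sep, value = item.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def requested_thp_mode(environ):
    """Return the requested THP mode, or None if absent or not a known mode."""
    # SLURM_JOB_COMMENT is the variable proposed in this ticket.
    comment = environ.get("SLURM_JOB_COMMENT", "")
    mode = parse_comment(comment).get("transparent_hugepage")
    return mode if mode in THP_MODES else None


if __name__ == "__main__":
    mode = requested_thp_mode(os.environ)
    if mode is not None:
        # Needs root; only reached when the job asked for a valid mode.
        with open(THP_PATH, "w") as f:
            f.write(mode)
```

Validating against a fixed set of modes keeps arbitrary user-supplied comment text from being written to sysfs.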
I am OOO until Aug 2nd. Email replies will be delayed.
Felix - If NVIDIA's interested in sponsoring work around this for 22.05, we can discuss that. I'll ask Jess to set up a call to discuss this and other potential development work in a couple of weeks, once the release work for 21.08 quiets down. - Tim
Sounds good :) I'll get a call set up the week of August 20th
Hey folks - Work on this is complete, and will be included in 22.05 when released.

Environment variables accessible in Prolog/Epilog/PrologSlurmctld/EpilogSlurmctld have been expanded to include:

  SLURM_JOB_COMMENT
  SLURM_JOB_STDERR
  SLURM_JOB_STDIN
  SLURM_JOB_STDOUT
  SLURM_JOB_PARTITION
  SLURM_JOB_ACCOUNT
  SLURM_JOB_RESERVATION
  SLURM_JOB_CONSTRAINTS
  SLURM_JOB_NUM_HOSTS
  SLURM_JOB_CPUS_PER_NODE
  SLURM_JOB_NTASKS
  SLURM_JOB_RESTART_COUNT

And for Epilog/EpilogSlurmctld the exit codes are additionally accessible through:

  SLURM_JOB_DERIVED_EC
  SLURM_JOB_EXIT_CODE
  SLURM_JOB_EXIT_CODE2
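As a rough illustration of how a site script might pick these up, the sketch below gathers whichever of the listed variables are set and formats them as a single line. The function names (collect_job_env, format_summary) are hypothetical, the values are treated as opaque strings since their exact formats are not described here, and where the printed line ends up depends on how the site runs its Epilog.

```python
#!/usr/bin/env python3
"""Sketch: collect the job variables listed above inside a Prolog/Epilog script."""
import os

# Variables listed above for Prolog/Epilog/PrologSlurmctld/EpilogSlurmctld.
COMMON_VARS = (
    "SLURM_JOB_COMMENT", "SLURM_JOB_STDERR", "SLURM_JOB_STDIN",
    "SLURM_JOB_STDOUT", "SLURM_JOB_PARTITION", "SLURM_JOB_ACCOUNT",
    "SLURM_JOB_RESERVATION", "SLURM_JOB_CONSTRAINTS", "SLURM_JOB_NUM_HOSTS",
    "SLURM_JOB_CPUS_PER_NODE", "SLURM_JOB_NTASKS", "SLURM_JOB_RESTART_COUNT",
)
# Variables listed above for Epilog/EpilogSlurmctld only.
EPILOG_ONLY_VARS = (
    "SLURM_JOB_DERIVED_EC", "SLURM_JOB_EXIT_CODE", "SLURM_JOB_EXIT_CODE2",
)


def collect_job_env(environ, include_exit_codes=False):
    """Return the listed variables that are actually set, as opaque strings."""
    names = COMMON_VARS + (EPILOG_ONLY_VARS if include_exit_codes else ())
    return {name: environ[name] for name in names if name in environ}


def format_summary(job_env):
    """Render the collected variables as one 'NAME=value' line, sorted by name."""
    return " ".join(f"{name}={value}" for name, value in sorted(job_env.items()))


if __name__ == "__main__":
    # In an Epilog the exit-code variables should also be present.
    print(format_summary(collect_job_env(os.environ, include_exit_codes=True)))
```

Reading everything from the environment keeps the script free of squeue/scontrol calls, in line with the prolog/epilog documentation quoted earlier in this ticket.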
*** Ticket 9331 has been marked as a duplicate of this ticket. ***