Ticket 12110 - RFE: introduce SLURM_JOB_COMMENT in prologs/epilogs scripts
Summary: RFE: introduce SLURM_JOB_COMMENT in prologs/epilogs scripts
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.x
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Tim Wickberg
QA Contact:
URL:
Duplicates: 9331
Depends on:
Blocks:
 
Reported: 2021-07-22 15:42 MDT by Felix Abecassis
Modified: 2022-04-28 15:58 MDT

See Also:
Site: NVIDIA (PSLA)
Version Fixed: 22.05.0pre1
Target Release: 22.05
DevPrio: 1 - Paid


Description Felix Abecassis 2021-07-22 15:42:08 MDT
Per https://slurm.schedmd.com/prolog_epilog.html, a wide range of information about the job is available in slurmd prolog/epilog scripts through environment variables. 

For example, SLURM_JOB_CONSTRAINTS is useful for us, as we sometimes need to do additional per-job setup when the node was rebooted with a particular dynamic node feature (thanks to the helper node_features plugin from https://bugs.schedmd.com/show_bug.cgi?id=9567).

For node tweaks that don't require a reboot, we rely on --comment. For instance, to disable THP a user can pass "--comment=transparent_hugepage=never" and the change is applied to the node in a Slurm epilog script. However, today we have to call "scontrol" to query the job's "Comment" field, which the documentation above recommends against:
> Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). Long running scripts can cause scheduling problems when jobs take a long time to start or finish. Slurm commands in these scripts can potentially lead to performance issues and should not be used.

There are multiple ways to work around this, but adding SLURM_JOB_COMMENT (behaving similarly to SLURM_JOB_CONSTRAINTS) is probably straightforward to do and would eliminate the call to scontrol.
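
To illustrate the request, here is a minimal sketch of how a prolog/epilog helper might consume SLURM_JOB_COMMENT if it were exported, with no call to scontrol. Several things here are assumptions for illustration only: the comma-separated key=value layout of the comment, the ALLOWED_TWEAKS table, and the helper names parse_comment and select_tweaks. Because the comment is free-form user input, the sketch only writes values that appear on an allow-list.

```python
#!/usr/bin/env python3
"""Sketch: apply node tweaks requested through the job comment."""
import os

# Hypothetical allow-list: tweak name -> (sysfs path, permitted values).
ALLOWED_TWEAKS = {
    "transparent_hugepage": (
        "/sys/kernel/mm/transparent_hugepage/enabled",
        {"always", "madvise", "never"},
    ),
}


def parse_comment(comment):
    """Return key=value pairs from a comma-separated comment (assumed layout)."""
    tweaks = {}
    for item in comment.split(","):
        key, sep, value = item.partition("=")
        key, value = key.strip(), value.strip()
        # Items without "=" (plain free text) are ignored.
        if sep and key and value:
            tweaks[key] = value
    return tweaks


def select_tweaks(comment, allowed=ALLOWED_TWEAKS):
    """Map sysfs path -> value for tweaks whose name and value are allowed."""
    selected = {}
    for key, value in parse_comment(comment).items():
        if key in allowed and value in allowed[key][1]:
            selected[allowed[key][0]] = value
    return selected


def main():
    # An unset or empty variable simply means there is nothing to apply.
    comment = os.environ.get("SLURM_JOB_COMMENT", "")
    for path, value in select_tweaks(comment).items():
        with open(path, "w") as handle:
            handle.write(value)


if __name__ == "__main__":
    main()
```

With "--comment=transparent_hugepage=never", select_tweaks returns a single entry mapping the THP sysfs file to "never"; anything not on the allow-list is dropped silently.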
Comment 1 Julie Bernauer 2021-07-22 15:43:06 MDT
I am OOO until Aug 2nd. Email replies will be delayed.
Comment 2 Tim Wickberg 2021-07-26 11:27:17 MDT
Felix - 

If NVIDIA's interested in sponsoring work around this for 22.05 we can discuss that. I'll ask Jess to set up a call to discuss this and other potential development work in a couple of weeks once the release work for 21.08 quiets down.

- Tim
Comment 3 Jess 2021-07-26 19:00:44 MDT
Sounds good :) I'll get a call set up the week of August 20th.
Comment 4 Tim Wickberg 2022-04-28 15:28:32 MDT
Hey folks -

Work on this is complete, and will be included in 22.05 when released.

Environment variables accessible in Prolog/Epilog/PrologSlurmctld/EpilogSlurmctld have been expanded to include:

SLURM_JOB_COMMENT
SLURM_JOB_STDERR
SLURM_JOB_STDIN
SLURM_JOB_STDOUT
SLURM_JOB_PARTITION
SLURM_JOB_ACCOUNT
SLURM_JOB_RESERVATION
SLURM_JOB_CONSTRAINTS
SLURM_JOB_NUM_HOSTS
SLURM_JOB_CPUS_PER_NODE
SLURM_JOB_NTASKS
SLURM_JOB_RESTART_COUNT

And for Epilog/EpilogSlurmctld, the exit codes are additionally accessible through:
SLURM_JOB_DERIVED_EC
SLURM_JOB_EXIT_CODE
SLURM_JOB_EXIT_CODE2
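
As a rough sketch of how a script might pick these up, the snippet below collects whichever of the variables listed above are present in the environment and prints them as one JSON line, for example for logging. The job_context helper and the JSON output are illustrative assumptions, not part of Slurm. Values are kept as raw strings because their exact formats are not described in this ticket.

```python
#!/usr/bin/env python3
"""Sketch: snapshot the job-related variables visible to a prolog/epilog."""
import json
import os

# Variable names taken from the list in this ticket.
JOB_VARS = (
    "SLURM_JOB_COMMENT",
    "SLURM_JOB_STDERR",
    "SLURM_JOB_STDIN",
    "SLURM_JOB_STDOUT",
    "SLURM_JOB_PARTITION",
    "SLURM_JOB_ACCOUNT",
    "SLURM_JOB_RESERVATION",
    "SLURM_JOB_CONSTRAINTS",
    "SLURM_JOB_NUM_HOSTS",
    "SLURM_JOB_CPUS_PER_NODE",
    "SLURM_JOB_NTASKS",
    "SLURM_JOB_RESTART_COUNT",
)

# Only expected in Epilog/EpilogSlurmctld.
EPILOG_ONLY_VARS = (
    "SLURM_JOB_DERIVED_EC",
    "SLURM_JOB_EXIT_CODE",
    "SLURM_JOB_EXIT_CODE2",
)


def job_context(environ=os.environ, epilog=False):
    """Return the subset of known job variables that are actually set."""
    names = JOB_VARS + (EPILOG_ONLY_VARS if epilog else ())
    return {name: environ[name] for name in names if name in environ}


if __name__ == "__main__":
    # Hypothetical usage: emit one JSON line for the epilog's log.
    print(json.dumps(job_context(epilog=True), sort_keys=True))
```

Variables that are unset are left out instead of being reported as empty, so the same helper can be shared between prolog and epilog scripts.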
Comment 5 Tim Wickberg 2022-04-28 15:29:56 MDT
*** Ticket 9331 has been marked as a duplicate of this ticket. ***