Ticket 3685

Summary: Support for a potential 'BatchProlog' / 'BatchEpilog' script to allow slurmd to output header/footer info to job output
Product: Slurm Reporter: Bjørn-Helge Mevik <b.h.mevik>
Component: User Commands Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: sts, uemit.seren
Version: 17.02.1   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3207
https://bugs.schedmd.com/show_bug.cgi?id=8107
Site: Sigma2 Norway

Description Bjørn-Helge Mevik 2017-04-11 06:44:45 MDT
(This is with Slurm 17.02.1-2)

If a job has --output that contains "%x", "scontrol show job" will not substitute that with the job name.  For instance:

413 (1) $ sbatch --wrap='sleep 60' -A nn9999k -t 1:0:0 -N 4 --output='%x.out'
Submitted batch job 1437
414 (1) $ scontrol show job 1437
JobId=1437 JobName=wrap
   UserId=bhm(51568) GroupId=bhm(51568) MCS_label=N/A
   Priority=19940 Nice=0 Account=nn9999k QOS=nn9999k
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2017-04-11T14:41:52 EligibleTime=2017-04-11T14:41:52
   StartTime=2017-04-11T14:41:52 EndTime=2017-04-11T15:41:52 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-1-2:4143709
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c29-[2-5]
   BatchHost=c29-2
   NumNodes=4 NumCPUs=128 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=128,mem=240G,node=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=60G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/cluster/nird/home/bhm/testjobs
   StdErr=/cluster/nird/home/bhm/testjobs/%x.out
   StdIn=/dev/null
   StdOut=/cluster/nird/home/bhm/testjobs/%x.out
   Power=

We use scontrol show job in the prolog and epilog scripts to write info into the job's stdout file.  This is a minor issue, since it is easy to do the substitution ourselves, but it would be nice not to have to do it. :)
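
For sites that do the substitution themselves, a minimal Bash sketch might look like the following. It assumes that only %x (job name), %j (job ID), %u (user name) and %% need expanding; the function name expand_output_pattern is hypothetical, and specifiers such as %N or %n, which depend on where the job runs, are deliberately left untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): expand a subset of the sbatch
# filename-pattern specifiers in an output path.
# Handled: %x (job name), %j (job ID), %u (user name), %% (literal %).
# Everything else (e.g. %N, %n, %t) is copied through unchanged.
expand_output_pattern() {
    local pattern=$1 job_name=$2 job_id=$3 user=$4
    local out="" ch next i
    for (( i = 0; i < ${#pattern}; i++ )); do
        ch=${pattern:i:1}
        # Ordinary character, or a lone '%' at the very end: copy as-is.
        if [[ $ch != "%" ]] || (( i + 1 >= ${#pattern} )); then
            out+=$ch
            continue
        fi
        next=${pattern:i+1:1}
        case $next in
            x)   out+=$job_name; i=$(( i + 1 )) ;;
            j)   out+=$job_id;   i=$(( i + 1 )) ;;
            u)   out+=$user;     i=$(( i + 1 )) ;;
            "%") out+="%";       i=$(( i + 1 )) ;;
            *)   out+="%" ;;  # unknown specifier: keep it verbatim
        esac
    done
    printf '%s\n' "$out"
}

# Example with the values from the job shown above:
expand_output_pattern '/cluster/nird/home/bhm/testjobs/%x.out' wrap 1437 bhm
# prints /cluster/nird/home/bhm/testjobs/wrap.out
```

Walking the pattern character by character, rather than piping it through sed, avoids problems when the job name itself contains characters that are special to sed, such as "/" or "&".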
Comment 1 Tim Wickberg 2017-04-11 09:15:48 MDT
There's no safe way to handle the format substitution there for all jobs, and as such I'm disinclined to change this.

Keep in mind that %x is not the only format option available: %n and %N in particular cannot be anticipated before the job starts.

If you don't mind me redirecting this, can you elaborate on how you're using this in the Prolog/Epilog?

I've heard similar requests for ways to put some output in the batch job output automatically, and would be curious to understand that better. There might be a reasonable enhancement we could make in 17.11 to cover that use case, but it'd help me to know more about how it would work.

If there was a version of a Prolog/Epilog script that (a) ran once per job, and (b) had its output inserted into the users' StdOut would that cover this? Say a new set of configuration options something like "BatchProlog/BatchEpilog"?
Comment 2 Bjørn-Helge Mevik 2017-04-12 02:07:41 MDT
(In reply to Tim Wickberg from comment #1)

> If you don't mind me redirecting this, can you elaborate on how you're using
> this in the Prolog/Epilog?

Sure, no problem: We are using it to print a header ("Starting job $SLURM_JOB_ID on $SLURM_NODELIST at $(date)") and a footer including output from sacct (for instance "sacct -j $SLURM_JOB_ID -o JobID,JobName,AllocCPUs,NTasks,MinCPU,MinCPUTask,AveCPU,Elapsed,ExitCode") into the stdout file.  The idea is (among other things) to make users more aware of how their jobs actually performed with respect to the resources they asked for (like memory or walltime).

Basically, we parse the output from "scontrol show job --oneliner $SLURM_JOB_ID" and put it into a bash associative array variable $job.  Then we use:

---- snip ----
## Do the stuff that should only be done once, on the head node:
if [[ $SLURMD_NODENAME == ${job[BatchHost]} ]]; then
    ... other stuff ...

    ## Run epilog_slurmd.user for batch jobs (only):
    if [[ ${job[BatchFlag]} == 1 ]]; then
       export STDOUT_FILE=$(echo "${job[StdOut]}" | sed "s/%x/${job[JobName]}/g")
       su "$SLURM_JOB_USER" -c /node/sbin/epilog_slurmd.user
    fi
fi
---- snip ----

and epilog_slurmd.user does
---- snip ----
## Make sure stdout file exists or is created with the right owner:
if [[ ! -f "$STDOUT_FILE" ]]; then
    touch "$STDOUT_FILE"
    chown "$USER_DOT_GROUP" "$STDOUT_FILE"
fi
## Append usage stats to stdout file:
{
    echo
    echo "Task and CPU usage stats:"
    sacct -j "$SLURM_JOB_ID" -o JobID,JobName,AllocCPUs,NTasks,MinCPU,MinCPUTask,AveCPU,Elapsed,ExitCode
    ... more stuff ...
} >> "$STDOUT_FILE"
exit 0 # Needed in case directory of $STDOUT_FILE has been removed
---- snip ----

(And a similar setup for the prolog.)  This is simplified a bit, of course.  The setup is a bit finicky, with a lot of small details that have to be taken care of, but it works fairly well.  We used to have this implemented as a shell script that jobs were supposed to source at the start, and which used a shell trigger function to print out the footer, but users kept forgetting to source the file. :)
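
The parsing step mentioned above (filling $job from the "scontrol show job --oneliner" output) is not shown in the snippets. A minimal sketch of how it could be done follows; the function name parse_job_oneliner is hypothetical, and the sketch assumes that no value contains whitespace, which will not hold for every job (a job name or command line may contain spaces).

```shell
#!/usr/bin/env bash
# Sketch only: parse a line of "Key=Value Key=Value ..." pairs, the shape of
# `scontrol show job --oneliner` output, into the associative array "job".
# Assumption: values contain no whitespace.
declare -A job

parse_job_oneliner() {
    local line=$1 field
    local -a fields
    # Split on whitespace without globbing (values such as "*" stay literal).
    read -r -a fields <<< "$line"
    for field in "${fields[@]}"; do
        [[ $field == *=* ]] || continue      # skip tokens without '='
        # Key is everything before the first '=', value everything after it,
        # so "TRES=cpu=128,mem=240G" keeps its embedded '=' in the value.
        job["${field%%=*}"]=${field#*=}
    done
}

# In a prolog or epilog this would be fed from scontrol, for example:
#   parse_job_oneliner "$(scontrol show job --oneliner "$SLURM_JOB_ID")"
# Here a shortened sample line is used instead:
parse_job_oneliner 'JobId=1437 JobName=wrap BatchFlag=1 BatchHost=c29-2 TRES=cpu=128,mem=240G,node=4 Power='
echo "BatchHost is ${job[BatchHost]}, BatchFlag is ${job[BatchFlag]}"
# prints: BatchHost is c29-2, BatchFlag is 1
```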

> If there was a version of a Prolog/Epilog script that (a) ran once per job,
> and (b) had its output inserted into the users' StdOut would that cover
> this? Say a new set of configuration options something like
> "BatchProlog/BatchEpilog"?

That would definitely cover our needs, and make our prolog/epilog setup much simpler!
Comment 4 Tim Wickberg 2017-04-18 20:23:26 MDT
Re-marking this as an Enhancement request. No promises on when/if this may happen, although I'd like to see something in this vein done.

There are a few architectural hurdles we need to discuss internally. For one, the most common uses seem to involve printing accounting records from 'sacct', and the final information isn't pushed there until after the Epilog finishes. So a hypothetical 'BatchEpilog' would need to either rely on some other source of accounting data, or be run after the traditional Epilog has completed.