Ticket 737

Summary: More detail in emails from slurm
Product: Slurm Reporter: Josko Plazonic <plazonic>
Component: Other Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: da
Version: 14.03.0   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE)
Version Fixed: 14.03.2 Target Release: ---

Description Josko Plazonic 2014-04-22 07:33:24 MDT
Our users have been complaining that they do not get much info from slurm's emails.  E.g. this is what a user reported:

The emails from slurm are much less informative, with empty body and a
title like "SLURM Job_id=82771 Name=ovlp-gf-P12-7 Ended, Run time
13:13:32". More importantly, it does NOT report the fact that the job
actually failed, as shown in log:
slurmstepd: Job 82771 exceeded memory limit (41943176 > 41943040), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: *** JOB 82771 CANCELLED AT 2014-04-22T14:19:04 ***
slurmstepd: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.

At the very least, they would like the job status or exit code in the email, so they can see at a glance which jobs need a second look. Ideally the email would include as much detail as possible.

I don't see any way of achieving that as is. Any suggestions or workarounds?
Comment 1 David Bigagli 2014-04-22 07:36:01 MDT
Josko, let me look into this and get back to you.

David
Comment 2 Moe Jette 2014-04-25 09:02:05 MDT
I've added a job's exit state (COMPLETED, FAILED, NODE_FAIL, etc.) plus its exit code to the email message, for example:

SLURM Job_id=200 Name=tmp Failed, Run time 00:00:01, FAILED, ExitCode 3

This change will be in v14.03.2 when released, probably next week.

https://github.com/SchedMD/slurm/commit/e8dce1a7ce6d812a8ca0c52a9d4e76e1a6f2c02e
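
As a rough illustration of how a user might act on the extended subject line shown above, the following Python sketch parses a subject and flags jobs that need a second look. The regular expression is inferred only from the two subject examples quoted in this ticket, so the exact field layout for other events or Slurm versions is an assumption; `parse_subject` and `needs_second_look` are hypothetical helper names, not part of Slurm.

```python
import re

# Pattern inferred from the two subject lines quoted in this ticket (an
# assumption, not a documented format). The trailing state and exit code
# are optional so that pre-14.03.2 subjects still match.
SUBJECT_RE = re.compile(
    r"^SLURM Job_id=(?P<job_id>\d+) Name=(?P<name>.+?) (?P<event>\w+), "
    r"Run time (?P<run_time>[\d:-]+)"
    r"(?:, (?P<state>[A-Z_]+), ExitCode (?P<exit_code>\d+))?$"
)


def parse_subject(subject):
    """Return the subject's fields as a dict, or None if it does not match."""
    # Mail clients may wrap long subjects, so collapse all whitespace first.
    normalized = " ".join(subject.split())
    match = SUBJECT_RE.match(normalized)
    if match is None:
        return None
    info = match.groupdict()
    if info["exit_code"] is not None:
        info["exit_code"] = int(info["exit_code"])
    return info


def needs_second_look(subject):
    """Flag any job that the subject does not positively report as successful."""
    info = parse_subject(subject)
    if info is None or info["state"] is None:
        # Unrecognized or old-style subject: success cannot be confirmed.
        return True
    return info["state"] != "COMPLETED" or info["exit_code"] != 0


if __name__ == "__main__":
    new_style = "SLURM Job_id=200 Name=tmp Failed, Run time 00:00:01, FAILED, ExitCode 3"
    print(parse_subject(new_style))
    print(needs_second_look(new_style))
```

Treating an old-style subject (no state or exit code) as needing a second look is a deliberate, conservative choice in this sketch: such a subject cannot confirm that the job succeeded.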
Comment 3 Josko Plazonic 2014-04-28 01:23:34 MDT
Perfect, thanks!