Ticket 737 - More detail in emails from slurm
Summary: More detail in emails from slurm
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.03.0
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Moe Jette
Reported: 2014-04-22 07:33 MDT by Josko Plazonic
Modified: 2014-04-28 01:23 MDT
Site: Princeton (PICSciE)
Version Fixed: 14.03.2


Description Josko Plazonic 2014-04-22 07:33:24 MDT
Our users have been complaining that Slurm's emails give them very little information. For example, this is what one user reported:

The emails from slurm are much less informative, with empty body and a
title like "SLURM Job_id=82771 Name=ovlp-gf-P12-7 Ended, Run time
13:13:32". More importantly, it does NOT report the fact that the job
actually failed, as shown in log:
slurmstepd: Job 82771 exceeded memory limit (41943176 > 41943040), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: *** JOB 82771 CANCELLED AT 2014-04-22T14:19:04 ***
slurmstepd: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.

At the very least, they would like the job status or exit code, or something else they can look at to see which jobs need a second look. Ideally, the email would include as much detail as possible.

I don't see any way of achieving that as is. Any suggestions or workarounds?
Comment 1 David Bigagli 2014-04-22 07:36:01 MDT
Josko, let me look into this and get back to you.

David
Comment 2 Moe Jette 2014-04-25 09:02:05 MDT
I've added a job's exit state (COMPLETED, FAILED, NODE_FAIL, etc.) plus its exit code to the email subject line, for example:

SLURM Job_id=200 Name=tmp Failed, Run time 00:00:01, FAILED, ExitCode 3

This change will be in v14.03.2 when released, probably next week.

https://github.com/SchedMD/slurm/commit/e8dce1a7ce6d812a8ca0c52a9d4e76e1a6f2c02e
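
For users who filter these notifications automatically, the subject line can be parsed to flag jobs that need a second look. The following is only a minimal sketch: the regular expression is modelled on the two example subjects quoted in this ticket, so the exact format in other Slurm versions may differ, and the helper names parse_subject and needs_second_look are hypothetical, not part of Slurm.

```python
import re
from typing import Optional

# Assumption: subjects follow the two examples quoted in this ticket.
# The trailing ", <STATE>, ExitCode <n>" part is optional so that
# older (pre-14.03.2) subjects without it still match.
SUBJECT_RE = re.compile(
    r"^SLURM Job_id=(?P<job_id>\d+) Name=(?P<name>.+?) "
    r"(?P<event>\w+), Run time (?P<run_time>[\d:-]+)"
    r"(?:, (?P<state>[A-Z_]+), ExitCode (?P<exit_code>\d+))?$"
)


def parse_subject(subject: str) -> Optional[dict]:
    """Split a Slurm notification subject into its fields (hypothetical helper)."""
    match = SUBJECT_RE.match(subject.strip())
    if match is None:
        return None
    info = match.groupdict()
    if info["exit_code"] is not None:
        info["exit_code"] = int(info["exit_code"])
    return info


def needs_second_look(subject: str) -> bool:
    """Flag a job unless the subject clearly reports COMPLETED with exit code 0."""
    info = parse_subject(subject)
    if info is None or info["state"] is None:
        # Unparseable or old-style subject: state unknown, so flag it.
        return True
    return info["state"] != "COMPLETED" or info["exit_code"] != 0


if __name__ == "__main__":
    example = "SLURM Job_id=200 Name=tmp Failed, Run time 00:00:01, FAILED, ExitCode 3"
    print(parse_subject(example))
    print(needs_second_look(example))
```

Old-style subjects without a state are flagged deliberately, since they carry no indication of whether the job succeeded.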
Comment 3 Josko Plazonic 2014-04-28 01:23:34 MDT
Perfect, thanks!