Our users have been complaining that they do not get much info from slurm's emails. E.g. this is what a user reported:

The emails from slurm are much less informative, with empty body and a title like "SLURM Job_id=82771 Name=ovlp-gf-P12-7 Ended, Run time 13:13:32". More importantly, it does NOT report the fact that the job actually failed, as shown in log:

    slurmstepd: Job 82771 exceeded memory limit (41943176 > 41943040), being killed
    slurmstepd: Exceeded job memory limit
    slurmstepd: *** JOB 82771 CANCELLED AT 2014-04-22T14:19:04 ***
    slurmstepd: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.

They would like at the very least job status/exit code or something that they can look at and see which jobs need a 2nd look - ideally as much as possible. I don't see any way of achieving that as is, any suggestions or workarounds?
Josko, let me look into this and get back to you.

David
I've added a job's exit state (COMPLETED, FAILED, NODE_FAIL, etc.) plus its exit code to the email message, for example:

    SLURM Job_id=200 Name=tmp Failed, Run time 00:00:01, FAILED, ExitCode 3

This change will be in v14.03.2 when released, probably next week.
https://github.com/SchedMD/slurm/commit/e8dce1a7ce6d812a8ca0c52a9d4e76e1a6f2c02e
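For anyone who wants to filter these emails automatically, here is a rough sketch of how the new subject line might be parsed to flag jobs that need a second look. The regular expression is inferred only from the two example subjects in this report, so treat it as an assumption rather than a description of the exact format; `parse_subject` and `needs_second_look` are hypothetical helper names, and job names containing spaces are not handled.

```python
import re

# Pattern inferred from the example subjects in this report (an assumption,
# not a format specification). The state/exit-code tail is optional so that
# subjects from versions without the change still parse.
SUBJECT_RE = re.compile(
    r"^SLURM Job_id=(?P<job_id>\d+) Name=(?P<name>\S+) (?P<event>[^,]+), "
    r"Run time (?P<run_time>[\d:-]+)"
    r"(?:, (?P<state>[A-Z_]+), ExitCode (?P<exit_code>\d+))?$"
)


def parse_subject(subject):
    """Return a dict of subject fields, or None if the subject does not match."""
    match = SUBJECT_RE.match(subject.strip())
    if match is None:
        return None
    info = match.groupdict()
    if info["exit_code"] is not None:
        info["exit_code"] = int(info["exit_code"])
    return info


def needs_second_look(subject):
    """Flag a job unless the subject clearly reports COMPLETED with exit code 0."""
    info = parse_subject(subject)
    if info is None or info["state"] is None:
        # Unknown or old-style subject: the outcome cannot be determined.
        return True
    return info["state"] != "COMPLETED" or info["exit_code"] != 0


if __name__ == "__main__":
    example = "SLURM Job_id=200 Name=tmp Failed, Run time 00:00:01, FAILED, ExitCode 3"
    print(parse_subject(example))
    print(needs_second_look(example))
```

Old-style subjects without a state are flagged deliberately, since they carry no information about whether the job succeeded.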
Perfect, thanks!