Ticket 17051

Summary: How does SLURM report that a job was killed because of a MEM constraint violation?
Product: Slurm
Reporter: Brent G <brent.gawryluik>
Component: Configuration
Assignee: Benjamin Witham <benjamin.witham>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: benjamin.witham
Version: 22.05.8
Hardware: Linux
OS: Linux
Site: Recursion Pharma
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Brent G 2023-06-26 15:45:42 MDT
We recently enabled memory constraints and have confirmed in testing that Slurm OOM-kills jobs that violate them. The problem is that we can't find anything concrete in the logs that explains why a job was killed...

scontrol show job says job status is FAILED but it doesn't tell you why.

The job's stderr output says that a PID was killed but not why.

There is nothing in the node's slurmd log, nor can I find anything specific in the slurmctld log...

I ask because I suspect we are going to start getting many new support tickets from users asking why their compute jobs suddenly failed.

Any info here would be greatly appreciated.
Comment 1 Benjamin Witham 2023-06-26 16:13:50 MDT
Hello Brent,

Slurm should report why a job failed in the Reason field of `scontrol show job`. Are you not seeing this in your scontrol output? Do you need more detail about the job failure, or only confirmation that the job ran out of memory?

The complete list of reason codes can be found here:
> https://slurm.schedmd.com/resource_limits.html
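
For triage, the state and reason mentioned above can be pulled out of `scontrol show job` output with a short script. What follows is a minimal sketch and not something taken from this ticket: the helper names (`parse_scontrol_job`, `failure_summary`, `show_job`) are hypothetical, the sample text is an illustrative stand-in rather than captured output, and the exact `JobState` and `Reason` values shown for an OOM kill are assumptions that should be checked against your own cluster.

```python
import subprocess


def parse_scontrol_job(text):
    """Collect the KEY=VALUE tokens of `scontrol show job` output into a dict.

    Values containing spaces are truncated at the first space, which is
    acceptable here because JobState and Reason are single tokens.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            # Keep the first occurrence if a key appears more than once.
            fields.setdefault(key, value)
    return fields


def failure_summary(text):
    """Return (JobState, Reason), or "UNKNOWN" for a missing field."""
    fields = parse_scontrol_job(text)
    return fields.get("JobState", "UNKNOWN"), fields.get("Reason", "UNKNOWN")


def show_job(job_id):
    """Run `scontrol show job <job_id>` and return its stdout (needs Slurm)."""
    result = subprocess.run(
        ["scontrol", "show", "job", str(job_id)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative sample only; the values are assumed, not copied from a cluster.
SAMPLE = """\
JobId=1234 JobName=train
   UserId=alice(1000) GroupId=alice(1000)
   JobState=OUT_OF_MEMORY Reason=OutOfMemory Dependency=(null)
"""

if __name__ == "__main__":
    state, reason = failure_summary(SAMPLE)
    print(f"state={state} reason={reason}")
```

On a node with the Slurm client tools installed, `failure_summary(show_job(job_id))` would apply the same parsing to a live job. Note that `scontrol show job` only knows about a finished job for a limited time after it ends.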
Comment 2 Brent G 2023-06-26 20:40:42 MDT
Thank you for the quick response. We will test this again tomorrow and look more closely at the `scontrol` output.
Comment 3 Benjamin Witham 2023-07-05 13:20:17 MDT
Hello Brent, 

Just checking in to see whether scontrol is now displaying the OOM kill reason for you. If so, I'll go ahead and close this ticket.
Comment 4 Benjamin Witham 2023-07-18 09:07:23 MDT
Closing ticket