Ticket 21733 - Out of Memory report doesn't work
Summary: Out of Memory report doesn't work
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 24.05.5
Hardware: Linux Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-01-01 20:51 MST by Jo Rhett
Modified: 2025-01-03 14:38 MST

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Jo Rhett 2025-01-01 20:51:44 MST
Simply put, queries for OUT_OF_MEMORY do not return results even when such events have occurred:

$ sacct --state=oom --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=now-6hours --end=now
     User JobID           JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ----------

$ sacct --state=out_of_memory --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=now-6hours --end=now
     User JobID           JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ----------

$ sacct -N exo10 --state=f --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=now-6hours --end=now-5hours
     User JobID           JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ----------
ci_runner 699462       sau:cov_r+        40G                       2025-01-01T14:01:20 2025-01-01T14:03:13     FAILED
          699462.batch      batch                 4660K          0 2025-01-01T14:01:20 2025-01-01T14:03:13     FAILED
          699462.exte+     extern                  256K          0 2025-01-01T14:01:20 2025-01-01T14:03:13  COMPLETED
          699462.0           make             34425392K          0 2025-01-01T14:01:20 2025-01-01T14:03:13 OUT_OF_ME+
ci_runner 699500       ipu_wrapp+        20G                       2025-01-01T14:27:11 2025-01-01T14:28:22     FAILED
          699500.batch      batch                 4912K          0 2025-01-01T14:27:11 2025-01-01T14:28:22     FAILED
          699500.exte+     extern                     0          0 2025-01-01T14:27:11 2025-01-01T14:28:22  COMPLETED
          699500.0           make                54504K          0 2025-01-01T14:27:11 2025-01-01T14:28:22  CANCELLED
Comment 1 Ricard Zarco Badia 2025-01-02 08:06:01 MST
Hello Jo,

The example you have provided is expected behavior. The --state flag filters results at the job level, not the step level. You will see similar behavior with other job states, specifically in cases where the job state is not the same as that of all its steps.

Best regards, Ricard.
Comment 2 Jo Rhett 2025-01-02 09:12:05 MST
This has got to be the most useless reply I have ever received. You saw my need, and all you told me was that the query doesn't work. I knew this already, thus the ticket.

Please tell me how I can see all jobs that have failed for out of memory.
Comment 3 Jo Rhett 2025-01-02 09:33:40 MST
I would like this issue escalated. Not only was the response I received insulting, but it defies your own documentation

From https://slurm.schedmd.com/sacct.html

> -s, --state=<state_list>
> Selects jobs based on their state during the time period given. Unless otherwise specified, the start and end time will be the current time when the --state option is specified and only currently running jobs can be displayed. A start and/or end time must be specified to view information about jobs not currently running. See the JOB STATE CODES section below for a list of state designators. 

From that section:

> JOB STATE CODES

> The following states are recognized by sacct. A full list of possible states is available at <https://slurm.schedmd.com/job_state_codes.html>.

...

> OOM OUT_OF_MEMORY Job experienced out of memory error.


So your documentation clearly states that a job can have an OOM status. sacct accepts this query (as opposed to names not in this list), and yet Ricard insists that the step state will never be matched.

1. Your docs are wrong
2. Your response was the least helpful you could possibly have been, short of not responding at all
Comment 4 Jason Booth 2025-01-02 11:16:44 MST
Jo, Ricard brought this issue to my attention.

After reviewing his response, I do not see any insult or malice in his reply.

I do understand your frustration with the documentation as it relates to the --state definition. With that said, Ricard is correct: --state refers to the job, not the step. A job can have hundreds of steps, and in some workflows a step might be expected to fail as long as the job completes correctly. For example, the job below runs a salloc and submits a step that OOMs. Then the job runs another srun to query hostname. The entire job completes and has its global state recorded as COMPLETED.

I do ask that you consider the tone in which you interact with our support. It would be better to work with us by requesting an enhancement or improvement to the documentation rather than the hostile escalation method you have chosen for this ticket.
Additionally, ticket severity should reflect impact to the cluster, and escalations for minor issues like this do not help your case.


[Example]

[jason@nh-grey 24.11]$ salloc -t 250 --mem=100MB -c 2
salloc: Granted job allocation 5043
salloc: Nodes n1 are ready for job
srun: ROUTE: split_hostlist: hl=n1 tree_width 16
(salloc) [jason@nh-grey 24.11]$ sacct -j 5043
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5043         interacti+       defq    schedmd          2    RUNNING      0:0
5043.intera+ interacti+               schedmd          2    RUNNING      0:0
5043.extern      extern               schedmd          2    RUNNING      0:0
[jason@nh-grey 24.11]$ srun ~/tools/eat_mem/eat_while 50
srun: ROUTE: split_hostlist: hl=n1 tree_width 16
pid: 120081, Tot mem=10mb
pid: 120081, Tot mem=20mb
pid: 120081, Tot mem=30mb
pid: 120081, Tot mem=40mb
pid: 120081, Tot mem=50mb
pid: 120081, Tot mem=60mb
pid: 120081, Tot mem=70mb
pid: 120081, Tot mem=80mb
slurmstepd-n1: error: Detected 1 oom_kill event in StepId=5043.0. Some of the step tasks have been OOM Killed.
srun: error: n1: task 0: Out Of Memory
srun: Terminating StepId=5043.0
(salloc) [jason@nh-grey 24.11]$ srun hostname
srun: ROUTE: split_hostlist: hl=n1 tree_width 16
nh-grey
[jason@nh-grey 24.11]$ sacct -j 5043
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5043         interacti+       defq    schedmd          2    RUNNING      0:0
5043.intera+ interacti+               schedmd          2    RUNNING      0:0
5043.extern      extern               schedmd          2    RUNNING      0:0
5043.0        eat_while               schedmd          2 OUT_OF_ME+    0:125
5043.1         hostname               schedmd          2  COMPLETED      0:0
(salloc) [jason@nh-grey 24.11]$ exit

$ sacct -j 5043
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5043         interacti+       defq    schedmd          2  COMPLETED      0:0
5043.intera+ interacti+               schedmd          2  COMPLETED      0:0
5043.extern      extern               schedmd          2  COMPLETED      0:0
5043.0        eat_while               schedmd          2 OUT_OF_ME+    0:125
5043.1         hostname               schedmd          2  COMPLETED      0:0
Comment 5 Jo Rhett 2025-01-02 12:43:18 MST
(In reply to Jason Booth from comment #4)
> After reviewing his response, I do not see any insult or malice with his
> reply.  

Deliberately avoiding answering the need can be (and has been) seen as insulting to the requestor, and a malicious waste of the requestor's time. Responses like this are given as examples of hostile behavior in every community that has a posted guide.

> I do understand your frustration with the documentation as it relates to the
> --state definition. With that said, Ricard is correct
> the --state refers to the job and not step. Since a job can have hundreds of
> steps and with some workflows a step might be expected to
> fail as long as the job completes correctly.
 
You have justified that it is plausible for a job to have different statuses for different steps. But the fact that multiple steps are possible has no bearing on why jobs with a single step cannot be retrieved by query.

> For example, the below job runs
> a salloc and submits a step that OOM's. Then the job runs
> another srun to query hostname. The entire job completes and has its global
> state recorded as completed. 

How does this relate to the job posted in the problem report, exactly?

We aren't dealing with some theoretical situation with multiple steps. This job has one step. That one step failed with OOM. We cannot retrieve that from a query.

I completely understand that someone somewhere might have some more complex situation whereby the state might be different. But this situation is not that, and in your example you've doubled down on a situation unlike ours, and somehow again completely avoided answering the real question.

We have a job with a single step, with a single step state, that clearly failed for OOM. Why am I unable to retrieve this from a query?

Therefore, you have doubled down on refusing to answer the question asked, which is how to query jobs that failed with OUT_OF_MEMORY.

> I do ask that you consider the voice in which you interact with our support.
> It would be better to work with us by requesting an enhancement or
> improvement to the documentation rather than the hostile escalation method
> you have chooses for this ticket. 

I would ask that, when we are paying you for support, the support attempt to help us rather than toss out theoretical situations with zero bearing on the problem reported. This is wasting my time, as has this entire conversation so far.

It would be really nice if you would return to the problem reported and tell us how we can retrieve a report of OOM events to solve our IMMEDIATE, BREAKING and HIGH PRIORITY need.

> Additionally ticket severity should reflect impact to the cluster, and
> escalations for minor issues like this do help your case. 

This need is very high impact, is breaking mission-critical jobs on our cluster, and matters at the level of an emergency for us. Continuing to downplay this while refusing to assist us is undeniably HOSTILE
Comment 6 Ricard Zarco Badia 2025-01-03 11:21:48 MST
Hello Jo,

First of all, I want to make clear that there was no ill intent behind my first response. It is not uncommon for us to receive reports of potential bugs in Slurm's features from customers because they noticed something at some point during their workflow. That response was focused on investigating whether the feature was working as expected and reporting my findings. I am sorry that I initially did not pick this up as an ongoing need, but we can steer the conversation towards that.

Having said that, I will first provide a TL;DR if you wish to go straight to the point without having to go over all the technicalities of job/step state handling.

-- TL;DR --
Since the OUT_OF_MEMORY state will not necessarily be present as a *job* state when one of its steps encounters an OOM event, the easiest approach right now would be to skip the "--state" flag altogether, do a general sacct query over your desired time frame, and pipe that to a grep for the state you need. A basic example using your command as a base would be this:

>> $ sacct --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=<start_time> --end=<end_time> | grep OUT_OF_ME

You will get the exact steps reported as OUT_OF_MEMORY, and you can extract the jobID from them either visually or programmatically if you so choose. After that, you can analyze your desired jobs as a whole (via "sacct -j", for example). It is not the most elegant approach and there are other avenues (more on that in points 9 and 10 of the section below), but this is functional and relatively hassle-free. If this is still not enough for your needs, let me know.
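
If it helps, here is a rough sketch of the programmatic version of the same idea. It only uses standard sacct flags plus awk/sort/xargs; the time window and output columns are placeholders to adjust to your case:

>> $ sacct --noheader --parsable2 --format=JobID,State --start=now-6hours --end=now \
>>     | awk -F'|' '$2 ~ /OUT_OF_MEMORY/ {sub(/\..*/, "", $1); print $1}' \
>>     | sort -u \
>>     | xargs -r -I{} sacct -j {} --format=User,JobID,JobName,ReqMem,MaxRSS,State

The awk call keeps only steps whose state is OUT_OF_MEMORY and strips the step suffix (".batch", ".0", etc.), so after sort -u you are left with unique parent job IDs, and the final sacct prints each of those jobs in full.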



-- Detailed explanation on job/step state handling --
I will start going from the basic facts to their derived implications, to avoid logic gaps in the narrative:

1. A job does not start processes by itself. By definition, it is a collection of steps and tasks (which can also run in parallel). A summary with further details about how jobs, steps, tasks and other related components fit together can be found here [1].

2. For a job, the finalization state is registered at step level. Steps can have parallel tasks and some of those can trigger an OOM event.

3. There are some considerations to take into account about the reliability of steps being accurately tagged as OOM, but since you are using cgroup/v2, if I am not mistaken, we can skip those. When using cgroup/v2 (and if the kernel supports it), we can accurately detect whether a step has produced an OOM event by checking the oom_kill field of its memory.events cgroup file (a rough sketch of such a check is included after point 10).

4. Since a numbered step (<jobID>.<number>) is almost always where the real user workload resides (i.e. spawned by an srun command either directly or from a batch script), it is the most important one to check, normally through sacct. However, it is perfectly possible to trigger an OOM in non-numbered steps, like a batch step for example.

5. The state of the job will be interpreted as the state *of the last step in the allocation*. This is important, and it implies that the job state is sensitive to step execution order.

6. For a job spawned via sbatch, the last step is the batch step. Even if something has been launched from the batch step at some point through srun and ends up with an OOM, the batch step as a whole will have a state based on its *final exit code*. This is the reason why you can see jobs with COMPLETED or FAILED states, but only see the OOM reported in one of their steps. This applies even for single-step *batch* jobs, since the batch step will still be the last step.

7. In these cases, it is up to the user to add error checking in their batch scripts to accurately mark a whole job as failed or completed depending on their use case (a rough sketch of this kind of error propagation is included after point 10). The failure of some steps within a job can be viewed as acceptable (even expected) for some workloads without meaning that the job as a whole has failed. This is the reasoning behind the whole mechanism explained in point 6.

8. For jobs spawned directly through srun, a batch step will not be present. The state of the job will be the state of its numbered step, so these should be able to be retrieved with the "--state" sacct option.

9. That being said, you might be interested in using a combination of ExitCode [2] and DerivedExitCode [3] in your sacct output to have a quick glance at job/step error codes and signals (if they had a role to play in the step termination). 

10. If you need advanced data parsing, it would be a good idea to check out the "--json" flag for sacct [4] (or "--yaml" if that is more your cup of tea); a rough jq sketch is included below. You can also query the REST API [5], but I would stick with sacct if it suffices for your needs for now.
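
As a rough sketch of the check from point 3: the exact cgroup path depends on your cgroup/v2 layout (the slurmstepd hierarchy below is an assumption, adjust to your system), and the counters only exist while a step's cgroup is still present, i.e. while the step is running:

# Print every memory.events file under the Slurm step hierarchy whose
# oom_kill counter is non-zero (at least one task was OOM-killed).
find /sys/fs/cgroup -path '*slurmstepd*' -name memory.events \
     -exec grep -H 'oom_kill [1-9]' {} + 2>/dev/null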
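
As a rough sketch of the error checking mentioned in point 7 (this is not from your jobs, just a hypothetical batch script): propagating the srun step's exit code makes the batch step, and therefore the job-level state, record FAILED instead of COMPLETED. Note that the job state still will not become OUT_OF_MEMORY; that state stays on the step.

#!/bin/bash
#SBATCH --mem=40G
# Run the real workload as a step; "make" here is just a stand-in.
srun make
rc=$?
if [ "$rc" -ne 0 ]; then
    # Surface the failure so the batch step (and the job) is marked FAILED.
    echo "step exited with code $rc" >&2
    exit "$rc"
fi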
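
And a rough jq sketch for point 10: this deliberately avoids assuming the exact JSON schema (which can vary between Slurm releases) and simply keeps any object in the output whose "state" field mentions OUT_OF_MEMORY, so please verify it against your own --json output first:

>> $ sacct --json --start=now-6hours --end=now \
>>     | jq '[.. | objects | select(has("state")) | select((.state | tostring) | test("OUT_OF_MEMORY"))]'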

I think this covers just about everything related to the topic at hand. Let me know if there are any doubts.

Best regards, Ricard.

[1] https://slurm.schedmd.com/job_launch.html
[2] https://slurm.schedmd.com/sacct.html#OPT_ExitCode
[3] https://slurm.schedmd.com/sacct.html#OPT_DerivedExitCode
[4] https://slurm.schedmd.com/sacct.html#OPT_json
[5] https://slurm.schedmd.com/rest.html
Comment 8 Jo Rhett 2025-01-03 12:45:40 MST
(In reply to Ricard Zarco Badia from comment #6)
> Since the OUT_OF_MEMORY state will not necessarily be present as a *job*
> state when one of its steps encounters an OOM event, 

Is there any way to rectify this, to ensure that the job state does change to OOM?  This would seem by far to be an easier fix.

> >> $ sacct --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=<start_time> --end=<end_time> | grep OUT_OF_ME
> 
> You will get the exact steps reported as OUT_OF_MEMORY, and you can extract
> the jobID from them either visually or programmatically if you choose so.

This is basically what we're doing now by dumping to JSON and extracting in one go. We'd really prefer something less custom and easier to document for others.

> -- Detailed explanation on job/step state handling --
> I will start going from the basic facts to their derived implications, to
> avoid logic gaps in the narrative:

I deeply appreciate you taking the time to share this information. It really does help us understand. Thank you.
 
> 2. For a job, the finalization state is registered at step level. Steps can
> have parallel tasks and some of those can trigger an OOM event.

99.9% of our jobs at this time are single task

> 3. There are some considerations to take into account about the reliability
> of steps being accurately tagged as OOM, but since you are using cgroup/v2
> if I am not mistaken, we can skip this. When using cgroup/v2 (and if the
> kernel supports it), we can accurately detect if a step has produced an OOM
> event by checking the oom_kill field of its memory.events cgroup file.

Yep, that's how we're putting together OOM reports right now, somewhat manually.
 
> 5. The state of the job will be interpreted as the state *of the last step
> in the allocation*. This is important, and it implies that the job state is
> sensitive to step execution order.

Hm... the OOMed step is the ONLY step in every one I've looked at.

> 6. For a job spawned via sbatch, the last step is the batch step. Even if
...
> This  applies even for single-step *batch* jobs, since the batch step
> will still be the last step.

Ah, interesting.

> 7. In these cases, it is up to the user to add error checking in their batch
> scripts to accurately mark a whole job as failed or completed depending on
> their use case. The failure of some steps within a job can be viewed as
> acceptable (even expected) for some workloads without meaning that the job
> as a whole has failed. This is the reasoning behind the whole mechanism
> explained in point 6.

TTBOMK the few times we run sbatch here, it's a single command invocation just like srun but without maintaining the open connection (usually from GitHub Actions).

> 9. That being said, you might be interested in using a combination of
> ExitCode [2] and DerivedExitCode [3] in your sacct output to have a quick
> glance at job/step error codes and signals (if they had a role to play in
> the step termination). 

Every OOM I can find shows "0:125" for both

     User JobID           JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State ExitCode DerivedExitCode
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------------
ci_runner 724181       sau:cov_r+        40G                       2025-01-03T10:46:23 2025-01-03T10:48:13     FAILED      1:0           0:125
          724181.batch      batch                 4660K          0 2025-01-03T10:46:23 2025-01-03T10:48:13     FAILED      1:0
          724181.exte+     extern                     0          0 2025-01-03T10:46:23 2025-01-03T10:48:13  COMPLETED      0:0
          724181.0           make             35340188K          0 2025-01-03T10:46:23 2025-01-03T10:48:13 OUT_OF_ME+    0:125

Does this mean it would be more effective to search for "--derived-exit-code=0:125" (totally making up an argument that might not exist here) or something similar?

> 10. If you need advanced data parsing, it would be a good idea to check out
> the "--json" flag for sacct [4] (or "--yaml" if that is more your cup of
> tea). You can also query the rest API [5], but I would stick with sacct if
> it suffices your needs for now.

We make extensive use of --json. Even after rebuilding the packages with YAML support, most commands don't support it, so we've given up trying to use it.