Simply put, queries for OUT_OF_MEMORY do not return results even when the events have occurred.

$ sacct --state=oom --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=now-6hours --end=now
     User        JobID    JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- 

$ sacct --state=out_of_memory --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=now-6hours --end=now
     User        JobID    JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- 

$ sacct -N exo10 --state=f --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=now-6hours --end=now-5hours
     User        JobID    JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- 
ci_runner       699462 sau:cov_r+        40G                       2025-01-01T14:01:20 2025-01-01T14:03:13     FAILED 
          699462.batch      batch                 4660K          0 2025-01-01T14:01:20 2025-01-01T14:03:13     FAILED 
          699462.exte+     extern                  256K          0 2025-01-01T14:01:20 2025-01-01T14:03:13  COMPLETED 
              699462.0       make             34425392K          0 2025-01-01T14:01:20 2025-01-01T14:03:13 OUT_OF_ME+ 
ci_runner       699500 ipu_wrapp+        20G                       2025-01-01T14:27:11 2025-01-01T14:28:22     FAILED 
          699500.batch      batch                 4912K          0 2025-01-01T14:27:11 2025-01-01T14:28:22     FAILED 
          699500.exte+     extern                     0          0 2025-01-01T14:27:11 2025-01-01T14:28:22  COMPLETED 
              699500.0       make                54504K          0 2025-01-01T14:27:11 2025-01-01T14:28:22  CANCELLED 
Hello Jo,

The example you have provided shows expected behavior. The --state flag filters results at the job level, not the step level. You will see similar behavior with other job states, specifically in cases where a job's state does not match the state of all of its steps.

Best regards,
Ricard.
This has got to be the most useless reply I have ever received. You saw my need, and all you told me was that my query wasn't going to work. I knew this already, hence the ticket. Please tell me how I can see all jobs that have failed for out of memory.
I would like this issue escalated. Not only was the response I received insulting, but it also defies your own documentation.

From https://slurm.schedmd.com/sacct.html:

> -s, --state=<state_list>
> Selects jobs based on their state during the time period given. Unless otherwise specified, the start and end time will be the current time when the --state option is specified and only currently running jobs can be displayed. A start and/or end time must be specified to view information about jobs not currently running. See the JOB STATE CODES section below for a list of state designators.

From that section:

> JOB STATE CODES
> The following states are recognized by sacct. A full list of possible states is available at <https://slurm.schedmd.com/job_state_codes.html>.
> ...
> OOM OUT_OF_MEMORY Job experienced out of memory error.

So your documentation clearly states that a job can have an OOM status. Sacct accepts this query (as opposed to names not in this list), and yet Ricard insists that the step state will never be matched.

1. Your docs are wrong.
2. Your response was the least helpful you could possibly have been short of not responding at all.
Jo,

Ricard brought this issue to my attention. After reviewing his response, I do not see any insult or malice in his reply. I do understand your frustration with the documentation as it relates to the --state definition. With that said, Ricard is correct: --state refers to the job and not the step. A job can have hundreds of steps, and with some workflows a step might be expected to fail as long as the job completes correctly. For example, the job below runs a salloc and submits a step that OOMs. Then the job runs another srun to query hostname. The entire job completes and has its global state recorded as COMPLETED.

I do ask that you consider the tone in which you interact with our support. It would be better to work with us by requesting an enhancement or improvement to the documentation rather than the hostile escalation method you have chosen for this ticket. Additionally, ticket severity should reflect impact to the cluster, and escalations for minor issues like this do not help your case.

[Example]
[jason@nh-grey 24.11]$ salloc -t 250 --mem=100MB -c 2
salloc: Granted job allocation 5043
salloc: Nodes n1 are ready for job
srun: ROUTE: split_hostlist: hl=n1 tree_width 16

(salloc) [jason@nh-grey 24.11]$ sacct -j 5043
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
5043         interacti+       defq    schedmd          2    RUNNING      0:0 
5043.intera+ interacti+              schedmd          2    RUNNING      0:0 
5043.extern      extern              schedmd          2    RUNNING      0:0 

[jason@nh-grey 24.11]$ srun ~/tools/eat_mem/eat_while 50
srun: ROUTE: split_hostlist: hl=n1 tree_width 16
pid: 120081, Tot mem=10mb
pid: 120081, Tot mem=20mb
pid: 120081, Tot mem=30mb
pid: 120081, Tot mem=40mb
pid: 120081, Tot mem=50mb
pid: 120081, Tot mem=60mb
pid: 120081, Tot mem=70mb
pid: 120081, Tot mem=80mb
slurmstepd-n1: error: Detected 1 oom_kill event in StepId=5043.0. Some of the step tasks have been OOM Killed.
srun: error: n1: task 0: Out Of Memory
srun: Terminating StepId=5043.0

(salloc) [jason@nh-grey 24.11]$ srun hostname
srun: ROUTE: split_hostlist: hl=n1 tree_width 16
nh-grey

[jason@nh-grey 24.11]$ sacct -j 5043
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
5043         interacti+       defq    schedmd          2    RUNNING      0:0 
5043.intera+ interacti+              schedmd          2    RUNNING      0:0 
5043.extern      extern              schedmd          2    RUNNING      0:0 
5043.0        eat_while              schedmd          2 OUT_OF_ME+    0:125 
5043.1         hostname              schedmd          2  COMPLETED      0:0 

(salloc) [jason@nh-grey 24.11]$ exit

$ sacct -j 5043
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
5043         interacti+       defq    schedmd          2  COMPLETED      0:0 
5043.intera+ interacti+              schedmd          2  COMPLETED      0:0 
5043.extern      extern              schedmd          2  COMPLETED      0:0 
5043.0        eat_while              schedmd          2 OUT_OF_ME+    0:125 
5043.1         hostname              schedmd          2  COMPLETED      0:0 
(In reply to Jason Booth from comment #4)

> After reviewing his response, I do not see any insult or malice in his reply.

Deliberately avoiding answering the need can be (and has been) seen as insulting to the requestor, and a malicious waste of the requestor's time. Responses like this are given as examples of hostile behavior in every community that has a posted guide.

> I do understand your frustration with the documentation as it relates to the --state definition. With that said, Ricard is correct: --state refers to the job and not the step. A job can have hundreds of steps, and with some workflows a step might be expected to fail as long as the job completes correctly.

You have justified that it is plausible for a job to have multiple statuses across different steps. The fact that it is plausible for there to be multiple steps has no bearing on why jobs with a single step cannot be retrieved by query.

> For example, the job below runs a salloc and submits a step that OOMs. Then the job runs another srun to query hostname. The entire job completes and has its global state recorded as COMPLETED.

How does that relate to the job posted in the problem report, exactly? We aren't dealing with some theoretical situation with multiple steps. This job has one step. That one step failed with OOM. We cannot retrieve that from a query. I completely understand that someone somewhere might have some more complex situation whereby the state might be different. But this situation is not that, and in your example you've doubled down on a situation unlike ours, and somehow again completely avoided answering the real question. We have a job with a single step, with a single step state, that clearly failed for OOM. Why am I unable to retrieve this from a query? Therefore, you have doubled down on refusing to answer the question asked, which is how to query jobs that failed with OUT_OF_MEMORY.

> I do ask that you consider the tone in which you interact with our support. It would be better to work with us by requesting an enhancement or improvement to the documentation rather than the hostile escalation method you have chosen for this ticket.

I would ask that when we are paying you for support, the support attempt to help us rather than toss out theoretical situations with zero bearing on the problem reported. This is wasting my time, as has this entire conversation so far. It would be really nice if you would return to the problem reported and tell us how we can retrieve a report of OOM events to solve our IMMEDIATE, BREAKING and HIGH PRIORITY need.

> Additionally, ticket severity should reflect impact to the cluster, and escalations for minor issues like this do not help your case.

This need is very high impact, is breaking mission-critical jobs on our cluster, and matters at the level of an emergency for us. Continuing to downplay this while refusing to assist us is undeniably HOSTILE.
Hello Jo,

First of all, I want to make clear that there was no ill intent behind my first response. It is not uncommon for us to receive reports of potential bugs in Slurm's features from customers because they noticed something at some point during their workflow. That response was focused solely on investigating whether the feature was working as expected and reporting my findings. I am sorry that I initially did not pick this up as an ongoing need, but we can steer this conversation towards that.

Having said that, I will first provide a TL;DR if you wish to go straight to the point without having to go over all the technicalities of job/step state handling.

-- TL;DR --

Since the OUT_OF_MEMORY state will not necessarily be present as a *job* state when one of its steps encounters an OOM event, the easiest approach right now would be to skip the "--state" flag altogether, run a general sacct over your desired time frame, and pipe that to a grep for the state you need. A basic example using your command as a base would be this:

>> $ sacct --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=<start_time> --end=<end_time> | grep OUT_OF_ME

You will get the exact steps reported as OUT_OF_MEMORY, and you can extract the jobID from them either visually or programmatically if you so choose. After that, you can analyze your desired jobs as a whole (via "sacct -j", for example).

It is not the most elegant approach and there are other avenues (more on that in point 9 and onwards in the section below), but it is functional and relatively hassle-free; a slightly expanded sketch is included at the end of this comment. If this is still not enough for your needs, let me know.

-- Detailed explanation of job/step state handling --

I will go from the basic facts to their derived implications, to avoid logic gaps in the narrative:

1. A job does not start processes by itself. By definition, it is a collection of steps and tasks (which can also be performed in parallel). A summary with further details about how jobs, steps, tasks and other related components interact can be found here [1].

2. For a job, the finalization state is registered at step level. Steps can have parallel tasks and some of those can trigger an OOM event.

3. There are some considerations to take into account about the reliability of steps being accurately tagged as OOM, but since you are using cgroup/v2 if I am not mistaken, we can skip this. When using cgroup/v2 (and if the kernel supports it), we can accurately detect if a step has produced an OOM event by checking the oom_kill field of its memory.events cgroup file.

4. Since a numbered step (<jobID>.<number>) is almost always where the real user workload resides (i.e., spawned by an srun command either directly or from a batch script), it is the most important one to check, normally through sacct. However, it is perfectly possible to trigger an OOM in non-numbered steps, like a batch step for example.

5. The state of the job will be interpreted as the state *of the last step in the allocation*. This is important, and it implies that the job state is sensitive to step execution order.

6. For a job spawned via sbatch, the last step is the batch step. Even if something launched from the batch step through srun at some point ends up with an OOM, the batch step as a whole will have a state based on its *final exit code*. This is the reason why you can see jobs with COMPLETED or FAILED states, but only see the OOM reported in one of their steps. This applies even for single-step *batch* jobs, since the batch step will still be the last step.

7. In these cases, it is up to the user to add error checking in their batch scripts to accurately mark a whole job as failed or completed depending on their use case. The failure of some steps within a job can be viewed as acceptable (even expected) for some workloads without meaning that the job as a whole has failed. This is the reasoning behind the whole mechanism explained in point 6.

8. For jobs spawned directly through srun, a batch step will not be present. The state of the job will be the state of its numbered step, so these should be retrievable with the "--state" sacct option.

9. That being said, you might be interested in using a combination of ExitCode [2] and DerivedExitCode [3] in your sacct output to have a quick glance at job/step error codes and signals (if they had a role to play in the step termination).

10. If you need advanced data parsing, it would be a good idea to check out the "--json" flag for sacct [4] (or "--yaml" if that is more your cup of tea). You can also query the REST API [5], but I would stick with sacct if it suffices for your needs for now.

I think that this covers mostly everything related to the topic at hand. Let me know if there are any doubts.

Best regards,
Ricard.

[1] https://slurm.schedmd.com/job_launch.html
[2] https://slurm.schedmd.com/sacct.html#OPT_ExitCode
[3] https://slurm.schedmd.com/sacct.html#OPT_DerivedExitCode
[4] https://slurm.schedmd.com/sacct.html#OPT_json
[5] https://slurm.schedmd.com/rest.html
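P.S. As promised above, here is a slightly expanded sketch of the grep approach that also feeds the parent job IDs of the offending steps back into sacct. Please treat it as a rough illustration rather than a tested recipe (the awk/sort/xargs plumbing is plain shell on my part, not anything sacct provides natively), and adjust the time window, format fields and user filtering to your environment:

>> $ sacct -a -n -P --format=JobID,State --start=now-6hours --end=now \
>>       | awk -F'|' '$2 ~ /OUT_OF_MEMORY/ {split($1, id, "."); print id[1]}' \
>>       | sort -u \
>>       | xargs -r -n1 sacct --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state -j

Here "-a" queries all users, "-n" suppresses the header, and "-P" switches to pipe-delimited output so the state names are not truncated, which keeps the awk match reliable. The step rows are intentionally kept in the first sacct call, since that is where the OUT_OF_MEMORY state actually lives.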
(In reply to Ricard Zarco Badia from comment #6)

> Since the OUT_OF_MEMORY state will not necessarily be present as a *job* state when one of its steps encounters an OOM event,

Is there any way to rectify this, to ensure that the job state does change to OOM? This would seem by far to be an easier fix.

> >> $ sacct --format=User,JobId,JobName,ReqMem,MaxRSS,MaxVMSize,start,end,state --start=<start_time> --end=<end_time> | grep OUT_OF_ME
>
> You will get the exact steps reported as OUT_OF_MEMORY, and you can extract the jobID from them either visually or programmatically if you so choose.

This is basically what we're doing now by dumping to JSON and extracting in one go. We'd really prefer something less custom and easier to document for others.

> -- Detailed explanation of job/step state handling --
> I will go from the basic facts to their derived implications, to avoid logic gaps in the narrative:

I deeply appreciate you taking the time to share this information. It really does help us understand. Thank you.

> 2. For a job, the finalization state is registered at step level. Steps can have parallel tasks and some of those can trigger an OOM event.

99.9% of our jobs at this time are single task.

> 3. There are some considerations to take into account about the reliability of steps being accurately tagged as OOM, but since you are using cgroup/v2 if I am not mistaken, we can skip this. When using cgroup/v2 (and if the kernel supports it), we can accurately detect if a step has produced an OOM event by checking the oom_kill field of its memory.events cgroup file.

Yep, that's how we're putting together OOM reports right now, somewhat manually.

> 5. The state of the job will be interpreted as the state *of the last step in the allocation*. This is important, and it implies that the job state is sensitive to step execution order.

Hm... the OOMed step is the ONLY step in every one I've looked at.

> 6. For a job spawned via sbatch, the last step is the batch step. Even if
...
> This applies even for single-step *batch* jobs, since the batch step will still be the last step.

Ah, interesting.

> 7. In these cases, it is up to the user to add error checking in their batch scripts to accurately mark a whole job as failed or completed depending on their use case. The failure of some steps within a job can be viewed as acceptable (even expected) for some workloads without meaning that the job as a whole has failed. This is the reasoning behind the whole mechanism explained in point 6.

TTBOMK, the few times we run sbatch here, it's a single command invocation just like srun but without maintaining the open connection (usually from GitHub Actions).

> 9. That being said, you might be interested in using a combination of ExitCode [2] and DerivedExitCode [3] in your sacct output to have a quick glance at job/step error codes and signals (if they had a role to play in the step termination).

Every OOM I can find shows "0:125" for both:

     User        JobID    JobName     ReqMem     MaxRSS  MaxVMSize               Start                 End      State ExitCode DerivedExitCode 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- --------------- 
ci_runner       724181 sau:cov_r+        40G                       2025-01-03T10:46:23 2025-01-03T10:48:13     FAILED      1:0           0:125 
          724181.batch      batch                 4660K          0 2025-01-03T10:46:23 2025-01-03T10:48:13     FAILED      1:0 
          724181.exte+     extern                     0          0 2025-01-03T10:46:23 2025-01-03T10:48:13  COMPLETED      0:0 
              724181.0       make             35340188K          0 2025-01-03T10:46:23 2025-01-03T10:48:13 OUT_OF_ME+    0:125 

Does this mean it would be more effective to search for "--derived-exit-code=0:125" (totally making up an argument that might not exist here) or something similar?
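To make that question concrete, this is roughly the filter I'm imagining, matching on the DerivedExitCode column rather than on a dedicated flag. Sketch only: the "0:125" value is just what we observe above, and I have no idea whether it is guaranteed for every OOM case.

$ sacct -a -n -P -X --format=JobID,JobName,User,DerivedExitCode,State --start=now-6hours --end=now \
      | awk -F'|' '$4 == "0:125"'

(-X limits the output to whole-job allocations, -n drops the header, and -P gives pipe-delimited output so the exit codes and states aren't truncated before the awk comparison.)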
> 10. If you need advanced data parsing, it would be a good idea to check out the "--json" flag for sacct [4] (or "--yaml" if that is more your cup of tea). You can also query the REST API [5], but I would stick with sacct if it suffices for your needs for now.

We make extensive use of --json. Even after rebuilding the packages with YAML support, most commands don't support it, so we've given up trying to use it.
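For reference, the "somewhat manually" part I mentioned above boils down to something like this on the affected node. The layout under /sys/fs/cgroup varies between setups, so this just searches for the job's cgroup instead of assuming a path (the job ID here is only illustrative):

$ find /sys/fs/cgroup -path '*job_699462*' -name memory.events -exec grep -H oom_kill {} +

A non-zero oom_kill count in that output is what we currently treat as the ground truth for an OOM event.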