| Summary: | Slurm detecting OOM issues lately | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Tana Vinod <vinodkumar.tana> |
| Component: | Other | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | felip.moll, vinodkumar.tana |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Cerence AI | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
> Slurm Version-20.02.7
The OOM code has been improved since 20.02. You should upgrade to take advantage of the fixes we have made around OOM detection. I will have Oriol follow up with you should you have additional questions.
Thanks Jason. We will test with the latest version and get back to you.

Hello Tana, in case you are not already doing this in your 20.02 installation, we recommend using the cgroup task plugin rather than jobacct_gather/linux for OOM control, as explained here: https://slurm.schedmd.com/slurm.conf.html#OPT_OverMemoryKill. Would you agree to closing this bug? If you still see OOM-related issues when testing with a recent Slurm version, please reopen it. Regards.

Hi, we are already using JobAcctGatherType=jobacct_gather/cgroup in the current environment. You may close this ticket, as we are planning to move to a newer version of Slurm.

Resolving.
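For reference, the cgroup-based setup recommended above typically involves entries like the following in slurm.conf and cgroup.conf. This is a minimal sketch, not a copy of this site's configuration; whether swap constraining is appropriate depends on the cluster:

```
# slurm.conf -- use the cgroup task plugin so OOM events are caught by
# the kernel's cgroup OOM handler rather than jobacct_gather sampling
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

# cgroup.conf -- enforce the job's requested memory via the memory cgroup
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
```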
Hello Team, some jobs are being killed with OOM errors, sometimes only after several hours or days. Any idea why it takes Slurm so long to detect the OOM event? We can reproduce this issue. Is there a way to check the real-time memory usage of a job between the time it starts and the time it ends with OOM?

Slurm Version: 20.02.7

For example, job 51427778:

```
[2023-01-13T11:00:20.680] [51427778.0] _oom_event_monitor: oom-kill event count: 1
[2023-01-13T11:00:20.998] [51427778.0] Step 51427778.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
[2023-01-13T11:00:20.999] [51427778.0] error: Detected 1 oom-kill event(s) in step 51427778.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```

```
# sacct -j 51427778 --format=start,end
              Start                 End
------------------- -------------------
2023-01-13T04:00:14 2023-01-13T11:00:21
2023-01-13T04:00:14 2023-01-13T11:00:21
2023-01-13T04:00:15 2023-01-13T11:00:21
```

```
[root@cn013 ~]# seff 51427778
Job ID: 51427778
Cluster: crg2
User/Group: fuqiang_luo/users
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 06:57:30
CPU Efficiency: 99.38% of 07:00:07 core-walltime
Job Wall-clock time: 07:00:07
Memory Utilized: 3.75 GB
Memory Efficiency: 125.00% of 3.00 GB
```

FYI: we do not have real-time monitoring graphs for memory/CPU usage of jobs. We are planning to implement them with the help of Slurm support next month.
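On the real-time question: while a step is still running, Slurm's `sstat` can report live accounting data, e.g. `sstat -j 51427778.0 --format=MaxRSS,AveRSS` (the exact step ID here is taken from the log above). The "Memory Efficiency" figure that seff prints is just peak usage as a percentage of the requested limit; a minimal sketch of that arithmetic, with values copied from the seff output above and a function name of my own choosing:

```python
def memory_efficiency(used_gb: float, requested_gb: float) -> float:
    """Memory utilization as a percentage of the requested limit,
    the way seff reports it (values over 100% mean the job exceeded
    its request, which is what triggers the cgroup OOM kill)."""
    return 100.0 * used_gb / requested_gb

# Values from the seff output: 3.75 GB utilized vs. 3.00 GB requested.
print(f"{memory_efficiency(3.75, 3.00):.2f}%")  # 125.00%
```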