| Summary: | slurmctld socket timeout triggered by srun -p compute -t 0-12 -c 32 -m plane=32 --pty bash | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Koji Tanaka <it-hpc> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart |
| Version: | 20.02.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9248 | ||
| Site: | OIST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 20.02.4 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Koji Tanaka
2020-08-21 02:00:41 MDT
Dominik Bartkiewicz:
Hi, I suspect this is a duplicate of bug 9248. Next time this happens, could you take a backtrace from slurmctld? E.g.:

# gcore $(pidof slurmctld)

Then load the resulting core into gdb and share a backtrace with us:

# gdb <path to slurmctld> <path to core>
(gdb) t a a bt f

Dominik

Koji Tanaka:
Thanks, Dominik. Yes, that's what we've seen here at our cluster. I'll upgrade Slurm to 20.02.4. Thanks also for sharing an example of taking a backtrace; I'll do so next time. (Please keep this open for a couple of days from now until we confirm the problem is over. I'll update the status later this week.)
Bests, Koji

Koji Tanaka:
After updating to 20.02.4, I haven't seen the problem. Thanks, Koji

Dominik Bartkiewicz:
Hi, I am glad to hear that the update to 20.02.4 solved the problem. Could we drop the severity now? Dominik

Koji Tanaka:
Yes, no problem. Koji

Koji Tanaka:
We haven't seen the same problem since the update, so we're good now (RESOLVED).

Koji Tanaka:
Hello again. It looks like our Slurm started having a problem and generating the following messages after the version update to 20.02.4. It happens occasionally on some nodes when a job tries to use more memory than allocated:
[2020-09-03T15:37:32.749] [1182781.batch] _oom_event_monitor: oom-kill event count: 1182778
[2020-09-03T15:47:18.847] [1182781.batch] _oom_event_monitor: oom-kill event count: 11182778
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827780
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827781
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827782
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827783
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827784
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827785
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827786
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827787
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827788
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827789
[2020-09-03T15:57:21.792] [1182781.batch] _oom_event_monitor: oom-kill event count: 21182778

The count goes up endlessly. An interesting thing is that job 1182781 in the log had already been cancelled by the user, because he found similar OOM-killer messages in his log file. I confirmed that the job and its processes have already disappeared from the compute node, but slurmd keeps trying to kill it(?) for some reason. The issue persists until the disk partition holding slurmd.log becomes full, or until slurmd is restarted. It might not be related to the version update, but I thought this could possibly be a new bug. If I should open a new ticket instead of reopening this one, please let me know and I'll do so. Thanks a lot, Koji

Dominik Bartkiewicz:
Hi, you are right, this looks like a separate bug. If you could open a new ticket, that would be best. Dominik

Koji Tanaka:
Will do.
Thanks. Koji
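For reference, the gcore/gdb steps Dominik suggests can be wrapped in a small script. This is a hypothetical sketch, not part of Slurm: the `collect_backtrace` function name and the /tmp output paths are assumptions, and it must run with permission to trace the daemon (typically root).

```shell
#!/bin/bash
# Hypothetical wrapper around the steps described above (not part of Slurm):
# dump a core from the running slurmctld with gcore, then extract a full
# backtrace of all threads with gdb ("t a a bt f" is shorthand for
# "thread apply all backtrace full").
set -u

collect_backtrace() {
    local daemon="$1" outdir="${2:-/tmp}" pid
    pid=$(pidof "$daemon") || { echo "error: $daemon is not running" >&2; return 1; }
    pid=${pid%% *}  # pidof may list several PIDs; take the first

    # gcore writes <outdir>/core.<pid> without permanently stopping the daemon
    gcore -o "$outdir/core" "$pid" || return 1

    # /proc/<pid>/exe resolves the daemon binary without guessing its path
    gdb -batch -ex 'thread apply all bt full' \
        "/proc/$pid/exe" "$outdir/core.$pid" \
        > "$outdir/${daemon}-backtrace.txt"
    echo "backtrace written to $outdir/${daemon}-backtrace.txt"
}

# Only attempt the dump when the daemon is actually present
pidof slurmctld >/dev/null 2>&1 && collect_backtrace slurmctld || true
```

Attaching the resulting text file to the ticket saves a round trip compared with sharing the raw core, which can be large and may contain sensitive state.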
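Separately, the runaway `_oom_event_monitor` logging Koji describes filled the log partition before anyone noticed. A minimal watchdog sketch that counts those lines and warns past a threshold; the log path and the threshold of 1000 are assumptions for illustration, not Slurm defaults:

```shell
#!/bin/bash
# Hypothetical watchdog (not a Slurm feature): count _oom_event_monitor
# lines in slurmd.log and warn once they exceed a threshold, so the
# partition holding the log does not silently fill up.
LOG="${1:-/var/log/slurm/slurmd.log}"
THRESHOLD="${2:-1000}"

count_oom_lines() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps that
    # (and a missing file) from aborting callers
    grep -c '_oom_event_monitor: oom-kill event count' "$1" 2>/dev/null || true
}

n=$(count_oom_lines "$LOG")
echo "current _oom_event_monitor line count: ${n:-0}"
if [ "${n:-0}" -gt "$THRESHOLD" ]; then
    echo "WARNING: $LOG has $n oom-kill monitor lines; slurmd may be stuck" >&2
fi
```

Run from cron, this would flag the stuck-slurmd symptom early; restarting slurmd (which the thread confirms clears the condition) can then be done before the disk fills.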