Ticket 9633 - Slurmctld socket timeout triggered by srun -p compute -t 0-12 -c 32 -m plane=32 --pty bash
Summary: Slurmctld socket timeout triggered by srun -p compute -t 0-12 -c 32 -m plane=32...
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-08-21 02:00 MDT by Koji Tanaka
Modified: 2020-09-06 23:46 MDT

See Also:
Site: OIST
Version Fixed: 20.02.4


Description Koji Tanaka 2020-08-21 02:00:41 MDT
Hi,

The slurmctld daemon on our cluster stopped responding with a "socket timeout" yesterday. There wasn't much in slurmctld.log about the issue, but we eventually found that the problem was triggered by the following command, which one of our users had executed from a login node.

$ srun -p compute -t 0-12 -c 32 -m plane=32 --pty bash

He probably just tried it without thinking much about each parameter. We didn't believe it was the cause at first, so one of our team members gave it a try, and the socket timeout problem happened right after that. Unfortunately, we didn't raise the debug level this time either, because the cluster is a production system. However, we did notice a rapid increase in CLOSE_WAIT sessions in the output of `netstat -a`.
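For reference, the CLOSE_WAIT buildup described above can be spot-checked with a one-liner like the following (a sketch; `ss` is the modern replacement for `netstat` and produces equivalent information here):

```shell
# Count TCP sockets stuck in CLOSE_WAIT; a rapid rise in this number
# accompanied the slurmctld hang. "ss -tan state close-wait" lists only
# sockets in that state (the first output line is a column header).
close_wait=$(ss -tan state close-wait 2>/dev/null | tail -n +2 | wc -l)
echo "CLOSE_WAIT sockets: ${close_wait}"
```

Running this periodically (e.g. from cron or `watch`) gives an early warning before the controller stops responding.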

We're still in the middle of the investigation, but it would be great if you could check whether this is a known bug or something you already have an answer for.

Also, I don't know if it's related, but most of our compute nodes use AMD EPYC 7702 64-core processors.

Thanks a lot,
Koji
Comment 2 Dominik Bartkiewicz 2020-08-21 02:37:34 MDT
Hi

I suspect this is a duplicate of bug 9248.

Next time this happens, could you take a backtrace from slurmctld?

e.g.:
# gcore $(pidof slurmctld)
Load the resulting core into gdb and share a backtrace with us:
# gdb <path to slurmctld> <path to core>
(gdb) t a a bt f
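The steps above can also be run non-interactively, which is convenient on a production box; a minimal sketch (the /tmp output prefix and binary lookup via `command -v` are assumptions, adjust to your install paths):

```shell
# Dump a core of the running slurmctld and print backtraces of all
# threads without an interactive gdb session. "thread apply all bt full"
# is the long form of "t a a bt f".
pid=$(pgrep -x slurmctld || true)
if [ -z "$pid" ]; then
    msg="slurmctld is not running; nothing to dump"
    echo "$msg"
else
    gcore -o /tmp/slurmctld "$pid"
    gdb -batch -ex 'thread apply all bt full' \
        "$(command -v slurmctld)" "/tmp/slurmctld.${pid}"
fi
```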

Dominik
Comment 3 Koji Tanaka 2020-08-23 20:11:55 MDT
Thanks, Dominik.

Yes, that's exactly what we've seen on our cluster. I'll upgrade Slurm to 20.02.4. Thanks also for the backtrace instructions; I'll do that next time.

(Please keep this open for a couple of days from now until we confirm the problem is over. I'll update the status later this week.)

Bests,
Koji
Comment 4 Koji Tanaka 2020-08-23 23:02:57 MDT
After updating to 20.02.4, I haven't seen the problem.

Thanks,
Koji
Comment 5 Dominik Bartkiewicz 2020-08-24 02:42:03 MDT
Hi

I am glad to hear that the update to 20.02.4 solved the problem.
Could we drop the severity now?

Dominik
Comment 6 Koji Tanaka 2020-08-24 18:14:55 MDT
Yes, no problem.

Koji
Comment 7 Koji Tanaka 2020-08-26 01:11:49 MDT
We haven't seen the same problem since the update. So we're good now (RESOLVED).
Comment 8 Koji Tanaka 2020-09-03 02:16:13 MDT
Hello again,

It looks like our Slurm started misbehaving and generating the following messages after the update to 20.02.4. It happens occasionally on some nodes when a job tries to use more memory than it was allocated.

[2020-09-03T15:37:32.749] [1182781.batch] _oom_event_monitor: oom-kill event count: 1182778
[2020-09-03T15:47:18.847] [1182781.batch] _oom_event_monitor: oom-kill event count: 11182778
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827780
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827781
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827782
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827783
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827784
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827785
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827786
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827787
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827788
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827789
[2020-09-03T15:57:21.792] [1182781.batch] _oom_event_monitor: oom-kill event count: 21182778

The count goes up endlessly. Interestingly, job 1182781 in the log had already been cancelled by the user, because he found similar OOM-killer messages in his own log file. I confirmed that the job and its processes had already disappeared from the compute node, but slurmd keeps trying to kill it(?) for some reason. The issue persists until the disk partition holding slurmd.log becomes full, or until slurmd is restarted.
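One quick check when this happens is whether a step daemon for the cancelled job is still alive on the node (a sketch; the job ID is taken from the log above, and the pattern match on the process command line is an assumption about how slurmstepd names itself):

```shell
# Check for a lingering slurmstepd for job 1182781 (the job from the
# log above). The _oom_event_monitor messages are emitted per job step,
# so a step daemon surviving its cancelled job would explain the loop.
jobid=1182781
if pgrep -f "slurmstepd.*${jobid}" >/dev/null 2>&1; then
    status="slurmstepd for job ${jobid} still running"
else
    status="no slurmstepd for job ${jobid}"
fi
echo "$status"
```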

It might not be related to the version update, but I thought this could be a new bug. If I should open a new ticket instead of reopening this one, please let me know and I'll do so.

Thanks a lot,
Koji
Comment 9 Dominik Bartkiewicz 2020-09-03 02:56:24 MDT
Hi

You are right. This looks like a separate bug. If you could open a new ticket, this will be best.

Dominik
Comment 10 Koji Tanaka 2020-09-06 23:46:32 MDT
Will do. Thanks.

Koji