Ticket 9633 - Slurmctld socket timeout triggered by srun -p compute -t 0-12 -c 32 -m plane=32 --pty bash
Summary: Slurmctld socket timeout triggered by srun -p compute -t 0-12 -c 32 -m plane=32...
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-08-21 02:00 MDT by Koji Tanaka
Modified: 2020-09-06 23:46 MDT

See Also:
Site: OIST
Version Fixed: 20.02.4


Description Koji Tanaka 2020-08-21 02:00:41 MDT
Hi,

The slurmctld daemon on our cluster stopped responding with a "socket timeout" yesterday. There wasn't much in slurmctld.log about the issue, but we eventually found that the problem was triggered by the following command, which one of our users had executed from a login node.

$ srun -p compute -t 0-12 -c 32 -m plane=32 --pty bash

He probably just tried it without thinking much about each parameter. We didn't believe it was the cause at first, so one of our team members gave it a try, and the socket timeout problem happened right after that. Unfortunately, we didn't raise the debug level this time either, because the cluster is a production system. However, we did notice a rapid increase in CLOSE_WAIT sessions in the output of `netstat -a`.
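For reference, the CLOSE_WAIT buildup described above can be spot-checked with a one-liner like the following (a sketch; `ss` is the modern replacement for `netstat` and produces equivalent information here):

```shell
# Count TCP sockets stuck in CLOSE_WAIT; a rapid rise in this number
# accompanied the slurmctld hang. "ss -tan state close-wait" lists only
# sockets in that state (the first output line is a column header).
close_wait=$(ss -tan state close-wait 2>/dev/null | tail -n +2 | wc -l)
echo "CLOSE_WAIT sockets: ${close_wait}"
```

Running this periodically (e.g. from cron or `watch`) gives an early warning before the controller stops responding.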

We're still in the middle of the investigation, but it would be great if you could check whether this is a known bug or something you already have an answer for.

Also, I don't know if it's related, but most of our compute nodes use AMD EPYC 7702 64-core processors.

Thanks a lot,
Koji
Comment 2 Dominik Bartkiewicz 2020-08-21 02:37:34 MDT
Hi

I suspect this is a duplicate of bug 9248.

Next time this happens, could you take a backtrace from slurmctld?

e.g.:
# gcore $(pidof slurmctld)
Load the resulting core into gdb and share a backtrace with us:
# gdb <path to slurmctld> <path to core>
(gdb) t a a bt f
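The steps above can also be run non-interactively, which is convenient on a production box; a minimal sketch (the /tmp output prefix and binary lookup via `command -v` are assumptions, adjust to your install paths):

```shell
# Dump a core of the running slurmctld and print backtraces of all
# threads without an interactive gdb session. "thread apply all bt full"
# is the long form of "t a a bt f".
pid=$(pgrep -x slurmctld || true)
if [ -z "$pid" ]; then
    msg="slurmctld is not running; nothing to dump"
    echo "$msg"
else
    gcore -o /tmp/slurmctld "$pid"
    gdb -batch -ex 'thread apply all bt full' \
        "$(command -v slurmctld)" "/tmp/slurmctld.${pid}"
fi
```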

Dominik
Comment 3 Koji Tanaka 2020-08-23 20:11:55 MDT
Thanks, Dominik.

Yes, that's exactly what we've seen on our cluster. I'll upgrade Slurm to 20.02.4. Thanks also for the backtrace instructions; I'll do that next time.

(Please keep this open for a couple of days from now until we confirm the problem is over. I'll update the status later this week.)

Bests,
Koji
Comment 4 Koji Tanaka 2020-08-23 23:02:57 MDT
After updating to 20.02.4, I haven't seen the problem.

Thanks,
Koji
Comment 5 Dominik Bartkiewicz 2020-08-24 02:42:03 MDT
Hi

I am glad to hear that the update to 20.02.4 solved the problem.
Could we drop the severity now?

Dominik
Comment 6 Koji Tanaka 2020-08-24 18:14:55 MDT
Yes, no problem.

Koji
Comment 7 Koji Tanaka 2020-08-26 01:11:49 MDT
We haven't seen the same problem since the update. So we're good now (RESOLVED).
Comment 8 Koji Tanaka 2020-09-03 02:16:13 MDT
Hello again,

It looks like our Slurm started misbehaving and generating the following messages after the update to 20.02.4. It happens occasionally on some nodes when a job tries to use more memory than it was allocated.

[2020-09-03T15:37:32.749] [1182781.batch] _oom_event_monitor: oom-kill event count: 1182778
[2020-09-03T15:47:18.847] [1182781.batch] _oom_event_monitor: oom-kill event count: 11182778
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827780
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827781
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827782
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827783
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827784
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827785
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827786
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827787
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827788
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827789
[2020-09-03T15:57:21.792] [1182781.batch] _oom_event_monitor: oom-kill event count: 21182778

The count goes up endlessly. Interestingly, job 1182781 in the log had already been cancelled by the user, because he found similar OOM-killer messages in his own log file. I confirmed that the job and its processes had already disappeared from the compute node, but slurmd keeps trying to kill it(?) for some reason. The issue persists until the disk partition holding slurmd.log becomes full, or until slurmd is restarted.
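One quick check when this happens is whether a step daemon for the cancelled job is still alive on the node (a sketch; the job ID is taken from the log above, and the pattern match on the process command line is an assumption about how slurmstepd names itself):

```shell
# Check for a lingering slurmstepd for job 1182781 (the job from the
# log above). The _oom_event_monitor messages are emitted per job step,
# so a step daemon surviving its cancelled job would explain the loop.
jobid=1182781
if pgrep -f "slurmstepd.*${jobid}" >/dev/null 2>&1; then
    status="slurmstepd for job ${jobid} still running"
else
    status="no slurmstepd for job ${jobid}"
fi
echo "$status"
```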

It might not be related to the version update, but I thought this could be a new bug. If I should open a new ticket instead of reopening this one, please let me know and I'll do so.

Thanks a lot,
Koji
Comment 9 Dominik Bartkiewicz 2020-09-03 02:56:24 MDT
Hi

You are right. This looks like a separate bug. If you could open a new ticket, this will be best.

Dominik
Comment 10 Koji Tanaka 2020-09-06 23:46:32 MDT
Will do. Thanks.

Koji