Ticket 9771

Summary: Sinfo/sbatch/squeue cannot connect to server.
Product: Slurm Reporter: Bill Broadley <bill.broadley>
Component: slurmctld Assignee: Nate Rini <nate>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart, nate
Version: 19.05.2   
Hardware: Linux   
OS: Linux   
Site: NREL Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 19.05.2 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmctld.log
slurm.conf
Requested gdb log from core dump
patch for 19.05.2

Description Bill Broadley 2020-09-08 11:57:57 MDT
Created attachment 15790 [details]
slurmctld.log

We have slurm 19.05.2 installed, compiled with SchedMD recommended patches.

We had what looked like an NFS problem on Sunday, and the Slurm controller was likely offline from Sunday until this morning (Tuesday).

I'll attach the slurm.conf.

It looks like slurmctld is running and has the socket open:
```
# lsof | grep slurmctld | grep LISTEN
slurmctld 14477                slurm    4u     IPv4             118937         0t0        TCP *:pentbox-sim (LISTEN)
slurmctld 14477 14478          slurm    4u     IPv4             118937         0t0        TCP *:pentbox-sim (LISTEN)
slurmctld 14477 14482          slurm    4u     IPv4             118937         0t0        TCP *:pentbox-sim (LISTEN)
slurmctld 14477 14878          slurm    4u     IPv4             118937         0t0        TCP *:pentbox-sim (LISTEN)
```

sinfo run on the same machine also fails to connect.

Iptables shows minimal rules:
```
# iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere             udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere             udp dpt:bootps
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             192.168.122.0/24     ctstate RELATED,ESTABLISHED
ACCEPT     all  --  192.168.122.0/24     anywhere
ACCEPT     all  --  anywhere             anywhere
REJECT     all  --  anywhere             anywhere             reject-with icmp-port-unreachable
REJECT     all  --  anywhere             anywhere             reject-with icmp-port-unreachable

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere             udp dpt:bootpc
```

When starting manually with -D -v -v -d, I did see:
```
slurmctld: server_thread_count over limit (256), waiting
```

I also included a snippet of a slurmctld.log

Any ideas why slurmctld won't accept connections on port 6817 so that sinfo/sbatch can work?
Comment 1 Bill Broadley 2020-09-08 12:01:04 MDT
Created attachment 15791 [details]
slurm.conf
Comment 2 Nate Rini 2020-09-08 12:08:26 MDT
Please call the following on your controller:
> scontrol ping
> sdiag
> sacctmgr show stats
Comment 3 Jason Booth 2020-09-08 12:10:38 MDT
In addition to Nate's request do you know what was happening before this event? Was there a large batch of jobs completing?

Would you also let us know what your Prolog and Epilog scripts are doing?
Comment 4 Bill Broadley 2020-09-08 12:15:20 MDT
```
[root@emgmt1 etc]# scontrol ping
Slurmctld(primary) at emgmt1 is DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************
[root@emgmt1 ~]# sacctmgr show stats
Rollup statistics
	Hour       count:2      ave_time:11682051 max_time:15003950     total_time:23364102
	Day        count:2      ave_time:2043439 max_time:3652675      total_time:4086878
	Month      count:0      ave_time:0      max_time:0            total_time:0

Remote Procedure Call statistics by message type
	DBD_JOB_COMPLETE         ( 1424) count:530    ave_time:820    total_time:434931
	DBD_FINI                 ( 1401) count:379    ave_time:911937 total_time:345624311
	SLURM_PERSIST_INIT       ( 6500) count:132    ave_time:389    total_time:51380
	DBD_GET_QOS              ( 1448) count:131    ave_time:553    total_time:72530
	DBD_GET_ASSOCS           ( 1410) count:127    ave_time:7781   total_time:988208
	DBD_MODIFY_ASSOCS        ( 1429) count:122    ave_time:1935   total_time:236077
	DBD_STEP_COMPLETE        ( 1441) count:16     ave_time:9769   total_time:156309
	DBD_SEND_MULT_MSG        ( 1474) count:5      ave_time:103461 total_time:517306
	DBD_GET_JOBS_COND        ( 1444) count:4      ave_time:9382   total_time:37529
	DBD_GET_TRES             ( 1486) count:3      ave_time:444    total_time:1334
	DBD_STEP_START           ( 1442) count:2      ave_time:6933   total_time:13867
	DBD_REGISTER_CTLD        ( 1434) count:1      ave_time:955    total_time:955
	DBD_CLUSTER_TRES         ( 1407) count:1      ave_time:532    total_time:532
	DBD_GET_FEDERATIONS      ( 1494) count:1      ave_time:300    total_time:300
	DBD_GET_USERS            ( 1415) count:1      ave_time:48797  total_time:48797
	DBD_GET_RES              ( 1478) count:1      ave_time:1366   total_time:1366
	DBD_SEND_MULT_JOB_START  ( 1472) count:1      ave_time:250493 total_time:250493
	DBD_MODIFY_RESV          ( 1463) count:1      ave_time:702    total_time:702

Remote Procedure Call statistics by user
	root                (         0) count:878    ave_time:394930 total_time:346749090
	slurm               (       989) count:564    ave_time:2916   total_time:1644696
	kregimba            (    120043) count:16     ave_time:2696   total_time:43141
[root@emgmt1 ~]#
```

I ran sdiag, but it's just hanging; I'll post again if it returns. I'll follow up on the prolog/epilog questions.
Comment 5 Nate Rini 2020-09-08 12:17:46 MDT
(In reply to Bill Broadley from comment #4)
> I ran sdiag, but it's just hanging, I'll post again if it returns.  I'll
> followup with the prolog/epilog questions.

Please use gcore to take a core dump from slurmctld and then dump the backtrace:
> pgrep slurmctld |xargs -i gcore -a {}
> gdb $(which slurmctld) $PATH_TO_CORE
> set pagination off 
> set print pretty on
> t a a bt full
Comment 7 Bill Broadley 2020-09-08 12:45:37 MDT
Created attachment 15797 [details]
Requested gdb log from core dump
Comment 9 Nate Rini 2020-09-08 12:51:02 MDT
Bill,

This is very likely a duplicate of bug#8978. Your install is too old to apply the patch directly (https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87). The issue can be resolved by clearing out the existing jobs in StateSaveLocation, or we can make a one-off patch specific to your release version.

Please call the following if you want to try a patch.
> slurmctld -V

--Nate
Comment 10 Bill Broadley 2020-09-08 13:04:17 MDT
We have a substantial number of jobs running, and we would like to avoid losing that state if at all possible.

Could you send a patch specific to our version:

```
[root@emgmt1 log]# ls -al /proc/14477/exe
lrwxrwxrwx. 1 slurm geoclue 0 Sep  8 11:35 /proc/14477/exe -> /nopt/slurm/19.05.2/sbin/slurmctld
[root@emgmt1 log]# /nopt/slurm/19.05.2/sbin/slurmctld -V
slurm 19.05.2
```
Comment 11 Bill Broadley 2020-09-08 13:40:56 MDT
I looked at the patch and it was super simple, so I applied it manually.  Do you think this is likely to work:

```
[root@emgmt1 slurmctld]# diff job_mgr.c.orig job_mgr.c
10610c10610
< 				char *cmd_line = NULL;
---
> 				char *cmd_line = NULL, *pos = NULL;
10612,10614c10612,10614
< 					if (i != 0)
< 						xstrcatchar(cmd_line, ' ');
< 					xstrcat(cmd_line, detail_ptr->argv[i]);
---
> 					xstrfmtcatat(cmd_line, &pos, "%s%s",
> 					             (i ? " " : ""),
> 						     detail_ptr->argv[i]);
[root@emgmt1 slurmctld]#
```

It looks to me like it matches the patch you linked to.
Comment 12 Nate Rini 2020-09-08 13:51:27 MDT
Created attachment 15801 [details]
patch for 19.05.2

(In reply to Bill Broadley from comment #11)
> I looked at the patch and it was super simple, so I applied it manually.  Do
> you think this is likely to work:

The patch is attached and, yes, it was simple. It looks like 19.05.2 already had the earlier patch adding xstrfmtcatat(), which was my worry. Please give it a try.
Comment 13 Nate Rini 2020-09-08 14:33:42 MDT
(In reply to Nate Rini from comment #12)
> Patch is attached and yes it was simple. Looks like 19.05.2 had the previous
> patch to add xstrfmtcatat() which I was worried about. Please give it a try.

Reducing this to SEV3 since a workaround has been provided.
Comment 14 Bill Broadley 2020-09-08 14:53:40 MDT
Thanks, the patch worked, and things are back to normal.  Quite a few jobs survived.

We plan to upgrade to the current SchedMD-recommended LTS in early October; I'll open a separate ticket for specific recommendations.

I'm closing this ticket.