Ticket 14789

Summary: slurm_load_jobs error: Unable to contact slurm controller (connect failure)
Product: Slurm    Reporter: Jeff Haferman <jlhaferm>
Component: slurmctld    Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Priority: ---    
Version: 21.08.6   
Hardware: Linux   
OS: Linux   
Site: NPS HPC
Attachments: dmesg
2nd attempt at corefile
gdb output 2nd time

Description Jeff Haferman 2022-08-19 10:36:28 MDT
slurm commands are timing out on all nodes (login node, controller node) since about 10 pm local time last night (Thu Aug 18).

We have restarted slurmctld and munge; both restarted fine, and the logs indicate 583 jobs were recovered on restart. I see jobs running on the compute nodes. It's just that slurm commands are timing out.

The slurmctld log (at about the time of the initial failure):
[2022-08-18T23:49:53.264] error: slurm_receive_msg [0.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.264] error: slurm_receive_msg [3.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.264] error: slurm_receive_msg [3.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.264] error: slurm_receive_msg [0.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.264] error: slurm_receive_msg [3.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.264] error: slurm_receive_msg [0.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.264] error: slurm_receive_msg [3.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.265] error: slurm_receive_msg [0.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.265] error: slurm_receive_msg [3.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.265] error: slurm_receive_msg [0.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.265] error: Munge decode failed: Expired credential
[2022-08-18T23:49:53.265] error: slurm_receive_msg [0.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.265] error: slurm_receive_msg [0.0.0.0:0]: Transport endpoint is not connected
[2022-08-18T23:49:53.265] error: Munge decode failed: Expired credential
[2022-08-18T23:49:53.265] error: Munge decode failed: Expired credential
[2022-08-18T23:49:53.265] auth/munge: _print_cred: ENCODED: Thu Aug 18 21:06:50 2022
[2022-08-18T23:49:53.265] auth/munge: _print_cred: DECODED: Thu Aug 18 23:49:53 2022
[2022-08-18T23:49:53.265] error: slurm_unpack_received_msg: auth_g_verify: REQUEST_FED_INFO has authentication error: Unspecified error
[2022-08-18T23:49:53.265] error: slurm_unpack_received_msg: Protocol authentication error
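The ENCODED/DECODED pair in the log above is telling: munge credentials expire after a default TTL of 300 seconds, and the gap between the two timestamps can be computed directly. A quick sketch using GNU date, with the timestamps copied verbatim from the auth/munge lines above:

```shell
# Compute how stale the credential was by the time slurmctld decoded it.
# Timestamps are taken verbatim from the _print_cred log lines above;
# munge's default credential TTL is 300 seconds.
enc=$(date -d 'Thu Aug 18 21:06:50 2022' +%s)
dec=$(date -d 'Thu Aug 18 23:49:53 2022' +%s)
echo $(( dec - enc ))   # seconds between encode and decode
```

A gap of 9783 seconds (almost three hours) against a 300-second TTL means the RPCs sat unprocessed long past expiry — consistent with the controller being wedged rather than with clock skew between nodes.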

Three sys-admins have eyes on this and we don't see anything "else" going on with our system.

Please assist, this is considered urgent for us as our users cannot monitor or submit jobs.
Comment 1 Nate Rini 2022-08-19 10:49:03 MDT
Please attach slurm.conf (and related config files).

Did something change on the controllers?

Call this command from a node where commands are failing:
> scontrol ping
> scontrol -vvvv show config
> srun -vvvv uptime
Comment 2 Jeff Haferman 2022-08-19 10:58:00 MDT
Created attachment 26393 [details]
slurm.conf
Comment 3 Jeff Haferman 2022-08-19 10:59:07 MDT
configuration files attached.

I'm running all commands below from the node where slurmctld is running (but I get the same results on other nodes as well):

> scontrol ping
Slurmctld(primary) at hamming is DOWN

> scontrol -vvvv show config
scontrol: debug2: _slurm_connect: failed to connect to 10.1.1.201:6817: Connection refused
scontrol: debug2: Error connecting slurm stream socket at 10.1.1.201:6817: Connection refused

> srun -vvvv uptime
srun: defined options
srun: -------------------- --------------------
srun: verbose             : 4
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=18446744073709551615
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=4096
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=46333
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 46826
srun: debug:  Entering _msg_thr_internal
srun: debug3: eio_message_socket_readable: shutdown 0 fd 4
srun: debug2: _slurm_connect: failed to connect to 10.1.1.201:6817: Connection refused
srun: debug2: Error connecting slurm stream socket at 10.1.1.201:6817: Connection refused
srun: debug2: _slurm_connect: failed to connect to 10.1.1.201:6817: Connection refused
Comment 4 Jeff Haferman 2022-08-19 10:59:37 MDT
Created attachment 26394 [details]
nodes.conf

This is included by slurm.conf
Comment 5 Jeff Haferman 2022-08-19 10:59:57 MDT
Created attachment 26395 [details]
gres.conf
Comment 6 Nate Rini 2022-08-19 11:08:15 MDT
On node hamming, please call:
> ps -ef|grep -i slurmctld
> systemctl status slurmctld
Comment 7 Jeff Haferman 2022-08-19 11:22:50 MDT
Yes, it's up and running. That was the first thing I checked this morning. I even restarted slurmctld and munge, and I don't see any obvious problems. They are both running:

> ps -ef|grep -i slurmctld
slurm    27215     1 25 08:35 ?        00:26:40 /usr/sbin/slurmctld -D -s
slurm    27218 27215  0 08:35 ?        00:00:00 slurmctld: slurmscriptd
root     32404 31979  0 10:21 pts/5    00:00:00 grep --color=auto -i slurmctld

> systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-08-19 08:35:06 PDT; 1h 46min ago
 Main PID: 27215 (slurmctld)
    Tasks: 8
   CGroup: /system.slice/slurmctld.service
           ├─27215 /usr/sbin/slurmctld -D -s
           └─27218 slurmctld: slurmscriptd

Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered JobId=45295704_21423(45317214) Assoc=1057
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered JobId=45295704_21424(45317215) Assoc=1057
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered JobId=45295704_* Assoc=1057
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered information about 583 jobs
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: select/cons_res: part_data_create_array: select/cons_res: preparin...itions
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: _sync_nodes_to_comp_job: JobId=45295704_21098(45316889) in completing state
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: select/cons_res: job_res_rm_job: plugin still initializing
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: _sync_nodes_to_comp_job: JobId=45295704_21161(45316952) in completing state
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: select/cons_res: job_res_rm_job: plugin still initializing
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: _sync_nodes_to_comp_job: completing 2 jobs
Hint: Some lines were ellipsized, use -l to show in full.
Comment 8 Nate Rini 2022-08-19 11:24:00 MDT
Please call:
> sudo lsof -p 27215
Comment 9 Jeff Haferman 2022-08-19 11:26:49 MDT
> sudo lsof -p 27215

COMMAND     PID  USER   FD   TYPE             DEVICE SIZE/OFF      NODE NAME
slurmctld 27215 slurm  cwd    DIR             253,16     4096  67159852 /var/log/slurm
slurmctld 27215 slurm  rtd    DIR             253,12     4096        64 /
slurmctld 27215 slurm  txt    REG             253,12  4215160  17987183 /usr/sbin/slurmctld
slurmctld 27215 slurm  mem    REG             253,17  6406312      6753 /var/lib/sss/mc/group
slurmctld 27215 slurm  mem    REG             253,12   199976   6022963 /usr/lib64/slurm/route_default.so
slurmctld 27215 slurm  mem    REG             253,12   405232   6022965 /usr/lib64/slurm/sched_backfill.so
slurmctld 27215 slurm  mem    REG             253,12   402384  16778655 /usr/lib64/libpcre.so.1.2.0
slurmctld 27215 slurm  mem    REG             253,12   155784  16989840 /usr/lib64/libselinux.so.1
slurmctld 27215 slurm  mem    REG             253,12    15688  16778856 /usr/lib64/libkeyutils.so.1.5
slurmctld 27215 slurm  mem    REG             253,12    67104  16855599 /usr/lib64/libkrb5support.so.0.1
slurmctld 27215 slurm  mem    REG             253,12   210784  16827272 /usr/lib64/libk5crypto.so.3.1
slurmctld 27215 slurm  mem    REG             253,12    15920  16778578 /usr/lib64/libcom_err.so.2.1
slurmctld 27215 slurm  mem    REG             253,12   967784  16778651 /usr/lib64/libkrb5.so.3.3
slurmctld 27215 slurm  mem    REG             253,12   320784  16778902 /usr/lib64/libgssapi_krb5.so.2.2
slurmctld 27215 slurm  mem    REG             253,12   991616  16778586 /usr/lib64/libstdc++.so.6.0.19
slurmctld 27215 slurm  mem    REG             253,12  2521144  18036194 /usr/lib64/libcrypto.so.1.0.2k
slurmctld 27215 slurm  mem    REG             253,12   470376  16997857 /usr/lib64/libssl.so.1.0.2k
slurmctld 27215 slurm  mem    REG             253,12  3135672  16888971 /usr/lib64/mysql/libmysqlclient.so.18.0.0
slurmctld 27215 slurm  mem    REG             253,12    90248  16778577 /usr/lib64/libz.so.1.2.7
slurmctld 27215 slurm  mem    REG             253,12   367000    332793 /usr/lib64/slurm/jobcomp_mysql.so
slurmctld 27215 slurm  mem    REG             253,12   157136   6091367 /usr/lib64/slurm/topology_none.so
slurmctld 27215 slurm  mem    REG             253,12   320888   6022975 /usr/lib64/slurm/switch_cray_aries.so
slurmctld 27215 slurm  mem    REG             253,12   236848   6091360 /usr/lib64/slurm/switch_none.so
slurmctld 27215 slurm  mem    REG             253,17  8406312      6752 /var/lib/sss/mc/passwd
slurmctld 27215 slurm  mem    REG             253,12    31408  16989797 /usr/lib64/libnss_dns-2.17.so
slurmctld 27215 slurm  mem    REG             253,12   539360   6135855 /usr/lib64/slurm/accounting_storage_slurmdbd.so
slurmctld 27215 slurm  mem    REG             253,12   231352   6135866 /usr/lib64/slurm/ext_sensors_none.so
slurmctld 27215 slurm  mem    REG             253,12   249056    437075 /usr/lib64/slurm/prep_script.so
slurmctld 27215 slurm  mem    REG             253,12   299232    332786 /usr/lib64/slurm/jobacct_gather_linux.so
slurmctld 27215 slurm  mem    REG             253,12   200856   6135861 /usr/lib64/slurm/acct_gather_filesystem_none.so
slurmctld 27215 slurm  mem    REG             253,12   200888   6135862 /usr/lib64/slurm/acct_gather_interconnect_none.so
slurmctld 27215 slurm  mem    REG             253,12   219168   6135863 /usr/lib64/slurm/acct_gather_profile_none.so
slurmctld 27215 slurm  mem    REG             253,12   201440   6135857 /usr/lib64/slurm/acct_gather_energy_none.so
slurmctld 27215 slurm  mem    REG             253,12   186224    292203 /usr/lib64/slurm/preempt_none.so
slurmctld 27215 slurm  mem    REG             253,12   414824   6022970 /usr/lib64/slurm/select_linear.so
slurmctld 27215 slurm  mem    REG             253,12   935832   6022967 /usr/lib64/slurm/select_cons_res.so
slurmctld 27215 slurm  mem    REG             253,12   449704   6022969 /usr/lib64/slurm/select_cray_aries.so
slurmctld 27215 slurm  mem    REG             253,12  1021848   6022968 /usr/lib64/slurm/select_cons_tres.so
slurmctld 27215 slurm  mem    REG             253,12   236592    165644 /usr/lib64/slurm/auth_munge.so
slurmctld 27215 slurm  mem    REG             253,12    41768  18025250 /usr/lib64/libmunge.so.2.0.0
slurmctld 27215 slurm  mem    REG             253,12   184928   6135865 /usr/lib64/slurm/cred_munge.so
slurmctld 27215 slurm  mem    REG             253,17 10406312      7593 /var/lib/sss/mc/initgroups
slurmctld 27215 slurm  mem    REG             253,12    37168  16930869 /usr/lib64/libnss_sss.so.2
slurmctld 27215 slurm  mem    REG             253,12    61624  16989799 /usr/lib64/libnss_files-2.17.so
slurmctld 27215 slurm  mem    REG             253,12    88776  19246625 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
slurmctld 27215 slurm  mem    REG             253,12  2156160  16778557 /usr/lib64/libc-2.17.so
slurmctld 27215 slurm  mem    REG             253,12   142232  16989807 /usr/lib64/libpthread-2.17.so
slurmctld 27215 slurm  mem    REG             253,12   105824  16989809 /usr/lib64/libresolv-2.17.so
slurmctld 27215 slurm  mem    REG             253,12  1137024  16930765 /usr/lib64/libm-2.17.so
slurmctld 27215 slurm  mem    REG             253,12    19288  16930763 /usr/lib64/libdl-2.17.so
slurmctld 27215 slurm  mem    REG             253,12  8436648    332941 /usr/lib64/slurm/libslurmfull.so
slurmctld 27215 slurm  mem    REG             253,12   163400  16778640 /usr/lib64/ld-2.17.so
slurmctld 27215 slurm    0r   CHR                1,3      0t0      1028 /dev/null
slurmctld 27215 slurm    1u  unix 0xffff96a423d44000      0t0 317251424 socket
slurmctld 27215 slurm    2u  unix 0xffff96a423d44000      0t0 317251424 socket
slurmctld 27215 slurm    3w   REG             253,16 34773937  67159866 /var/log/slurm/slurmsched.log
slurmctld 27215 slurm    4u  unix 0xffff9664b3e8d800      0t0 317223943 socket
slurmctld 27215 slurm    5wW  REG               0,20        6     52855 /run/slurmctld.pid
slurmctld 27215 slurm    6r   REG             253,17 10406312      7593 /var/lib/sss/mc/initgroups
slurmctld 27215 slurm    7r   REG             253,12      989  22723851 /etc/group
slurmctld 27215 slurm    8r  FIFO                0,9      0t0 317120482 pipe
slurmctld 27215 slurm    9w  FIFO                0,9      0t0 317120480 pipe
slurmctld 27215 slurm   10r  FIFO                0,9      0t0 317120481 pipe
slurmctld 27215 slurm   11w  FIFO                0,9      0t0 317120482 pipe
slurmctld 27215 slurm   12u  IPv4          317193398      0t0       TCP hamming.hamming.cluster:37548->hamming.hamming.cluster:6819 (ESTABLISHED)
slurmctld 27215 slurm   13r   REG             253,17  8406312      6752 /var/lib/sss/mc/passwd
slurmctld 27215 slurm   14w   REG             253,16 98964453  67159864 /var/log/slurm/slurmctld.log
slurmctld 27215 slurm   15r   REG             253,17  6406312      6753 /var/lib/sss/mc/group
slurmctld 27215 slurm   16u  sock                0,7      0t0 327933970 protocol: UNIX
Comment 10 Nate Rini 2022-08-19 11:28:32 MDT
Please attach the contents of /var/log/slurm/slurmsched.log from today.

Please also call:
> sudo netstat -lpnt |grep -i slurm
Comment 11 Jeff Haferman 2022-08-19 11:32:52 MDT
I see that nothing has been written to /var/log/slurm/slurmsched.log since April 29 of this year.


> sudo netstat -lpnt |grep -i slurm
tcp        0      0 0.0.0.0:6819            0.0.0.0:*               LISTEN      3704/slurmdbd
Comment 12 Nate Rini 2022-08-19 11:36:14 MDT
(In reply to Jeff Haferman from comment #11)
> I see that nothing has been written to /var/log/slurm/slurmsched.log since
> April 29 of this year.
Please also provide today's logs from /var/log/slurm/slurmctld.log then.

(In reply to Jeff Haferman from comment #11)
> tcp        0      0 0.0.0.0:6819            0.0.0.0:*               LISTEN  
> 3704/slurmdbd
Looks like slurmctld is not listening for connections.

Let's grab a core from it before we restart it:
> sudo gcore 27215

Then call (as root):
> systemctl stop slurmctld
> slurmctld -Dvvvvvvvv

Please attach the slurmctld stdio once it is stable.
Comment 13 Jeff Haferman 2022-08-19 11:41:24 MDT
Created attachment 26396 [details]
slurmctld.log
Comment 14 Jeff Haferman 2022-08-19 11:44:08 MDT
Created attachment 26397 [details]
core.27215.gz
Comment 15 Nate Rini 2022-08-19 11:45:05 MDT
(In reply to Jeff Haferman from comment #14)
> Created attachment 26397 [details]
> core.27215.gz

Cores are not useful without all of the binaries and libraries used. Please call this instead:
> gdb -ex 't a a bt full' $PATH_TO_core.27215 $(which slurmctld)
Comment 16 Nate Rini 2022-08-19 11:45:20 MDT
Please provide the output of:
> ip a
Comment 18 Nate Rini 2022-08-19 11:47:04 MDT
Please also check the output of 'dmesg' for any errors, especially OOM errors or messages.
Comment 19 Jeff Haferman 2022-08-19 11:55:18 MDT
Created attachment 26398 [details]
slurmctld verbose output
Comment 20 Nate Rini 2022-08-19 12:02:38 MDT
Please call with the verbose slurmctld running:
> netstat -lpnt | grep -i -e slurm -e munge
> nc hamming 6817 |hexdump -C #should return a pile of hex chars
Comment 21 Nate Rini 2022-08-19 12:04:00 MDT
Please call this on hamming and at least one of the compute nodes:
> getent hosts 10.1.1.201 hamming
Comment 22 Jeff Haferman 2022-08-19 12:06:51 MDT
Note I restarted slurmctld, it is now PID 10825.

> sudo gcore 10825

I think you have the arguments reversed in the following?

> gdb -ex 't a a bt full' $PATH_TO_core.10825 $(which slurmctld)

But this works:

> gdb -ex 't a a bt full' $(which slurmctld) $PATH_TO_core.10825 

Output attached.
Comment 23 Jeff Haferman 2022-08-19 12:07:06 MDT
Created attachment 26399 [details]
gdb output
Comment 24 Jeff Haferman 2022-08-19 12:10:11 MDT
dmesg (and other system logs) don't show anything unusual.

important interfaces (below) are all up. enp2s0f0 is our ethernet interface, and ib0 is our infiniband interface.

> ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether fc:aa:14:ea:8a:6b brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.201/16 brd 10.1.255.255 scope global noprefixroute enp2s0f0
       valid_lft forever preferred_lft forever
3: enp2s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether fc:aa:14:ea:8a:6c brd ff:ff:ff:ff:ff:ff
4: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 40:8d:5c:39:56:64 brd ff:ff:ff:ff:ff:ff
    inet 172.20.32.200/22 brd 172.20.35.255 scope global noprefixroute ens2f0
       valid_lft forever preferred_lft forever
5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 40:8d:5c:39:56:65 brd ff:ff:ff:ff:ff:ff
    inet 10.200.1.1/16 brd 10.200.255.255 scope global noprefixroute ens2f1
       valid_lft forever preferred_lft forever
6: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:3b:01:2e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.100.1.28/16 brd 10.100.255.255 scope global noprefixroute ib0
       valid_lft forever preferred_lft forever
7: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN group default qlen 256
    link/infiniband 00:00:00:65:fe:80:00:00:00:00:00:00:24:8a:07:03:00:3b:01:2f brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
8: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:5d:0d:57 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
9: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 1000
    link/ether 52:54:00:5d:0d:57 brd ff:ff:ff:ff:ff:ff
Comment 25 Jeff Haferman 2022-08-19 12:16:30 MDT
>  netstat -lpnt | grep -i -e slurm -e munge
tcp        0      0 0.0.0.0:6819            0.0.0.0:*               LISTEN      3704/slurmdbd
Comment 26 Jeff Haferman 2022-08-19 12:17:30 MDT
> nc hamming 6817 |hexdump -C 
Ncat: Connection refused.


This is weird...
Comment 27 Jeff Haferman 2022-08-19 12:19:56 MDT
On hamming:
> getent hosts 10.1.1.201 hamming
10.1.1.201      hamming.hamming.cluster
10.1.1.201      hamming.hamming.cluster

On login node:
> getent hosts 10.1.1.201 hamming
10.1.1.201      hamming.hamming.cluster
10.1.1.201      hamming.hamming.cluster

On random compute node:
> getent hosts 10.1.1.201 hamming
10.1.1.201      hamming.hamming.cluster
10.1.1.201      hamming.hamming.cluster
Comment 28 Nate Rini 2022-08-19 12:21:44 MDT
Please call:
> ip route
> ip route get 10.1.1.201
Comment 29 Nate Rini 2022-08-19 12:22:09 MDT
(In reply to Nate Rini from comment #28)
> Please call:
> > ip route
> > ip route get 10.1.1.201

on hamming and a random compute node
Comment 30 Jeff Haferman 2022-08-19 12:25:36 MDT
On hamming:
> ip route
default via 172.20.32.1 dev ens2f0 proto static metric 102 
10.1.0.0/16 dev enp2s0f0 proto kernel scope link src 10.1.1.201 metric 100 
10.100.0.0/16 dev ib0 proto kernel scope link src 10.100.1.28 metric 150 
10.200.0.0/16 dev ens2f1 proto kernel scope link src 10.200.1.1 metric 103 
172.20.32.0/22 dev ens2f0 proto kernel scope link src 172.20.32.200 metric 102 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 

> ip route get 10.1.1.201
local 10.1.1.201 dev lo src 10.1.1.201 
    cache <local> 


On random compute node:
> ip route
default via 10.1.1.201 dev enp2s0f0 proto static metric 100 
10.1.0.0/16 dev enp2s0f0 proto kernel scope link src 10.1.3.1 metric 100 
10.100.0.0/16 dev ib0 proto kernel scope link src 10.100.3.1 
169.254.0.0/16 dev ib0 scope link metric 1004 

> ip route get 10.1.1.201
10.1.1.201 dev enp2s0f0 src 10.1.3.1 
    cache
Comment 31 Nate Rini 2022-08-19 12:26:34 MDT
(In reply to Jeff Haferman from comment #26)
> > nc hamming 6817 |hexdump -C 
> Ncat: Connection refused.
> 
> 
> This is weird...

Was this on hamming or compute node?
Comment 32 Nate Rini 2022-08-19 12:28:04 MDT
(In reply to Jeff Haferman from comment #25)
> >  netstat -lpnt | grep -i -e slurm -e munge
> tcp        0      0 0.0.0.0:6819            0.0.0.0:*               LISTEN  
> 3704/slurmdbd

Slurmctld is still not listening on :6817.

Please call:
> strace -ff -e open,connect,listen -o /tmp/strace.slurmctld.log -- slurmctld -Dvv

Please attach the stdio output and /tmp/strace.slurmctld.log.* once it is stable.
Comment 33 Jeff Haferman 2022-08-19 12:30:10 MDT
> > nc hamming 6817 |hexdump -C 
> Ncat: Connection refused.

I get this on BOTH hamming and a random compute node.
Comment 34 Nate Rini 2022-08-19 12:33:45 MDT
(In reply to Jeff Haferman from comment #33)
> > > nc hamming 6817 |hexdump -C 
> > Ncat: Connection refused.
> 
> I get this on BOTH hamming and a random compute node.

That at least means this is very unlikely to be a routing issue. The strace logs should show why it is not listening.

Is there anything special about this node? Such as being in a container, VM, or another kind of network virtualization?
Comment 35 Jeff Haferman 2022-08-19 12:35:23 MDT
Created attachment 26401 [details]
strace stdio
Comment 36 Jeff Haferman 2022-08-19 12:40:24 MDT
Created attachment 26402 [details]
strace tarball
Comment 37 Jeff Haferman 2022-08-19 12:41:05 MDT
> Is there anything special about this node? Such as being in a container, VM, or another kind of network virtualization?

No.
Comment 39 Nate Rini 2022-08-19 12:47:49 MDT
> connect(12, {sa_family=AF_INET, sin_port=htons(6819), sin_addr=inet_addr("10.1.1.201")}, 128) = -1 EINPROGRESS (Operation now in progress)

Something in the kernel is blocking slurmctld from opening the listening socket, or the kernel is running extremely slowly. The fastest way to resolve this is likely to restart the node.

Please attach the contents of `dmesg`.
Comment 40 Nate Rini 2022-08-19 12:50:08 MDT
Please also call on hamming:
> sudo ss -at '( sport = :6817 )'
Comment 41 Jeff Haferman 2022-08-19 12:52:47 MDT
Created attachment 26403 [details]
dmesg
Comment 42 Jeff Haferman 2022-08-19 12:54:49 MDT
> sudo ss -at '( sport = :6817 )'
State       Recv-Q Send-Q                      Local Address:Port                                       Peer Address:Port                



Nothing :(
Comment 43 Nate Rini 2022-08-19 12:58:05 MDT
Please call gcore again against the restarted slurmctld. The last backtrace shows it potentially hung while calling getpwnam_r() which shouldn't happen.

Please also provide:
> cat /etc/nsswitch.conf
Comment 44 Jeff Haferman 2022-08-19 12:59:55 MDT
> cat /etc/nsswitch.conf
#
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Valid entries include:
#
#	nisplus			Use NIS+ (NIS version 3)
#	nis			Use NIS (NIS version 2), also called YP
#	dns			Use DNS (Domain Name Service)
#	files			Use the local files
#	db			Use the local database (.db) files
#	compat			Use NIS on compat mode
#	hesiod			Use Hesiod for user lookups
#	[NOTFOUND=return]	Stop searching if not found so far
#

# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd:    db files nisplus nis
#shadow:    db files nisplus nis
#group:     db files nisplus nis

passwd:     files sss
shadow:     files sss
group:      files sss
#initgroups: files

#hosts:     db files nisplus nis dns
hosts:      files dns myhostname

# Example - obey only what nisplus tells us...
#services:   nisplus [NOTFOUND=return] files
#networks:   nisplus [NOTFOUND=return] files
#protocols:  nisplus [NOTFOUND=return] files
#rpc:        nisplus [NOTFOUND=return] files
#ethers:     nisplus [NOTFOUND=return] files
#netmasks:   nisplus [NOTFOUND=return] files     

bootparams: nisplus [NOTFOUND=return] files

ethers:     files
netmasks:   files
networks:   files
protocols:  files
rpc:        files
services:   files sss

netgroup:   files sss

publickey:  nisplus

automount:  files sss
aliases:    files nisplus
Comment 45 Jeff Haferman 2022-08-19 13:07:57 MDT
Created attachment 26404 [details]
2nd attempt at corefile
Comment 46 Jeff Haferman 2022-08-19 13:08:18 MDT
Created attachment 26405 [details]
gdb output 2nd time
Comment 48 Nate Rini 2022-08-19 13:17:23 MDT
Please restart sssd:
> systemctl restart sssd
> systemctl status sssd
Comment 49 Jeff Haferman 2022-08-19 13:19:09 MDT
> systemctl restart sssd
Done.

> systemctl status sssd
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-08-19 12:17:58 PDT; 2s ago
 Main PID: 7155 (sssd)
    Tasks: 5
   Memory: 41.1M
   CGroup: /system.slice/sssd.service
           ├─7155 /usr/sbin/sssd -i --logger=files
           ├─7156 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 --logger=files
           ├─7157 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           ├─7158 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
           └─7159 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=files

Aug 19 12:17:58 hamming.uc.nps.edu systemd[1]: Starting System Security Services Daemon...
Aug 19 12:17:58 hamming.uc.nps.edu sssd[7155]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[be[default]][7156]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[be[default]][7156]: Your configuration uses the autofs provider with schema set to rfc2307 ...utes.
Aug 19 12:17:58 hamming.uc.nps.edu sssd[nss][7157]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[autofs][7159]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[pam][7158]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu systemd[1]: Started System Security Services Daemon.
Hint: Some lines were ellipsized, use -l to show in full.
Comment 50 Jeff Haferman 2022-08-19 13:19:39 MDT
That seems to have fixed it!
Comment 51 Nate Rini 2022-08-19 13:21:02 MDT
It's possible sssd hit this bug:
> https://github.com/SSSD/sssd
Comment 52 Nate Rini 2022-08-19 13:22:01 MDT
(In reply to Nate Rini from comment #51)
> It's possible sssd hit this bug:
> > https://github.com/SSSD/sssd

This bug specifically:
> https://github.com/SSSD/sssd/issues/2711
Comment 53 Jeff Haferman 2022-08-19 13:25:43 MDT
OK, is there a particular clue you saw that suggested we restart sssd?
Comment 54 Nate Rini 2022-08-19 13:27:29 MDT
(In reply to Nate Rini from comment #47)
> (In reply to Jeff Haferman from comment #46)
> > Created attachment 26405 [details]
> > gdb output 2nd time
> > #0  0x00007fcb47a38bed in poll () from /lib64/libc.so.6
> > #1  0x00007fcb473148c8 in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
> > #2  0x00007fcb47315346 in sss_nss_make_request_timeout () from /lib64/libnss_sss.so.2
> > #3  0x00007fcb47316687 in internal_getgrent_r () from /lib64/libnss_sss.so.2
> > #4  0x00007fcb47316fac in _nss_sss_getgrent_r () from /lib64/libnss_sss.so.2
> > #5  0x00007fcb47a581c5 in __nss_getent_r () from /lib64/libc.so.6
> > #6  0x00007fcb47a077e7 in getgrent_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> > #7  0x00000000004455ed in _get_group_members (group_name=0x1be26e0 "kraken") at groups.c:288
> > #8  0x0000000000445a1a in get_groups_members (group_names=<optimized out>) at groups.c:170
> > #9  0x000000000048c32f in _update_part_uid_access_list (x=0x136a1d0, arg=0x7ffe0f0d42f4) at partition_mgr.c:1951
> > #10 0x00007fcb4870001b in list_for_each_max (l=0x1358b40, max=max@entry=0x7ffe0f0d42c4, 
> > #11 0x00007fcb48700101 in list_for_each (l=<optimized out>, f=f@entry=0x48c2fe <_update_part_uid_access_list>, 
> > #12 0x00000000004901ae in load_part_uid_allow_list (force=force@entry=1) at partition_mgr.c:1992
> > #13 0x00000000004a9a51 in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1373
> > #14 0x000000000042f3be in main (argc=<optimized out>, argv=<optimized out>) at controller.c:654

(In reply to Jeff Haferman from comment #53)
> OK, is there a particular clue you saw that suggested we restart sssd?

In the 2nd stack trace attached, it showed slurmctld was again waiting on sssd (via sss_cli_make_request_nochecks()), which should complete in milliseconds.
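As a generic way to reproduce this symptom outside slurmctld (not a command from the ticket), the same NSS call path can be timed from a shell:

```shell
# Time a full group enumeration through NSS -- the same getgrent() path the
# backtrace shows slurmctld blocked in (via libnss_sss). On a healthy system
# this finishes in well under a second; a multi-second or hanging result
# points at a stuck NSS backend such as sssd.
time getent group >/dev/null
```

Running this on hamming while slurm commands are flapping should hang or take seconds, and return instantly right after an sssd restart.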
Comment 55 Jeff Haferman 2022-08-19 13:31:46 MDT
OK... it (slurm commands) started timing out again.

Restart of sssd fixes it. But I'm wondering if it's going to continue to flap.
Comment 56 Jeff Haferman 2022-08-19 13:35:39 MDT
So far things are stable after the second restart of sssd....
Comment 57 Jeff Haferman 2022-08-19 13:39:40 MDT
It's continuing to flap :(
Comment 58 Jeff Haferman 2022-08-19 13:42:50 MDT
Do you think we should reboot the controller node?
Comment 59 Nate Rini 2022-08-19 13:43:45 MDT
(In reply to Jeff Haferman from comment #57)
> It's continuing to flap :(

I assume you mean it's freezing and not responding?

Please take another gcore and then restart sssd again to see if it comes back.
Comment 60 Nate Rini 2022-08-19 13:44:45 MDT
(In reply to Jeff Haferman from comment #58)
> Do you think we should reboot the controller node?

That might fix it. sssd is technically not supported by SchedMD support. It might just be easier to take sssd out of nsswitch.conf while debugging the issue with sssd.
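Taking sssd out of the lookup chain is a one-line change to the passwd/shadow/group entries. A minimal sketch, operating on a copy in /tmp so the live file is untouched (the input lines mirror the entries shown in comment 44):

```shell
# Strip the trailing "sss" source from the passwd/shadow/group entries.
# Input lines mirror the site's nsswitch.conf from comment 44; edit a copy
# first and only move it into place once the result looks right.
cat > /tmp/nsswitch.test <<'EOF'
passwd:     files sss
shadow:     files sss
group:      files sss
EOF
sed -i 's/[[:space:]]*sss$//' /tmp/nsswitch.test
cat /tmp/nsswitch.test
```

Long-running daemons (slurmctld included) resolve NSS sources through glibc, so restart them after changing the file to be sure the new lookup order is picked up.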
Comment 61 Jeff Haferman 2022-08-19 13:47:25 MDT
OK....

Yes, restarting sssd brings Slurm back, but only for a few minutes. I may try a reboot, or we could look at updating to the latest sssd, or try your option of removing it from nsswitch.conf.
Comment 62 Nate Rini 2022-08-19 14:34:01 MDT
(In reply to Jeff Haferman from comment #61)
> OK....
> 
> yes, restarting sssd brings slurm back. but only for a few minutes. I may
> try a reboot, or we could look at updating to the latest sssd as well. or
> try your option of removing from nsswitch.conf.

Please keep us updated on what your site plans to try.
Comment 63 Jeff Haferman 2022-08-19 14:35:38 MDT
I rebooted and we're still seeing the "flapping".

For now, I've created a cronjob to restart sssd every 2 minutes. We'll continue to diagnose.
Comment 64 Nate Rini 2022-08-19 14:43:19 MDT
(In reply to Jeff Haferman from comment #63)
> I rebooted and we're still seeing the "flapping".
> 
> For now, I've created a cronjob to restart sssd every 2 minutes. We'll
> continue to diagnose.

Slurm expects that libnss will work and will be fast. This needs to be fixed upstream in sssd or sssd needs to be replaced.
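A quick latency check along these lines (assumption: GNU `date` with `%N` nanosecond support, as on typical Linux controllers) makes "fast" measurable; multi-second results point at a wedged NSS backend:

```shell
# Measure one NSS passwd lookup in milliseconds.
start=$(date +%s%N)
getent passwd root > /dev/null
end=$(date +%s%N)
echo "lookup took $(( (end - start) / 1000000 )) ms"
```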
Comment 65 Jeff Haferman 2022-08-19 14:45:15 MDT
I'm wondering why this is the first time we have run into this issue...
Comment 66 Jeff Haferman 2022-08-19 15:17:38 MDT
I'm wondering if you have links to any related issues? I did a search and found
https://bugs.schedmd.com/show_bug.cgi?id=3159

which is old, but mentions sssd. In there, I see the command:
> sacctmgr show problems

When I run this on our system, I see:
  Cluster    Account       User                                  Problem 
---------- ---------- ---------- ---------------------------------------- 
                        bchainou                 User does not have a uid 
                      christian+                 User does not have a uid 
                         mpakin1                 User does not have a uid 
                      stewart.a+                 User does not have a uid 


Is this a problem?
Comment 67 Nate Rini 2022-08-19 15:20:17 MDT
(In reply to Jeff Haferman from comment #66)
> I'm wondering if you have links to any related issues? I did a search and
> found
> https://bugs.schedmd.com/show_bug.cgi?id=3159
> 
> which is old, but mentions sssd. In there, I see the command:
> > sacctmgr show problems
> 
> When I run this on our system, I see:
>   Cluster    Account       User                                  Problem 
> ---------- ---------- ---------- ---------------------------------------- 
>                         bchainou                 User does not have a uid 
>                       christian+                 User does not have a uid 
>                          mpakin1                 User does not have a uid 
>                       stewart.a+                 User does not have a uid 

This just means that Slurm could not resolve the user id for the given list of users. Most likely, those users have been deleted, or sssd is somehow missing users.

It is possible to verify outside of Slurm via:
> getent passwd mpakin1
Comment 68 Jeff Haferman 2022-08-19 15:23:43 MDT
> This just means that Slurm could not resolve the user id for the given list of users. Most likely, those users have been deleted, or sssd is somehow missing users.

Right, those users don't exist, but there is still an account for them in Slurm.

OK... we'll look at updating sssd.
Comment 69 Nate Rini 2022-08-19 16:02:52 MDT
(In reply to Jeff Haferman from comment #68)
> > This just means that Slurm could not resolve the user id for the given list of users. Most likely, those users have been deleted, or sssd is somehow missing users.
> 
> Right, those users don't exist, but there is still an account for them in
> Slurm.

Slurm maintains the accounting records indefinitely (unless purge is configured). There is no requirement that those users exist in the system to dump accounting data for those users.
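For reference, the purge behavior mentioned above is configured in slurmdbd.conf; the parameter names below are from the slurmdbd.conf man page, and the values are illustrative only:

```
# slurmdbd.conf -- optional record purging (disabled unless set).
PurgeJobAfter=12months
PurgeStepAfter=6months
PurgeEventAfter=1month
PurgeUsageAfter=24months
```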
Comment 70 Jeff Haferman 2022-08-19 16:15:48 MDT
OK, one of my sys-admins might have come up with a fix... at least so far this is working. I got rid of my cronjob kludge, then:

On the slurm controller node, we removed the sssd cache:

> systemctl stop sssd
> /bin/rm /var/lib/sss/db/*
> systemctl start sssd


This appears to have worked (Slurm has been responsive for over 10 minutes, which is the longest since the problem began).
Comment 71 Nate Rini 2022-08-19 16:25:22 MDT
Are there any more related questions?
Comment 72 Jeff Haferman 2022-08-19 16:26:55 MDT
No, I just want to say thank you for assisting, and for your quick responses. This was a nasty one.
Comment 73 Nate Rini 2022-08-19 16:52:22 MDT
Closing per the last comment. Please reply if any more related questions come up.
Comment 74 Jeff Haferman 2022-08-24 13:40:55 MDT
We have fully resolved this issue, and I just wanted to update here:

The root cause was a "bad" (but valid!) entry in our LDAP database. Specifically, a user of the form "foo@bar.com" was entered as a member of a group, where the correct entry should have been "foo". 

I believe LDAP attempted to authenticate the user by reaching out to "bar.com", which was unavailable, thus the timeout. As far as I can tell, this was only triggered when a member of the group in question attempted to perform some operation using slurm.
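One way to audit for this kind of mis-entered group member (a sketch, assuming `getent` can enumerate your directory-backed groups; entries cached or local-only may differ): flag any member field containing "@", which in this incident was a "foo@bar.com" entered where "foo" belonged. The filter is demonstrated on sample data first:

```shell
# Demonstration on sample group(5)-format lines; field 4 is the member list.
printf 'good:x:100:alice,bob\nbad:x:101:foo@bar.com\n' |
  awk -F: '$4 ~ /@/ {print $1": "$4}'
# prints: bad: foo@bar.com

# Against the live directory (may be slow or incomplete with enumeration off):
#   getent group | awk -F: '$4 ~ /@/ {print $1": "$4}'
```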