| Summary: | slurm_load_jobs error: Unable to contact slurm controller (connect failure) | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Jeff Haferman <jlhaferm> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NPS HPC | | |
| Attachments: | dmesg, 2nd attempt at corefile, gdb output 2nd time | | |
Description
Jeff Haferman 2022-08-19 10:36:28 MDT
Please attach slurm.conf (and related config files).
Did something change on the controllers?
Call these commands from a node where commands are failing:
> scontrol ping
> scontrol -vvvv show config
> srun -vvvv uptime
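The three checks above can be wrapped in a small script (a sketch, not from the ticket; it only runs the commands requested above, skips any that are not installed, and applies a timeout so a hang is visible rather than silent):

```shell
#!/bin/bash
# Sketch: run the requested controller-connectivity checks in sequence.
# Commands that are not installed are skipped instead of erroring out.
run_checks() {
    local cmd
    for cmd in "scontrol ping" "scontrol -vvvv show config" "srun -vvvv uptime"; do
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            echo "== $cmd =="
            timeout 30 $cmd || echo "FAILED: $cmd"
        else
            echo "SKIP (not installed): $cmd"
        fi
    done
}
run_checks
```

On a healthy cluster each command returns within a second or two; the 30-second timeout is an arbitrary ceiling chosen here to make a hung controller obvious.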
Created attachment 26393 [details]
slurm.conf
Configuration files attached. I'm running all commands below from the node where slurmctld is running (but I get the same results on other nodes as well):

> scontrol ping
Slurmctld(primary) at hamming is DOWN

> scontrol -vvvv show config
scontrol: debug2: _slurm_connect: failed to connect to 10.1.1.201:6817: Connection refused
scontrol: debug2: Error connecting slurm stream socket at 10.1.1.201:6817: Connection refused

> srun -vvvv uptime
srun: defined options
srun: -------------------- --------------------
srun: verbose             : 4
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=18446744073709551615
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=4096
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=46333
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 46826
srun: debug:  Entering _msg_thr_internal
srun: debug3: eio_message_socket_readable: shutdown 0 fd 4
srun: debug2: _slurm_connect: failed to connect to 10.1.1.201:6817: Connection refused
srun: debug2: Error connecting slurm stream socket at 10.1.1.201:6817: Connection refused
srun: debug2: _slurm_connect: failed to connect to 10.1.1.201:6817: Connection refused

Created attachment 26394 [details]
nodes.conf
This is included by slurm.conf
Created attachment 26395 [details]
gres.conf
On node hamming, please call:
> ps -ef|grep -i slurmctld
> systemctl status slurmctld
Yes, it's up and running. That was the first thing I checked this morning. I even restarted slurmctld and munge, and I don't see any obvious problems. They are both running:

> ps -ef | grep -i slurmctld
slurm    27215     1 25 08:35 ?        00:26:40 /usr/sbin/slurmctld -D -s
slurm    27218 27215  0 08:35 ?        00:00:00 slurmctld: slurmscriptd
root     32404 31979  0 10:21 pts/5    00:00:00 grep --color=auto -i slurmctld

> systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-08-19 08:35:06 PDT; 1h 46min ago
 Main PID: 27215 (slurmctld)
    Tasks: 8
   CGroup: /system.slice/slurmctld.service
           ├─27215 /usr/sbin/slurmctld -D -s
           └─27218 slurmctld: slurmscriptd

Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered JobId=45295704_21423(45317214) Assoc=1057
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered JobId=45295704_21424(45317215) Assoc=1057
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered JobId=45295704_* Assoc=1057
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: Recovered information about 583 jobs
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: select/cons_res: part_data_create_array: select/cons_res: preparin...itions
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: _sync_nodes_to_comp_job: JobId=45295704_21098(45316889) in completing state
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: select/cons_res: job_res_rm_job: plugin still initializing
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: _sync_nodes_to_comp_job: JobId=45295704_21161(45316952) in completing state
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: select/cons_res: job_res_rm_job: plugin still initializing
Aug 19 08:35:21 hamming.uc.nps.edu slurmctld[27215]: slurmctld: _sync_nodes_to_comp_job: completing 2 jobs
Hint: Some lines were ellipsized, use -l to show in full.

Please call:
> sudo lsof -p 27215
> sudo lsof -p 27215
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
slurmctld 27215 slurm cwd DIR 253,16 4096 67159852 /var/log/slurm
slurmctld 27215 slurm rtd DIR 253,12 4096 64 /
slurmctld 27215 slurm txt REG 253,12 4215160 17987183 /usr/sbin/slurmctld
slurmctld 27215 slurm mem REG 253,17 6406312 6753 /var/lib/sss/mc/group
slurmctld 27215 slurm mem REG 253,12 199976 6022963 /usr/lib64/slurm/route_default.so
slurmctld 27215 slurm mem REG 253,12 405232 6022965 /usr/lib64/slurm/sched_backfill.so
slurmctld 27215 slurm mem REG 253,12 402384 16778655 /usr/lib64/libpcre.so.1.2.0
slurmctld 27215 slurm mem REG 253,12 155784 16989840 /usr/lib64/libselinux.so.1
slurmctld 27215 slurm mem REG 253,12 15688 16778856 /usr/lib64/libkeyutils.so.1.5
slurmctld 27215 slurm mem REG 253,12 67104 16855599 /usr/lib64/libkrb5support.so.0.1
slurmctld 27215 slurm mem REG 253,12 210784 16827272 /usr/lib64/libk5crypto.so.3.1
slurmctld 27215 slurm mem REG 253,12 15920 16778578 /usr/lib64/libcom_err.so.2.1
slurmctld 27215 slurm mem REG 253,12 967784 16778651 /usr/lib64/libkrb5.so.3.3
slurmctld 27215 slurm mem REG 253,12 320784 16778902 /usr/lib64/libgssapi_krb5.so.2.2
slurmctld 27215 slurm mem REG 253,12 991616 16778586 /usr/lib64/libstdc++.so.6.0.19
slurmctld 27215 slurm mem REG 253,12 2521144 18036194 /usr/lib64/libcrypto.so.1.0.2k
slurmctld 27215 slurm mem REG 253,12 470376 16997857 /usr/lib64/libssl.so.1.0.2k
slurmctld 27215 slurm mem REG 253,12 3135672 16888971 /usr/lib64/mysql/libmysqlclient.so.18.0.0
slurmctld 27215 slurm mem REG 253,12 90248 16778577 /usr/lib64/libz.so.1.2.7
slurmctld 27215 slurm mem REG 253,12 367000 332793 /usr/lib64/slurm/jobcomp_mysql.so
slurmctld 27215 slurm mem REG 253,12 157136 6091367 /usr/lib64/slurm/topology_none.so
slurmctld 27215 slurm mem REG 253,12 320888 6022975 /usr/lib64/slurm/switch_cray_aries.so
slurmctld 27215 slurm mem REG 253,12 236848 6091360 /usr/lib64/slurm/switch_none.so
slurmctld 27215 slurm mem REG 253,17 8406312 6752 /var/lib/sss/mc/passwd
slurmctld 27215 slurm mem REG 253,12 31408 16989797 /usr/lib64/libnss_dns-2.17.so
slurmctld 27215 slurm mem REG 253,12 539360 6135855 /usr/lib64/slurm/accounting_storage_slurmdbd.so
slurmctld 27215 slurm mem REG 253,12 231352 6135866 /usr/lib64/slurm/ext_sensors_none.so
slurmctld 27215 slurm mem REG 253,12 249056 437075 /usr/lib64/slurm/prep_script.so
slurmctld 27215 slurm mem REG 253,12 299232 332786 /usr/lib64/slurm/jobacct_gather_linux.so
slurmctld 27215 slurm mem REG 253,12 200856 6135861 /usr/lib64/slurm/acct_gather_filesystem_none.so
slurmctld 27215 slurm mem REG 253,12 200888 6135862 /usr/lib64/slurm/acct_gather_interconnect_none.so
slurmctld 27215 slurm mem REG 253,12 219168 6135863 /usr/lib64/slurm/acct_gather_profile_none.so
slurmctld 27215 slurm mem REG 253,12 201440 6135857 /usr/lib64/slurm/acct_gather_energy_none.so
slurmctld 27215 slurm mem REG 253,12 186224 292203 /usr/lib64/slurm/preempt_none.so
slurmctld 27215 slurm mem REG 253,12 414824 6022970 /usr/lib64/slurm/select_linear.so
slurmctld 27215 slurm mem REG 253,12 935832 6022967 /usr/lib64/slurm/select_cons_res.so
slurmctld 27215 slurm mem REG 253,12 449704 6022969 /usr/lib64/slurm/select_cray_aries.so
slurmctld 27215 slurm mem REG 253,12 1021848 6022968 /usr/lib64/slurm/select_cons_tres.so
slurmctld 27215 slurm mem REG 253,12 236592 165644 /usr/lib64/slurm/auth_munge.so
slurmctld 27215 slurm mem REG 253,12 41768 18025250 /usr/lib64/libmunge.so.2.0.0
slurmctld 27215 slurm mem REG 253,12 184928 6135865 /usr/lib64/slurm/cred_munge.so
slurmctld 27215 slurm mem REG 253,17 10406312 7593 /var/lib/sss/mc/initgroups
slurmctld 27215 slurm mem REG 253,12 37168 16930869 /usr/lib64/libnss_sss.so.2
slurmctld 27215 slurm mem REG 253,12 61624 16989799 /usr/lib64/libnss_files-2.17.so
slurmctld 27215 slurm mem REG 253,12 88776 19246625 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
slurmctld 27215 slurm mem REG 253,12 2156160 16778557 /usr/lib64/libc-2.17.so
slurmctld 27215 slurm mem REG 253,12 142232 16989807 /usr/lib64/libpthread-2.17.so
slurmctld 27215 slurm mem REG 253,12 105824 16989809 /usr/lib64/libresolv-2.17.so
slurmctld 27215 slurm mem REG 253,12 1137024 16930765 /usr/lib64/libm-2.17.so
slurmctld 27215 slurm mem REG 253,12 19288 16930763 /usr/lib64/libdl-2.17.so
slurmctld 27215 slurm mem REG 253,12 8436648 332941 /usr/lib64/slurm/libslurmfull.so
slurmctld 27215 slurm mem REG 253,12 163400 16778640 /usr/lib64/ld-2.17.so
slurmctld 27215 slurm 0r CHR 1,3 0t0 1028 /dev/null
slurmctld 27215 slurm 1u unix 0xffff96a423d44000 0t0 317251424 socket
slurmctld 27215 slurm 2u unix 0xffff96a423d44000 0t0 317251424 socket
slurmctld 27215 slurm 3w REG 253,16 34773937 67159866 /var/log/slurm/slurmsched.log
slurmctld 27215 slurm 4u unix 0xffff9664b3e8d800 0t0 317223943 socket
slurmctld 27215 slurm 5wW REG 0,20 6 52855 /run/slurmctld.pid
slurmctld 27215 slurm 6r REG 253,17 10406312 7593 /var/lib/sss/mc/initgroups
slurmctld 27215 slurm 7r REG 253,12 989 22723851 /etc/group
slurmctld 27215 slurm 8r FIFO 0,9 0t0 317120482 pipe
slurmctld 27215 slurm 9w FIFO 0,9 0t0 317120480 pipe
slurmctld 27215 slurm 10r FIFO 0,9 0t0 317120481 pipe
slurmctld 27215 slurm 11w FIFO 0,9 0t0 317120482 pipe
slurmctld 27215 slurm 12u IPv4 317193398 0t0 TCP hamming.hamming.cluster:37548->hamming.hamming.cluster:6819 (ESTABLISHED)
slurmctld 27215 slurm 13r REG 253,17 8406312 6752 /var/lib/sss/mc/passwd
slurmctld 27215 slurm 14w REG 253,16 98964453 67159864 /var/log/slurm/slurmctld.log
slurmctld 27215 slurm 15r REG 253,17 6406312 6753 /var/lib/sss/mc/group
slurmctld 27215 slurm 16u sock 0,7 0t0 327933970 protocol: UNIX
Please attach the contents of /var/log/slurm/slurmsched.log from today.
Please also call:
> sudo netstat -lpnt |grep -i slurm
I see that nothing has been written to /var/log/slurm/slurmsched.log since April 29 of this year.
> sudo netstat -lpnt |grep -i slurm
tcp 0 0 0.0.0.0:6819 0.0.0.0:* LISTEN 3704/slurmdbd
(In reply to Jeff Haferman from comment #11)
> I see that nothing has been written to /var/log/slurm/slurmsched.log since
> April 29 of this year.

Please also provide today's logs from /var/log/slurm/slurmctld.log then.

(In reply to Jeff Haferman from comment #11)
> tcp        0      0 0.0.0.0:6819            0.0.0.0:*               LISTEN      3704/slurmdbd

Looks like slurmctld is not listening for connections. Let's grab a core from it before we restart it:
> sudo gcore 27215

Then call (as root):
> systemctl stop slurmctld
> slurmctld -Dvvvvvvvv

Please attach the slurmctld stdio once it is stable.

Created attachment 26396 [details]
slurmctld.log
Created attachment 26397 [details]
core.27215.gz
(In reply to Jeff Haferman from comment #14)
> Created attachment 26397 [details]
> core.27215.gz

Cores are not useful without all of the binaries and libraries used. Please call this instead:
> gdb -ex 't a a bt full' $PATH_TO_core.27215 $(which slurmctld)

Please provide the output of:
> ip a
Please also check the output of 'dmesg' for any errors, especially OOM errors or messages.

Created attachment 26398 [details]
slurmctld verbose output
Please call with the verbose slurmctld running:
> netstat -lpnt | grep -i -e slurm -e munge
> nc hamming 6817 |hexdump -C #should return a pile of hex chars
Please call this on hamming and at least one of the compute nodes:
> getent hosts 10.1.1.201 hamming
Note I restarted slurmctld; it is now PID 10825.
> sudo gcore 10825

I think you have the arguments reversed in the following?
> gdb -ex 't a a bt full' $PATH_TO_core.10825 $(which slurmctld)

But this works:
> gdb -ex 't a a bt full' $(which slurmctld) $PATH_TO_core.10825

Output attached.

Created attachment 26399 [details]
gdb output
dmesg (and other system logs) don't show anything unusual.
Important interfaces (below) are all up. enp2s0f0 is our Ethernet interface, and ib0 is our InfiniBand interface.
> ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether fc:aa:14:ea:8a:6b brd ff:ff:ff:ff:ff:ff
inet 10.1.1.201/16 brd 10.1.255.255 scope global noprefixroute enp2s0f0
valid_lft forever preferred_lft forever
3: enp2s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether fc:aa:14:ea:8a:6c brd ff:ff:ff:ff:ff:ff
4: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 40:8d:5c:39:56:64 brd ff:ff:ff:ff:ff:ff
inet 172.20.32.200/22 brd 172.20.35.255 scope global noprefixroute ens2f0
valid_lft forever preferred_lft forever
5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 40:8d:5c:39:56:65 brd ff:ff:ff:ff:ff:ff
inet 10.200.1.1/16 brd 10.200.255.255 scope global noprefixroute ens2f1
valid_lft forever preferred_lft forever
6: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:3b:01:2e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.100.1.28/16 brd 10.100.255.255 scope global noprefixroute ib0
valid_lft forever preferred_lft forever
7: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN group default qlen 256
link/infiniband 00:00:00:65:fe:80:00:00:00:00:00:00:24:8a:07:03:00:3b:01:2f brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
8: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 52:54:00:5d:0d:57 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
9: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 1000
link/ether 52:54:00:5d:0d:57 brd ff:ff:ff:ff:ff:ff
> netstat -lpnt | grep -i -e slurm -e munge
tcp 0 0 0.0.0.0:6819 0.0.0.0:* LISTEN 3704/slurmdbd
> nc hamming 6817 |hexdump -C
Ncat: Connection refused.
This is weird...
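The nc test above can also be reproduced with bash's built-in /dev/tcp redirection, which takes nc/ncat version differences out of the picture (a sketch, not from the ticket; the function name is an invention here):

```shell
#!/bin/bash
# Sketch: report whether anything accepts TCP connections on host:port,
# using bash's /dev/tcp so no external tool is required.
port_state() {
    local host=$1 port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "listening"
    else
        echo "refused-or-filtered"
    fi
}
port_state 10.1.1.201 6817
```

"refused-or-filtered" from both the controller and a compute node, as seen here, points at the daemon not binding the port rather than at the network path.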
On hamming:
> getent hosts 10.1.1.201 hamming
10.1.1.201      hamming.hamming.cluster
10.1.1.201      hamming.hamming.cluster

On login node:
> getent hosts 10.1.1.201 hamming
10.1.1.201      hamming.hamming.cluster
10.1.1.201      hamming.hamming.cluster

On random compute node:
> getent hosts 10.1.1.201 hamming
10.1.1.201      hamming.hamming.cluster
10.1.1.201      hamming.hamming.cluster

Please call:
> ip route
> ip route get 10.1.1.201
(In reply to Nate Rini from comment #28)
> Please call:
> > ip route
> > ip route get 10.1.1.201

on hamming and a random compute node

On hamming:
> ip route
default via 172.20.32.1 dev ens2f0 proto static metric 102
10.1.0.0/16 dev enp2s0f0 proto kernel scope link src 10.1.1.201 metric 100
10.100.0.0/16 dev ib0 proto kernel scope link src 10.100.1.28 metric 150
10.200.0.0/16 dev ens2f1 proto kernel scope link src 10.200.1.1 metric 103
172.20.32.0/22 dev ens2f0 proto kernel scope link src 172.20.32.200 metric 102
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1
> ip route get 10.1.1.201
local 10.1.1.201 dev lo src 10.1.1.201
    cache <local>

On random compute node:
> ip route
default via 10.1.1.201 dev enp2s0f0 proto static metric 100
10.1.0.0/16 dev enp2s0f0 proto kernel scope link src 10.1.3.1 metric 100
10.100.0.0/16 dev ib0 proto kernel scope link src 10.100.3.1
169.254.0.0/16 dev ib0 scope link metric 1004
> ip route get 10.1.1.201
10.1.1.201 dev enp2s0f0 src 10.1.3.1
    cache

(In reply to Jeff Haferman from comment #26)
> > nc hamming 6817 |hexdump -C
> Ncat: Connection refused.
>
> This is weird...

Was this on hamming or a compute node?

(In reply to Jeff Haferman from comment #25)
> > netstat -lpnt | grep -i -e slurm -e munge
> tcp        0      0 0.0.0.0:6819            0.0.0.0:*               LISTEN      3704/slurmdbd

Slurmctld is still not listening on :6817. Please call:
> strace -ff -e open,connect,listen -o /tmp/strace.slurmctld.log -- slurmctld -Dvv

Please attach the stdio output and /tmp/strace.slurmctld.log.* once it is stable.

> > nc hamming 6817 |hexdump -C
> Ncat: Connection refused.
I get this on BOTH hamming and a random compute node.
(In reply to Jeff Haferman from comment #33)
> > > nc hamming 6817 |hexdump -C
> > Ncat: Connection refused.
>
> I get this on BOTH hamming and a random compute node.

That at least means this is very unlikely to be a routing issue. The strace logs should show why it is not listening.

Is there anything special about this node? Such as being in a container, VM, or another kind of network virtualization?

Created attachment 26401 [details]
strace stdio
Created attachment 26402 [details]
strace tarball
> Is there anything special about this node? Such as being in a container, VM, or another kind of network virtualization?
No.
> connect(12, {sa_family=AF_INET, sin_port=htons(6819), sin_addr=inet_addr("10.1.1.201")}, 128) = -1 EINPROGRESS (Operation now in progress)
Something in the kernel is blocking slurmctld from opening the listening socket, or the kernel is running extremely slowly. The fastest way to resolve this is likely to restart the node.
Please attach the contents of `dmesg`.
Please also call on hamming:
> sudo ss -at '( sport = :6817 )'
Created attachment 26403 [details]
dmesg
> sudo ss -at '( sport = :6817 )'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
Nothing :(
Please call gcore again against the restarted slurmctld. The last backtrace shows it potentially hung while calling getpwnam_r() which shouldn't happen.
Please also provide:
> cat /etc/nsswitch.conf
> cat /etc/nsswitch.conf
#
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Valid entries include:
#
# nisplus Use NIS+ (NIS version 3)
# nis Use NIS (NIS version 2), also called YP
# dns Use DNS (Domain Name Service)
# files Use the local files
# db Use the local database (.db) files
# compat Use NIS on compat mode
# hesiod Use Hesiod for user lookups
# [NOTFOUND=return] Stop searching if not found so far
#
# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd: db files nisplus nis
#shadow: db files nisplus nis
#group: db files nisplus nis
passwd: files sss
shadow: files sss
group: files sss
#initgroups: files
#hosts: db files nisplus nis dns
hosts: files dns myhostname
# Example - obey only what nisplus tells us...
#services: nisplus [NOTFOUND=return] files
#networks: nisplus [NOTFOUND=return] files
#protocols: nisplus [NOTFOUND=return] files
#rpc: nisplus [NOTFOUND=return] files
#ethers: nisplus [NOTFOUND=return] files
#netmasks: nisplus [NOTFOUND=return] files
bootparams: nisplus [NOTFOUND=return] files
ethers: files
netmasks: files
networks: files
protocols: files
rpc: files
services: files sss
netgroup: files sss
publickey: nisplus
automount: files sss
aliases: files nisplus
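For reference, the later suggestion in this ticket to take sssd out of nsswitch.conf while debugging would amount to dropping `sss` from the lookup chains above, e.g. (a hypothetical fragment, only viable while local files cover all users and groups Slurm needs to resolve):

```
passwd:     files
shadow:     files
group:      files
```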
Created attachment 26404 [details]
2nd attempt at corefile
Created attachment 26405 [details]
gdb output 2nd time
Please restart sssd:
> systemctl restart sssd
> systemctl status sssd
> systemctl restart sssd
Done.

> systemctl status sssd
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-08-19 12:17:58 PDT; 2s ago
 Main PID: 7155 (sssd)
    Tasks: 5
   Memory: 41.1M
   CGroup: /system.slice/sssd.service
           ├─7155 /usr/sbin/sssd -i --logger=files
           ├─7156 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 --logger=files
           ├─7157 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           ├─7158 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
           └─7159 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=files

Aug 19 12:17:58 hamming.uc.nps.edu systemd[1]: Starting System Security Services Daemon...
Aug 19 12:17:58 hamming.uc.nps.edu sssd[7155]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[be[default]][7156]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[be[default]][7156]: Your configuration uses the autofs provider with schema set to rfc2307 ...utes.
Aug 19 12:17:58 hamming.uc.nps.edu sssd[nss][7157]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[autofs][7159]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu sssd[pam][7158]: Starting up
Aug 19 12:17:58 hamming.uc.nps.edu systemd[1]: Started System Security Services Daemon.
Hint: Some lines were ellipsized, use -l to show in full.

That seems to have fixed it!

It's possible sssd hit this bug:
> https://github.com/SSSD/sssd
(In reply to Nate Rini from comment #51)
> It's possible sssd hit this bug:
> > https://github.com/SSSD/sssd

This bug specifically:
> https://github.com/SSSD/sssd/issues/2711

OK, is there a particular clue you saw that suggested we restart sssd?

(In reply to Nate Rini from comment #47)
> (In reply to Jeff Haferman from comment #46)
> > Created attachment 26405 [details]
> > gdb output 2nd time
>
> #0  0x00007fcb47a38bed in poll () from /lib64/libc.so.6
> #1  0x00007fcb473148c8 in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
> #2  0x00007fcb47315346 in sss_nss_make_request_timeout () from /lib64/libnss_sss.so.2
> #3  0x00007fcb47316687 in internal_getgrent_r () from /lib64/libnss_sss.so.2
> #4  0x00007fcb47316fac in _nss_sss_getgrent_r () from /lib64/libnss_sss.so.2
> #5  0x00007fcb47a581c5 in __nss_getent_r () from /lib64/libc.so.6
> #6  0x00007fcb47a077e7 in getgrent_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> #7  0x00000000004455ed in _get_group_members (group_name=0x1be26e0 "kraken") at groups.c:288
> #8  0x0000000000445a1a in get_groups_members (group_names=<optimized out>) at groups.c:170
> #9  0x000000000048c32f in _update_part_uid_access_list (x=0x136a1d0, arg=0x7ffe0f0d42f4) at partition_mgr.c:1951
> #10 0x00007fcb4870001b in list_for_each_max (l=0x1358b40, max=max@entry=0x7ffe0f0d42c4,
> #11 0x00007fcb48700101 in list_for_each (l=<optimized out>, f=f@entry=0x48c2fe <_update_part_uid_access_list>,
> #12 0x00000000004901ae in load_part_uid_allow_list (force=force@entry=1) at partition_mgr.c:1992
> #13 0x00000000004a9a51 in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1373
> #14 0x000000000042f3be in main (argc=<optimized out>, argv=<optimized out>) at controller.c:654

(In reply to Jeff Haferman from comment #53)
> OK, is there a particular clue you saw that suggested we restart sssd?

In the 2nd stack trace attached, it showed slurmctld was again waiting on sssd (via sss_cli_make_request_nochecks()), which should complete in milliseconds.

OK... it (slurm commands) started timing out again. Restart of sssd fixes it. But I'm wondering if it's going to continue to flap.

So far things are stable after the second restart of sssd....

It's continuing to flap :(

Do you think we should reboot the controller node?

(In reply to Jeff Haferman from comment #57)
> It's continuing to flap :(

I assume you mean it's freezing and not responding? Please take another gcore and then restart sssd again to see if it comes back.

(In reply to Jeff Haferman from comment #58)
> Do you think we should reboot the controller node?

That might fix it. sssd is technically not supported by SchedMD support. It might just be easier to take sssd out of nsswitch.conf while debugging the issue with sssd.

OK....

Yes, restarting sssd brings slurm back, but only for a few minutes. I may try a reboot, or we could look at updating to the latest sssd as well, or try your option of removing it from nsswitch.conf.

(In reply to Jeff Haferman from comment #61)
> OK....
>
> yes, restarting sssd brings slurm back. but only for a few minutes. I may
> try a reboot, or we could look at updating to the latest sssd as well. or
> try your option of removing from nsswitch.conf.

Please keep us updated on what your site plans to try.

I rebooted and we're still seeing the "flapping".

For now, I've created a cronjob to restart sssd every 2 minutes. We'll continue to diagnose.

(In reply to Jeff Haferman from comment #63)
> I rebooted and we're still seeing the "flapping".
>
> For now, I've created a cronjob to restart sssd every 2 minutes. We'll
> continue to diagnose.

Slurm expects that libnss will work and will be fast. This needs to be fixed upstream in sssd, or sssd needs to be replaced.

I'm wondering why this is the first time we have run into this issue...
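The point that an NSS group lookup "should complete in milliseconds" can be checked directly from the shell. This sketch (not from the ticket; the function name and the 2-second threshold are arbitrary choices here) times a full group enumeration through NSS, which is essentially what slurmctld was blocked in via getgrent_r():

```shell
#!/bin/bash
# Sketch: time a full group enumeration through NSS (files + sss, per
# nsswitch.conf) and flag it if it takes longer than a healthy lookup should.
nss_group_probe() {
    local start end ms
    start=$(date +%s%N)                  # GNU date, nanosecond resolution
    timeout 10 getent group >/dev/null 2>&1
    end=$(date +%s%N)
    ms=$(( (end - start) / 1000000 ))
    if [ "$ms" -gt 2000 ]; then
        echo "slow: ${ms}ms"
    else
        echo "ok: ${ms}ms"
    fi
}
nss_group_probe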
I'm wondering if you have links to any related issues? I did a search and found
https://bugs.schedmd.com/show_bug.cgi?id=3159
which is old, but mentions sssd. In there, I see the command:
> sacctmgr show problems

When I run this on our system, I see:
   Cluster    Account       User                                  Problem
---------- ---------- ---------- ----------------------------------------
                       bchainou                 User does not have a uid
                     christian+                 User does not have a uid
                        mpakin1                 User does not have a uid
                     stewart.a+                 User does not have a uid

Is this a problem?

(In reply to Jeff Haferman from comment #66)
> I'm wondering if you have links to any related issues? I did a search and
> found https://bugs.schedmd.com/show_bug.cgi?id=3159
> which is old, but mentions sssd. In there, I see the command:
> > sacctmgr show problems
>
> When I run this on our system, I see:
>    Cluster    Account       User                                  Problem
> ---------- ---------- ---------- ----------------------------------------
>                        bchainou                 User does not have a uid
>                      christian+                 User does not have a uid
>                         mpakin1                 User does not have a uid
>                      stewart.a+                 User does not have a uid

This just means that Slurm could not resolve the user id for the given list of users. Most likely, those users have been deleted, or sssd is somehow missing users. It is possible to verify outside of Slurm via:
> getent passwd mpakin1

> This just means that Slurm could not resolve the user id for the given list of users. Most likely, those users have been deleted, or sssd is somehow missing users.
Right, those users don't exist, but there is still an account for them in Slurm.
OK... we'll look at updating sssd.
(In reply to Jeff Haferman from comment #68)
> > This just means that Slurm could not resolve the user id for the given list of users. Most likely, those users have been deleted, or sssd is somehow missing users.
>
> Right, those users don't exist, but there is still an account for them in
> slurm/

Slurm maintains the accounting records indefinitely (unless purging is configured). There is no requirement that those users exist on the system in order to dump accounting data for them.

OK, one of my sys-admins might have come up with a fix... at least so far this is working. I got rid of my cronjob kludge, then:
On the slurm controller node, we removed the sssd cache:
> systemctl stop sssd
> /bin/rm /var/lib/sss/db/*
> systemctl start sssd
This appears to have worked (slurm has been responsive for over 10 minutes, which is the longest since the problem began).
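The cache reset described above can be captured in a small guarded script (a sketch; the commands and paths are the ones from the comment above, while the function name and the DRY_RUN guard are additions for safety):

```shell
#!/bin/bash
# Sketch of the sssd cache reset applied above. With DRY_RUN=1 (the
# default here) the commands are only printed, not executed.
reset_sssd_cache() {
    run() {
        if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi
    }
    run systemctl stop sssd
    run rm -f /var/lib/sss/db/*    # persistent cache; rebuilt on next start
    run systemctl start sssd
}
reset_sssd_cache
```

Clearing the cache forces sssd to refetch group membership from LDAP, which is presumably why it cleared the stale/poisoned entries that were stalling lookups.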
Are there any more related questions?

No, I just want to say thank you for assisting and for your quick responses. This was a nasty one.

Closing per the last comment. Please reply if any more related questions come up.

We have fully resolved this issue, and I just wanted to update here: the root cause was a "bad" (but valid!) entry in our LDAP database. Specifically, a user of the form "foo@bar.com" was entered as a member of a group, where the correct entry should have been "foo". I believe LDAP attempted to authenticate the user by reaching out to "bar.com", which was unavailable, thus the timeout. As far as I can tell, this was only triggered when a member of the group in question attempted to perform some operation using Slurm.