| Summary: | Sinfo/sbatch/squeue cannot connect to server. | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Broadley <bill.broadley> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart, nate |
| Version: | 19.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NREL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 19.05.2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmctld.log, slurm.conf, Requested gdb log from core dump, patch for 19.05.2 | | |
Created attachment 15791 [details]
slurm.conf
Please call the following on your controller:
> scontrol ping
> sdiag
> sacctmgr show stats
In addition to Nate's request, do you know what was happening before this event? Was there a large batch of jobs completing? Would you also let us know what your Prolog and Epilog scripts are doing?

```
[root@emgmt1 etc]# scontrol ping
Slurmctld(primary) at emgmt1 is DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************
[root@emgmt1 ~]# sacctmgr show stats
Rollup statistics
 Hour  count:2 ave_time:11682051 max_time:15003950 total_time:23364102
 Day   count:2 ave_time:2043439  max_time:3652675  total_time:4086878
 Month count:0 ave_time:0        max_time:0        total_time:0
Remote Procedure Call statistics by message type
 DBD_JOB_COMPLETE        ( 1424) count:530 ave_time:820    total_time:434931
 DBD_FINI                ( 1401) count:379 ave_time:911937 total_time:345624311
 SLURM_PERSIST_INIT      ( 6500) count:132 ave_time:389    total_time:51380
 DBD_GET_QOS             ( 1448) count:131 ave_time:553    total_time:72530
 DBD_GET_ASSOCS          ( 1410) count:127 ave_time:7781   total_time:988208
 DBD_MODIFY_ASSOCS       ( 1429) count:122 ave_time:1935   total_time:236077
 DBD_STEP_COMPLETE       ( 1441) count:16  ave_time:9769   total_time:156309
 DBD_SEND_MULT_MSG       ( 1474) count:5   ave_time:103461 total_time:517306
 DBD_GET_JOBS_COND       ( 1444) count:4   ave_time:9382   total_time:37529
 DBD_GET_TRES            ( 1486) count:3   ave_time:444    total_time:1334
 DBD_STEP_START          ( 1442) count:2   ave_time:6933   total_time:13867
 DBD_REGISTER_CTLD       ( 1434) count:1   ave_time:955    total_time:955
 DBD_CLUSTER_TRES        ( 1407) count:1   ave_time:532    total_time:532
 DBD_GET_FEDERATIONS     ( 1494) count:1   ave_time:300    total_time:300
 DBD_GET_USERS           ( 1415) count:1   ave_time:48797  total_time:48797
 DBD_GET_RES             ( 1478) count:1   ave_time:1366   total_time:1366
 DBD_SEND_MULT_JOB_START ( 1472) count:1   ave_time:250493 total_time:250493
 DBD_MODIFY_RESV         ( 1463) count:1   ave_time:702    total_time:702
Remote Procedure Call statistics by user
 root     (      0) count:878 ave_time:394930 total_time:346749090
 slurm    (    989) count:564 ave_time:2916   total_time:1644696
 kregimba ( 120043) count:16  ave_time:2696   total_time:43141
[root@emgmt1 ~]#
```

I ran sdiag, but it's just hanging; I'll post again if it returns. I'll follow up with the prolog/epilog questions.

(In reply to Bill Broadley from comment #4)
> I ran sdiag, but it's just hanging; I'll post again if it returns. I'll
> follow up with the prolog/epilog questions.

Please use gcore to take a core dump from slurmctld and then dump the backtrace:
> pgrep slurmctld | xargs -i gcore -a {}
> gdb $(which slurmctld) $PATH_TO_CORE
> set pagination off
> set print pretty on
> t a a bt full

Created attachment 15797 [details]
Requested gdb log from core dump
Bill,

This is very likely a duplicate of bug#8978. Your install is too far behind current to apply the patch directly (https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87). The issue can be resolved either by clearing out the existing jobs in StateSaveLocation, or we can do a one-off patch specific to your release version. If you want to try a patch, please send us the output of:
> slurmctld -V

--Nate

We have a substantial number of jobs running, and we would like to avoid losing that state if at all possible. Could you send a patch specific to our version:

```
[root@emgmt1 log]# ls -al /proc/14477/exe
lrwxrwxrwx. 1 slurm geoclue 0 Sep  8 11:35 /proc/14477/exe -> /nopt/slurm/19.05.2/sbin/slurmctld
[root@emgmt1 log]# /nopt/slurm/19.05.2/sbin/slurmctld -V
slurm 19.05.2
```

I looked at the patch and it was super simple, so I applied it manually. Do you think this is likely to work:

```
[root@emgmt1 slurmctld]# diff job_mgr.c.orig job_mgr.c
10610c10610
< 	char *cmd_line = NULL;
---
> 	char *cmd_line = NULL, *pos = NULL;
10612,10614c10612,10614
< 		if (i != 0)
< 			xstrcatchar(cmd_line, ' ');
< 		xstrcat(cmd_line, detail_ptr->argv[i]);
---
> 		xstrfmtcatat(cmd_line, &pos, "%s%s",
> 			     (i ? " " : ""),
> 			     detail_ptr->argv[i]);
[root@emgmt1 slurmctld]#
```

It looks to me to match the patch that you linked to.

Created attachment 15801 [details]
patch for 19.05.2

(In reply to Bill Broadley from comment #11)
> I looked at the patch and it was super simple, so I applied it manually. Do
> you think this is likely to work:

The patch is attached, and yes, it was simple. It looks like 19.05.2 already had the earlier patch that added xstrfmtcatat(), which was what I was worried about. Please give it a try.

(In reply to Nate Rini from comment #12)
> The patch is attached, and yes, it was simple. It looks like 19.05.2 already
> had the earlier patch that added xstrfmtcatat(), which was what I was worried
> about. Please give it a try.

Reducing this to SEV3, as a workaround has been provided.

Thanks, the patch worked, and things are back to normal. Quite a few jobs survived.
We plan to upgrade to the current SchedMD-recommended LTS release in early October; I'll open a separate ticket for specific recommendations. I'm closing this ticket.
Created attachment 15790 [details]
slurmctld.log

We have Slurm 19.05.2 installed, compiled with the SchedMD-recommended patches. We had what looked like an NFS problem on Sunday, and the Slurm controller was likely offline from Sunday until this morning (Tuesday). I'll attach the slurm.conf.

It looks like slurmctld is running and has the socket open:

```
# lsof | grep slurmctld | grep LISTEN
slurmctld 14477       slurm  4u IPv4 118937 0t0 TCP *:pentbox-sim (LISTEN)
slurmctld 14477 14478 slurm  4u IPv4 118937 0t0 TCP *:pentbox-sim (LISTEN)
slurmctld 14477 14482 slurm  4u IPv4 118937 0t0 TCP *:pentbox-sim (LISTEN)
slurmctld 14477 14878 slurm  4u IPv4 118937 0t0 TCP *:pentbox-sim (LISTEN)
```

Sinfo running on the same machine:

Iptables shows minimal rules:

```
# iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere             udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere             udp dpt:bootps
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             192.168.122.0/24     ctstate RELATED,ESTABLISHED
ACCEPT     all  --  192.168.122.0/24     anywhere
ACCEPT     all  --  anywhere             anywhere
REJECT     all  --  anywhere             anywhere             reject-with icmp-port-unreachable
REJECT     all  --  anywhere             anywhere             reject-with icmp-port-unreachable

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere             udp dpt:bootpc
```

When starting slurmctld manually with -D -v -v -d, I did see:

slurmctld: server_thread_count over limit (256), waiting

I also included a snippet of slurmctld.log. Any ideas why Slurm won't allow connections to port 6817 so that sinfo/sbatch can work?