Our SLURM controller is currently hung and unresponsive. Killing and restarting the daemon yields the same state. The log has several messages like the following:

[2020-06-10T16:26:59.950] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=63512707_27(63515356), this should never happen
[2020-06-10T16:26:59.951] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=63515913, this should never happen
[2020-06-10T16:26:59.951] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=63512859, this should never happen

I'll attach the full slurmctld log shortly.
Created attachment 14625 [details] Slurmctld log
That message is a known bug in 19.05, but (to my knowledge) it's harmless unless you shrink a batch job. See bug 7499 for more details. In the meantime, I'll study the slurmctld log. Can you also run gcore on the running slurmctld to get a core dump? Then run these gdb commands against it:

thread apply all backtrace
thread apply all backtrace full
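In case it's useful, here's a rough sketch of capturing that non-interactively (the binary and output paths are assumptions; adjust for your install):

# Dump a core of the running slurmctld without killing it
gcore -o /tmp/slurmctld.core $(pidof slurmctld)

# Collect both backtrace listings in one batch run
gdb -batch -ex 'set pagination off' \
    -ex 'thread apply all backtrace' \
    -ex 'thread apply all backtrace full' \
    /usr/sbin/slurmctld /tmp/slurmctld.core.$(pidof slurmctld) > slurmctld-gdb.txt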
Also, I would encourage you to upgrade to the latest 19.05 release. We always encourage running the latest micro version, since it has the most bug fixes. But before you do that, I'll analyze the logs and the backtrace once you have them.
I'm tailing the logs and it seems like it's not completely hung, but is cycling between doing a little bit of work and hanging. It's not currently responsive to user commands.
Created attachment 14626 [details] gdb
Thanks, I'll take a look at the gdb output. Can you also upload your current slurm.conf? From my first brief look at the logs, there are tons of these two types of messages:

[2020-06-10T16:40:30.717] cleanup_completing: JobId=63512030 completion process took 1922 seconds
[2020-06-10T16:42:59.513] Resending TERMINATE_JOB request JobId=63178784 Nodelist=css-126

My first idea is to add SchedulerParameters=defer to slurm.conf, and also to set the state of all partitions to DRAIN. defer forces slurmctld not to try to schedule jobs on job completion or submission, and draining the partitions won't cancel any currently running jobs but will keep slurmctld from scheduling new ones. Of course this isn't desirable long term, but it might give slurmctld some time to catch up with the traffic. If slurmctld ever does respond to client commands, can you run sdiag and upload the output? When you do, don't run client commands in a loop without any sleeps in between, so as not to effectively DDoS the slurmctld.
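For reference, a minimal sketch of those slurm.conf changes (the partition names and node lists below are placeholders; keep your existing partition options and only change State):

# Don't attempt to schedule on every job submission/completion
SchedulerParameters=defer

# Running jobs continue, but nothing new gets scheduled
PartitionName=general State=DRAIN Nodes=...
PartitionName=gpu State=DRAIN Nodes=...

Once the controller catches up, the partitions can go back to State=UP.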
Created attachment 14627 [details] slurm.conf
Here's the output of sdiag:

*******************************************************
sdiag output at Wed Jun 10 17:16:54 2020 (1591823814)
Data since      Wed Dec 31 19:00:00 1969 (0)
*******************************************************
Server thread count:  123
Agent queue size:     0
Agent count:          0
DBD Agent queue size: 0
Jobs submitted: 0
Jobs started:   0
Jobs completed: 0
Jobs canceled:  0
Jobs failed:    0
Job states ts:  Wed Dec 31 19:00:00 1969 (0)
Jobs pending:   0
Jobs running:   0

Main schedule statistics (microseconds):
	Last cycle:   0
	Max cycle:    0
	Total cycles: 0
	Cycles per minute: 0
	Last queue length: 0

Backfilling stats
	Total backfilled jobs (since last slurm start): 0
	Total backfilled jobs (since last stats cycle start): 0
	Total backfilled heterogeneous job components: 0
	Total cycles: 0
	Last cycle when: Wed Dec 31 19:00:00 1969 (0)
	Last cycle: 0
	Max cycle: 0
	Last depth cycle: 0
	Last depth cycle (try sched): 0
	Last queue length: 0

Latency for 1000 calls to gettimeofday(): 24 microseconds

Remote Procedure Call statistics by message type
	REQUEST_PARTITION_INFO        ( 2009) count:115 ave_time:6550    total_time:753316
	REQUEST_NODE_INFO_SINGLE      ( 2040) count:86  ave_time:991     total_time:85281
	REQUEST_FED_INFO              ( 2049) count:23  ave_time:2667    total_time:61343
	REQUEST_JOB_USER_INFO         ( 2039) count:19  ave_time:8075    total_time:153428
	REQUEST_NODE_INFO             ( 2007) count:7   ave_time:1145801 total_time:8020611
	REQUEST_STATS_INFO            ( 2035) count:3   ave_time:857     total_time:2573
	REQUEST_BUILD_INFO            ( 2001) count:3   ave_time:5127    total_time:15382
	REQUEST_PING                  ( 1008) count:3   ave_time:15771   total_time:47313
	REQUEST_JOB_INFO_SINGLE       ( 2021) count:3   ave_time:184     total_time:554
	REQUEST_SHARE_INFO            ( 2022) count:2   ave_time:520024  total_time:1040048
	MESSAGE_EPILOG_COMPLETE       ( 6012) count:2   ave_time:3410    total_time:6820
	REQUEST_PRIORITY_FACTORS      ( 2026) count:1   ave_time:5896735 total_time:5896735
	REQUEST_JOB_INFO              ( 2003) count:1   ave_time:3954408 total_time:3954408
	REQUEST_ASSOC_MGR_INFO        ( 2043) count:1   ave_time:6153969 total_time:6153969
	REQUEST_SUBMIT_BATCH_JOB      ( 4003) count:0   ave_time:0       total_time:0
	REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:0   ave_time:0       total_time:0

Remote Procedure Call statistics by user
	nobody   (     997) count:172 ave_time:1351    total_time:232435
	rhtoczyd ( 1035122) count:30  ave_time:17306   total_time:519191
	root     (       0) count:21  ave_time:854235  total_time:17938944
	lotzsamu (  758827) count:16  ave_time:14536   total_time:232586
	keenandr (  256005) count:8   ave_time:495750  total_time:3966002
	jal      (  175025) count:6   ave_time:580     total_time:3483
	vanant25 (  991427) count:3   ave_time:1082378 total_time:3247135
	benucci  (  833945) count:3   ave_time:2210    total_time:6632
	mamooz   (  961043) count:3   ave_time:3546    total_time:10640
	wuyunqiq (  918863) count:3   ave_time:1966    total_time:5900
	jansenn1 ( 1027710) count:3   ave_time:6926    total_time:20778
	bazavov  (  874719) count:1   ave_time:8055    total_time:8055
	sanch373 ( 1002301) count:0   ave_time:0       total_time:0
	rysewykd (  830611) count:0   ave_time:0       total_time:0
	nisamehr ( 1025924) count:0   ave_time:0       total_time:0
	pelicion ( 1028178) count:0   ave_time:0       total_time:0
	rahnamou ( 1036377) count:0   ave_time:0       total_time:0
	koppercl (  988288) count:0   ave_time:0       total_time:0
	sinzpaul (  921266) count:0   ave_time:0       total_time:0
	tdeyoung (  799192) count:0   ave_time:0       total_time:0
	baclark  ( 1035984) count:0   ave_time:0       total_time:0
	greyjosh ( 1037610) count:0   ave_time:0       total_time:0
	peiskera (  876366) count:0   ave_time:0       total_time:0
	nowick70 (  996635) count:0   ave_time:0       total_time:0
	twagiray (  961133) count:0   ave_time:0       total_time:0
	nguye639 (  949996) count:0   ave_time:0       total_time:0
	kowalc40 (  936676) count:0   ave_time:0       total_time:0
	priesbr1 (  929898) count:0   ave_time:0       total_time:0
	yarranto (  767367) count:0   ave_time:0       total_time:0
	fordste5 (  885046) count:0   ave_time:0       total_time:0

Pending RPC statistics
	No pending RPCs
From the gdb output, there are 268 threads right now:

  231  _slurm_rpc_epilog_complete
   15  _slurm_rpc_complete_batch_script

All the epilog complete RPCs are waiting in _throttle_start() on this line:

slurm_cond_wait(&throttle_cond, &throttle_mutex);

It looks like a storm of job completions is slowing down slurmctld right now, so I definitely would like to try setting defer and setting all partitions to DRAIN in slurm.conf. Restart slurmctld to make the changes take effect, since scontrol reconfigure probably won't get through while it's this unresponsive.
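For reference, that per-RPC thread tally came from the backtrace dump; something like this reproduces it (the file name matches the earlier capture sketch and is an assumption):

# Rough count of threads per RPC handler in the saved backtraces
grep -oE '_slurm_rpc_[a-z_]+' slurmctld-gdb.txt | sort | uniq -c | sort -rn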
Looking at sdiag:

*******************************************************
sdiag output at Wed Jun 10 17:16:54 2020 (1591823814)
Data since      Wed Dec 31 19:00:00 1969 (0)
*******************************************************

Data since the Unix epoch? I'm guessing you got the sdiag output before that value in slurmctld was updated, which probably means not much time had passed since a restart. With this in mind, the data covers only what happened since then.

I'm concerned about the top RPCs and top user:

Remote Procedure Call statistics by message type
	REQUEST_PARTITION_INFO   ( 2009) count:115 ave_time:6550 total_time:753316
	REQUEST_NODE_INFO_SINGLE ( 2040) count:86  ave_time:991  total_time:85281

Remote Procedure Call statistics by user
	nobody   (     997) count:172 ave_time:1351   total_time:232435
	rhtoczyd ( 1035122) count:30  ave_time:17306  total_time:519191
	root     (       0) count:21  ave_time:854235 total_time:17938944

Since you probably don't have a user named "nobody", this means that getpwuid_r returned an error for that UID. I'm also concerned that the top RPCs are requests for partition info and node info. Is somebody running a script that calls squeue or scontrol in a loop?

It's also interesting that very few new MESSAGE_EPILOG_COMPLETE messages have come in since the last restart, even though that was by far the most common RPC in the gdb backtrace. Obviously some threads could have completed and different RPCs arrived in the meantime.

Are you able to get a new gcore and run those gdb commands again, maybe waiting a few minutes after restarting slurmctld with the changes from my last comment? Let me know if slurmctld starts responding better or worse.
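One low-impact way to look for a client loop is to sample sdiag twice with a generous sleep and compare the per-user counts (the interval here is arbitrary):

sdiag > sdiag.before
sleep 120
sdiag > sdiag.after
diff sdiag.before sdiag.after | grep count:

Whichever user's count climbs fastest between samples is the likely source of the traffic.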
Do the job prolog(s)/epilog(s) in /etc/slurm/prolog.d/* and /etc/slurm/epilog.d/* call squeue/sinfo/scontrol at all? That could aggravate the problem when too many jobs complete at once, and likewise for such calls in the job prolog when too many jobs start at once. We've seen other sites hit this issue.
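If it's quicker, a one-line audit of those directories for client calls (paths are the ones named above):

grep -rnE 'squeue|sinfo|scontrol' /etc/slurm/prolog.d /etc/slurm/epilog.d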
Created attachment 14628 [details] gdb
Steve, is it possible to apply this patch and restart slurmctld?

https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87
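A sketch of one way to apply it, assuming your existing rpmbuild tree (appending .patch to a GitHub commit URL downloads the commit as a patch file):

cd /root/rpmbuild/BUILD/slurm-19.05.1-4
curl -L https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87.patch | patch -p1
# then rebuild and restart slurmctld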
I think the xstrfmtcatat function was added in a newer version of SLURM:

/root/rpmbuild/BUILD/slurm-19.05.1-4/src/slurmctld/job_mgr.c:10612: undefined reference to `xstrfmtcatat'
We've had all partitions set to DRAIN and defer set in SchedulerParameters for some time now. I'm not sure that many jobs completing is what's causing this. Judging by "_job_complete" entries in the logs, job completion has been lower today than during past spikes that caused no issue. Our prolog scripts do call slurm commands, but if the prologs were the issue, I would expect the system to recover when drained. The epilog doesn't call any slurm commands.
(In reply to Steve Ford from comment #16)
> I think the xstrfmtcatat function was added in a newer version of SLURM:
>
> /root/rpmbuild/BUILD/slurm-19.05.1-4/src/slurmctld/job_mgr.c:10612:
> undefined reference to `xstrfmtcatat'

This looks like a great reason to upgrade to slurm-19-05-7-1, as xstrfmtcatat() was added in slurm-19-05-2-1. You should be able to safely upgrade slurmdbd and slurmctld for now and then push the upgrade out to the nodes eventually.

It looks like the issue may be caused by user kowal165 (or whoever owns "/mnt/ufs18/home-049/kowal165/test_sb_files/test1000.sb"). From the core in comment #12, can you please provide this gdb output:

> set pagination off
> set print pretty on
> t 263
> f 2
> p *job_ptr
> p *job_ptr->details
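For the upgrade itself, a rough sketch of the RPM route (the tarball name and paths are assumptions; the documented order is slurmdbd first, then slurmctld, then the slurmd nodes):

# Rebuild RPMs from the new release tarball
rpmbuild -ta slurm-19.05.7.tar.bz2

# Upgrade the relevant packages on the dbd host, then the controller host
rpm -Uvh /root/rpmbuild/RPMS/x86_64/slurm-*19.05.7*.rpm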
(gdb) t 263
[Switching to thread 263 (Thread 0x7f662a09d700 (LWP 53162))]
#0  0x00007f6637750995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) f 2
#2  0x000000000048badf in _slurm_rpc_step_complete (msg=0x7f662a09ce50, running_composite=<optimized out>) at proc_req.c:3768
3768	proc_req.c: No such file or directory.
(gdb) p *job_ptr
No symbol "job_ptr" in current context.
(gdb) p *job_ptr->details
No symbol "job_ptr" in current context.

I will begin upgrading to 19.05.7 with your suggested patch.
Hey Steve, can you save a copy of the old slurmctld binary before upgrading, so we can pull more information out of the core later?
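Something like this is enough (the destination path is arbitrary):

cp -p /usr/sbin/slurmctld /root/slurmctld-19.05.1.bak

gdb needs the exact binary that produced a core to resolve symbols correctly, so keeping the pre-upgrade binary lets us keep mining the old cores.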
Also did you run those gdb commands on the second core? If so that explains why the output was nonsensical. Can you type thread 264 instead? That's the thread we're looking for in the second core:

Thread 264 (Thread 0x7f65aaa9c700 (LWP 53163)):
#0  0x00007f6637ce2b8c in makespace (str=str@entry=0x7f65aaa9b8c0, needed=11) at xstring.c:106
#1  0x00007f6637ce2d75 in _xstrcat (str1=str1@entry=0x7f65aaa9b8c0, str2=<optimized out>) at xstring.c:134
#2  0x00000000004401fd in _pack_default_job_details (job_ptr=job_ptr@entry=0x26bb5f60, buffer=buffer@entry=0x7f65e0011e30, protocol_version=protocol_version@entry=8704) at job_mgr.c:10614
We're up with that patch on 19.05.7. Here is the gdb info for thread 264:

(gdb) t 264
[Switching to thread 264 (Thread 0x7f65aaa9c700 (LWP 53163))]
#0  xstrcasecmp (s1=0x7f65e0007ff9 "b test1745.sb test1746.sb test1747.sb test1748.sb test1749.sb test174.sb test1750.sb test1751.sb test1752.sb test1753.sb test1754.sb test1755.sb test1756.sb test1757.sb test1758.sb test1759.sb test175"..., s2=0xb <Address 0xb out of bounds>) at xstring.c:605
605	xstring.c: No such file or directory.
(gdb) f 2
#2  0x0000000026bcd790 in ?? ()
(gdb) p *job_ptr
No symbol "job_ptr" in current context.
(gdb) p *job_ptr->details
No symbol "job_ptr" in current context.

I did point it at the saved copy of the old slurmctld.
Oh well, too bad about the weird backtrace/core file. It does look like that job has a lot of arguments; the arguments that do appear in the gdb output are cut off, and that thread is in the problematic part of the code. By "up" with the patch, do you mean slurmctld is running and responsive, and partitions are up and scheduling again?
Yes. Slurmctld is running and responsive and all partitions are up and jobs are scheduling.
Bumping the priority down based on comment #25
Good morning Steve, is slurmctld still up and running smoothly today? If so, I'll mark this as a duplicate of bug 8978. Many thanks to Nate for recognizing this as a duplicate.
Yes, slurmctld is still up and running smoothly.
Great to hear. I'll close this as a dup of bug 8978. *** This ticket has been marked as a duplicate of ticket 8978 ***