I need help understanding this message I have seen on a few nodes. The nodes are in the down state due to NODE_FAIL:

catamount    up   infinite    4  down*  n0002.catamount0,n0051.catamount0,n0062.catamount0,n0081.catamount0

sinfo -R --partition=catamount
Not responding       slurm     2018-08-08T02:59:47 n0051.catamount0

Job(s) were killed due to this failure:
[2018-08-08T02:59:47.722] Killing job_id 13827786 on failed node n0051.catamount0

The line I see in slurmd.log on a few of the nodes is:
[2018-08-08T02:54:47.449] active_threads == MAX_THREADS(256)

I noticed that MAX_THREADS is set to 256 in slurmd.c:
#define MAX_THREADS 256

I am unfamiliar with this and would like advice on how to resolve the issue or how to troubleshoot its cause.
Any update on this? Can we change the MAX_THREADS count in slurmd.c to a value higher than 256? Do you have any recommendations on what value it should be set to?

Thanks,
Jackie
Hi Jackie,

I'm looking into this.

- How often is this happening? Every day? Every week? How many nodes per day or week?
- Can you upload a slurmd log file from one of the failed nodes?
- Can you upload your slurm.conf as well?

You could raise MAX_THREADS above 256, though depending on what the actual problem is, that may not help the situation at all. I don't recommend it right now.
At this time I do not have the log files to show you. I have not seen it happening since the day I reported it last week. If I see it happening again I will definitely capture the logs and send them to you along with what I see on each of the nodes. In all cases I do remember seeing orphaned slurmstepd processes hanging around on the node, causing me to have to restart slurmd on the node after killing the stepd process. That brought the node back online.
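For reference, the manual cleanup on an affected node looks roughly like the sketch below. The PID and node name are just placeholders, slurmd may be managed differently on your systems, and the scontrol step is only needed if the node does not return to service on its own once slurmd re-registers:

# on the affected compute node
kill -9 <orphaned_stepd_pid>      # the leftover slurmstepd process
systemctl restart slurmd          # or however slurmd is started on the node

# from a management host, only if the node stays down
scontrol update nodename=n0051 state=resume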
Okay. If it happens again, I'd like to see the slurmd log file.

(In reply to Jacqueline Scoggins from comment #3)
> In all cases I do remember seeing orphaned slurmstepd processes hanging
> around on the node, causing me to have to restart slurmd on the node after
> killing the stepd process. That brought the node back online.

That is really important. This may or may not be a separate issue, or might be the cause of the issue. Whenever this happens, can you gdb attach to the hung stepd process and get the following:

thread apply all bt full

We've fixed a couple of different problems (in versions after 17.11.3) that caused stepds to deadlock, so your issue may have already been fixed.
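If it's easier, a non-interactive capture along these lines should work; the PID and the output file name are just placeholders:

gdb -p <stepd_pid> -batch -ex "thread apply all bt full" > stepd_backtrace.txt 2>&1

Then attach the resulting file to this ticket.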
I am now seeing this message on the master server:

server_thread_count over limit (256), waiting
[2018-08-28T14:37:19.420] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

The server is still running, but I am not sure where this value is set.

Thanks,
Jackie
That just means your controller is busy and has a lot of RPCs to take care of, but it doesn't have anything to do with this bug. If you start seeing performance issues, feel free to submit a separate bug.

max_rpc_cnt is a slurm.conf option. See the slurm.conf man page.
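For example, it goes in SchedulerParameters in slurm.conf; the value below is only illustrative (see the man page for guidance), and it should be added alongside any SchedulerParameters options you already have:

SchedulerParameters=max_rpc_cnt=150

An 'scontrol reconfigure' picks up the change without restarting slurmctld.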
Ok, thanks.

Jackie
Created attachment 7714 [details]
catamount107.082818.gz

I just found one. It is possible it is user-caused, but I don't know what is causing it.

[root@n0107.catamount0 ~]# tail /var/log/slurm/slurmd.log
[2018-08-24T08:21:37.536] launch task 13898426.150 request from 42081.505@10.0.29.104 (port 45261)
[2018-08-24T01:21:37.568] [13898426.150] Munge cryptographic signature plugin loaded
[2018-08-24T01:21:37.570] [13898426.150] spank_collect_script: Job script /var/spool/slurmd/job13898426/slurm_script does not exist, ignore
[2018-08-24T01:21:37.573] [13898426.150] task/cgroup: /slurm/uid_42081/job_13898426: alloc=64000MB mem.limit=64000MB memsw.limit=64000MB
[2018-08-24T01:21:37.573] [13898426.150] task/cgroup: /slurm/uid_42081/job_13898426/step_150: alloc=64000MB mem.limit=64000MB memsw.limit=64000MB
[2018-08-24T01:21:37.574] [13898426.150] debug level = 2
[2018-08-24T01:21:37.575] [13898426.150] starting 1 tasks
[2018-08-24T01:21:37.576] [13898426.150] task 2 (31465) started 2018-08-24T01:21:37
[2018-08-24T01:30:51.320] [13898426.150] task 2 (31465) exited with exit code 0.
[2018-08-24T19:02:30.917] active_threads == MAX_THREADS(256)

Current state of the node:

sinfo -R --partition=catamount | grep 107
Not responding       slurm     2018-08-28T12:55:51 n0107.catamount0
perceus-00  n0107.catamoun+  2018-08-28T12:55:51  Unknown  DOWN*  Not responding  slurm(497)

[root@n0107.catamount0 ~]# ps -eaef | grep slurm
root       5009      1  0 Aug20 ?        00:00:06 /usr/sbin/slurmd
root      31459      1  0 Aug24 ?        00:00:00 slurmstepd: [13898426.150]

sacct -j 13898426 -X -o 'jobid,user,account,partition,reqmem,ncpus,start,end,exitcode'
       JobID      User    Account  Partition     ReqMem  NCPUS               Start                 End ExitCode
------------ --------- ---------- ---------- ---------- ------ ------------------- ------------------- --------
    13898426      ding  catamount  catamount    62.50Gn     64 2018-08-23T02:26:08 2018-08-24T03:14:33      0:0

I've attached a gdb backtrace of the slurmstepd process for you. Please let me know if you need anything else.

Jackie
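For what it's worth, a rough sketch of how such orphans could be flagged automatically, comparing the job ID embedded in each slurmstepd process name against squeue. The script itself is just an illustration, not anything shipped with Slurm:

#!/bin/bash
# Illustrative helper: flag slurmstepd processes whose job is no longer in the queue.
# Note: a step that finished moments ago may show up briefly as a false positive.
ps -eo pid,cmd | awk '/[s]lurmstepd:/ {print $1, $3}' | while read -r pid step; do
    jobid=$(echo "$step" | tr -d '[]' | cut -d. -f1)
    if [ -z "$(squeue -h -j "$jobid" 2>/dev/null)" ]; then
        echo "possible orphaned stepd: pid=$pid jobid=$jobid"
    fi
done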
The hung/deadlocked stepd is a duplicate of bug 5103. There was a series of commits that landed in 17.11.6 and 17.11.7 that fix this. I recommend upgrading to the most recent release of 17.11 (17.11.9-2), or waiting for 17.11.10, which will probably be released in the next few weeks.

This also looks like it is probably the cause of the slurmd NODE_FAIL / "active_threads == MAX_THREADS(256)" issue, since your backtrace has 259 threads.
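Once you've upgraded, you can confirm the running versions with something like:

sinfo -V                                      # version of the client commands
scontrol show config | grep SLURM_VERSION     # version reported by the controller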
Can I help with anything else on this ticket?
No, we will have to schedule an upgrade to fix this issue in the near future. You can close this ticket, and if we experience problems after the upgrade I will open a new ticket. Thank you for your help.

Jackie
Sounds good. Closing as resolved/duplicate of bug 5103.

*** This ticket has been marked as a duplicate of ticket 5103 ***
*** Ticket 6890 has been marked as a duplicate of this ticket. ***