Ticket 5545

Summary:	slurmd NODE_Fail due to active_threads=maxthreads(256)
Product:	Slurm	Reporter:	Wei Feinstein <wfeinstein>
Component:	slurmd	Assignee:	Marshall Garey <marshall>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	17.11.3
Hardware:	Linux
OS:	Linux
Site:	LBNL - Lawrence Berkeley National Laboratory	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	catamount107.082818.gz

Description Wei Feinstein 2018-08-09 18:49:41 MDT

I need help understanding this message I have seen on a few nodes 

Nodes are in down state due to NODE_FAIL - 
catamount    up   infinite      4  down* n0002.catamount0,n0051.catamount0,n0062.catamount0,n0081.catamount0

sinfo -R --partition=catamonut

Not responding       slurm     2018-08-08T02:59:47 n0051.catamount0

Job(s) were killed due to this failure:
[2018-08-08T02:59:47.722] Killing job_id 13827786 on failed node n0051.catamount0

The line I see on a few nodes in slurmd.log was 

[2018-08-08T02:54:47.449] active_threads == MAX_THREADS(256)


I noticed that the MAX_THREADS is set in slurmd.c to 256.  
#define MAX_THREADS		256

Unfamiliar with this and would like advice for how to resolve this issue or troubleshoot the cause of the issue.

Comment 1 Wei Feinstein 2018-08-27 10:32:37 MDT

any update to this. Can we change the MAX_Thread count in slurmd.c to a value higher than 256?  Do you have any recommendations on what value should be set?

Thanks

Jackie

Comment 2 Marshall Garey 2018-08-27 11:35:01 MDT

Hi Jackie,

I'm looking into this.

- How often is this happening? Every day? Every week? How many nodes per day or week?
- Can you upload a slurmd log file from one of the failed nodes?
- Can you upload your slurm.conf as well?

You could change MAX_THREAD higher than 256, though depending on what the actual problem is, that may not help the situation at all. I don't recommend it right now.

Comment 3 Wei Feinstein 2018-08-27 17:45:55 MDT

At this time I do not have the log files to show you.  I have not seen it happening since the day I reported it last week.  If I see it happening again I will definitely capture the logs and send them to you along with what I see on each of the node.  In all cases I do remember seeing an orphaned slurmstepd processes hanging around on the node causing me to have to restart slurmd on the node after killing the stepd process. And that brought the node back online.

Comment 4 Marshall Garey 2018-08-28 10:38:59 MDT

Okay. If it happens again, I'd like to see the slurmd log file.

(In reply to Jacqueline Scoggins from comment #3)
> In all cases I do remember seeing an
> orphaned slurmstepd processes hanging around on the node causing me to have
> to restart slurmd on the node after killing the stepd process. And that
> brought the node back online.

That is really important. This may or may not be a separate issue, or might be the cause of the issue. Whenever this happens, can you gdb attach to the hung stepd process and get the following:

thread apply all bt full

We've fixed a couple different problems (in versions after 17.11.3) that caused stepd's to deadlock, so your issue may have already been fixed.

Comment 5 Wei Feinstein 2018-08-28 15:39:09 MDT

I am now seeing this message on the master server -

server_thread_count over limit (256), waiting
[2018-08-28T14:37:19.420] backfill: 256 pending RPCs at cycle end, consider
configuring max_rpc_cnt

The server is still running but I am not sure where this value is set.

Thanks

Jackie

On Tue, Aug 28, 2018 at 9:38 AM, <bugs@schedmd.com> wrote:

> *Comment # 4 <https://bugs.schedmd.com/show_bug.cgi?id=5545#c4> on bug
> 5545 <https://bugs.schedmd.com/show_bug.cgi?id=5545> from Marshall Garey
> <marshall@schedmd.com> *
>
> Okay. If it happens again, I'd like to see the slurmd log file.
>
> (In reply to Jacqueline Scoggins from comment #3 <https://bugs.schedmd.com/show_bug.cgi?id=5545#c3>)> In all cases I do remember seeing an
> > orphaned slurmstepd processes hanging around on the node causing me to have
> > to restart slurmd on the node after killing the stepd process. And that
> > brought the node back online.
>
> That is really important. This may or may not be a separate issue, or might be
> the cause of the issue. Whenever this happens, can you gdb attach to the hung
> stepd process and get the following:
>
> thread apply all bt full
>
> We've fixed a couple different problems (in versions after 17.11.3) that caused
> stepd's to deadlock, so your issue may have already been fixed.
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>

Comment 6 Marshall Garey 2018-08-28 15:43:46 MDT

That just means your controller is busy and has a lot of RPC's to take care of, but it doesn't have anything to do with this bug. If you start seeing performance issues, feel free to submit a separate bug.

max_rpc_cnt is a slurm.conf option. See the slurm.conf man page.

Comment 7 Wei Feinstein 2018-08-28 16:12:16 MDT

Ok thanks

Jackie

On Tue, Aug 28, 2018 at 2:43 PM, <bugs@schedmd.com> wrote:

> *Comment # 6 <https://bugs.schedmd.com/show_bug.cgi?id=5545#c6> on bug
> 5545 <https://bugs.schedmd.com/show_bug.cgi?id=5545> from Marshall Garey
> <marshall@schedmd.com> *
>
> That just means your controller is busy and has a lot of RPC's to take care of,
> but it doesn't have anything to do with this bug. If you start seeing
> performance issues, feel free to submit a separate bug.
>
> max_rpc_cnt is a slurm.conf option. See the slurm.conf man page.
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>

Comment 8 Wei Feinstein 2018-08-28 19:42:44 MDT

Created attachment 7714 [details]
catamount107.082818.gz

I just found one and it is possible it is user caused by I don't know what
is causing it.

[root@n0107.catamount0 ~]#  tail /var/log/slurm/slurmd.log

[2018-08-24T08:21:37.536] launch task 13898426.150 request from
42081.505@10.0.29.104 (port 45261)

[2018-08-24T01:21:37.568] [13898426.150] Munge cryptographic signature
plugin loaded

[2018-08-24T01:21:37.570] [13898426.150] spank_collect_script: Job script
/var/spool/slurmd/job13898426/slurm_script does not exist, ignore

[2018-08-24T01:21:37.573] [13898426.150] task/cgroup:
/slurm/uid_42081/job_13898426: alloc=64000MB mem.limit=64000MB
memsw.limit=64000MB

[2018-08-24T01:21:37.573] [13898426.150] task/cgroup:
/slurm/uid_42081/job_13898426/step_150: alloc=64000MB mem.limit=64000MB
memsw.limit=64000MB

[2018-08-24T01:21:37.574] [13898426.150] debug level = 2

[2018-08-24T01:21:37.575] [13898426.150] starting 1 tasks

[2018-08-24T01:21:37.576] [13898426.150] task 2 (31465) started
2018-08-24T01:21:37

[2018-08-24T01:30:51.320] [13898426.150] task 2 (31465) exited with exit
code 0.

[2018-08-24T19:02:30.917] active_threads == MAX_THREADS(256)


Current state of the node -
sinfo -R --partition=catamount | grep 107
Not responding       slurm     2018-08-28T12:55:51 n0107.catamount0

perceus-00 n0107.catamoun+ 2018-08-28T12:55:51             Unknown  DOWN*
              Not responding slurm(497)

[root@n0107.catamount0 ~]# !ps

ps -eaef|grep slurm

root       5009      1  0 Aug20 ?        00:00:06 /usr/sbin/*slurm*d

root      31459      1  0 Aug24 ?        00:00:00 *slurm*stepd:
[13898426.150]

sacct -j 13898426 -X -o
'jobid,user,account,partition,reqmem,ncpus,start,end,exitcode'

       JobID      User    Account  Partition     ReqMem      NCPUS
      Start                 End ExitCode

------------ --------- ---------- ---------- ---------- ----------
------------------- ------------------- --------

13898426          ding  catamount  catamount    62.50Gn         64
2018-08-23T02:26:08 2018-08-24T03:14:33      0:0

I've attached a gdb of the slurmstepd process for you.

Please let me know if you need anything else.


Jackie




On Mon, Aug 27, 2018 at 10:35 AM, <bugs@schedmd.com> wrote:

> *Comment # 2 <https://bugs.schedmd.com/show_bug.cgi?id=5545#c2> on bug
> 5545 <https://bugs.schedmd.com/show_bug.cgi?id=5545> from Marshall Garey
> <marshall@schedmd.com> *
>
> Hi Jackie,
>
> I'm looking into this.
>
> - How often is this happening? Every day? Every week? How many nodes per day or
> week?
> - Can you upload a slurmd log file from one of the failed nodes?
> - Can you upload your slurm.conf as well?
>
> You could change MAX_THREAD higher than 256, though depending on what the
> actual problem is, that may not help the situation at all. I don't recommend it
> right now.
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>

Comment 9 Marshall Garey 2018-08-29 09:41:49 MDT

The hung/deadlocked stepd is a duplicate of bug 5103. There were series of commits landed in 17.11.6 and 17.11.7 that fix this. I recommend upgrading to the most recent release of 17.11 (17.11.9-2), or waiting for 17.11.10 which will probably be released in the next few weeks.

It looks like is probably the cause of the "slurmd NODE_FAIL due to active_threads=maxthreads(256)" issue, since your backtrace has 259 threads.

Comment 10 Marshall Garey 2018-09-04 10:17:47 MDT

Can I help with anything else on this ticket?

Comment 11 Wei Feinstein 2018-09-04 11:13:20 MDT

No we will have to schedule an update to fix this issue in the near future. You can close this ticket and if after the upgrade we experience problems I will open a new ticket.  Thank you for your help.

Jackie

Comment 12 Marshall Garey 2018-09-04 11:17:24 MDT

Sounds good. Closing as resolved/duplicate of bug 5103

*** This ticket has been marked as a duplicate of ticket 5103 ***

Comment 13 Marshall Garey 2019-04-19 17:09:50 MDT

*** Ticket 6890 has been marked as a duplicate of this ticket. ***