Ticket 3425

Summary: Nodes indicated State=idle*
Product: Slurm Reporter: Ciaron Linstead <linstead>
Component: slurmd    Assignee: Tim Wickberg <tim>
Status: RESOLVED TIMEDOUT
Severity: 4 - Minor Issue    
Priority: ---    
Version: 16.05.8   
Hardware: Linux   
OS: Linux   
Site: PIK
Attachments: slurmd log (excerpt)

Description Ciaron Linstead 2017-01-25 09:41:14 MST
Today we had 5 nodes in the "idle*" state - this means "idle and not responding", correct? Jobs could still be run on these nodes (tested with --nodelist in an sbatch command).

We saw this in the slurmctld log, repeated many times:

[2017-01-25T16:17:55.698] Node cs-f14c03b06 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b08 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b10 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b05 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b01 now responding
[2017-01-25T16:17:56.683] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased # this is currently 300sec
[2017-01-25T16:18:39.767] error: Nodes cs-f14c03b[01,05-06,08,10] not responding
[2017-01-25T16:19:35.904] Node cs-f14c03b10 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b08 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b05 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b06 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b01 now responding

slurmd log for one of the affected nodes:

A) at "now responding"

[2017-01-25T16:17:48.697] [1588815.0] debug:  jag_common_poll_data: Task average frequency = 3211541 pid 22680 mem size 102532 827336 time 69.710000(69+0)
[2017-01-25T16:17:48.697] [1588815.0] debug2: energycounted = 1
[2017-01-25T16:17:48.697] [1588815.0] debug:  Step 1588815.0 memory used:1638908 limit:58720256 KB
[2017-01-25T16:17:55.683] debug3: in the service_connection
[2017-01-25T16:17:55.684] debug2: got this type of message 1008
[2017-01-25T16:17:55.684] [1588815.0] debug3: Called _msg_socket_accept
[2017-01-25T16:17:55.684] [1588815.0] debug3: Leaving _msg_socket_accept
[2017-01-25T16:17:55.684] [1588815.0] debug3: Called _msg_socket_readable
[2017-01-25T16:17:55.684] [1588815.0] debug3: Entering _handle_accept (new thread)
[2017-01-25T16:17:55.684] [1588815.0] debug3:   Identity: uid=0, gid=0
[2017-01-25T16:17:55.684] [1588815.0] debug3: Entering _handle_request
[2017-01-25T16:17:55.684] debug:  Entering stepd_stat_jobacct for job 1588815.0


B) at other times, just the normal periodic debug messages:

[2017-01-25T16:18:38.699] [1588815.0] debug2:  cpu 0 freq= 3299875
[2017-01-25T16:18:38.699] [1588815.0] debug:  jag_common_poll_data: Task average frequency = 3180631 pid 22680 mem size 102532 827336 time 119.700000(119+0)
[2017-01-25T16:18:38.700] [1588815.0] debug2: energycounted = 1
[2017-01-25T16:18:38.700] [1588815.0] debug:  Step 1588815.0 memory used:1638928 limit:58720256 KB
[2017-01-25T16:18:48.698] [1588815.0] debug2: profile signalling type Task
[2017-01-25T16:18:48.699] [1588815.0] debug2: jag_common_poll_data: 22695 mem size 102564 827340 time 129.880000(129+0)
[2017-01-25T16:18:48.699] [1588815.0] debug2: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu10/cpufreq/cpuinfo_cur_freq 
[2017-01-25T16:18:48.699] [1588815.0] debug2:  cpu 10 freq= 3299875

"service slurmd restart" on these nodes took more than 1 minute to return, but afterward nodes return to "Idle" or "Allocated" state (without '*')

Do you have any ideas where we should look for more information? We'd like to track down why these nodes were marked as "not responding".

Thanks, and best regards

Ciaron
Comment 1 Tim Wickberg 2017-01-25 10:57:18 MST
Yes - Idle* indicates the node has checked in and is otherwise unoccupied, but hasn't responded to the most recent ping.

The one-minute restart of an otherwise idle slurmd does sound suspicious. The log snippets you've posted don't look unusual, although I'd like to have a longer chunk of that if you get a chance. It'd also be nice to get a backtrace from one of these to see what's happening on that node before the restart:

gdb -p <pid of slurmd>
(gdb) thread apply all bt
(gdb) detach

then go ahead and restart it. I suspect that'll quickly point us at the underlying issue.
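For reference, the SlurmdTimeout mentioned in the controller's warning is set in slurm.conf. A minimal fragment (300 seconds matches the value noted in your log; this is an illustration, not a tuning recommendation):

```
# slurm.conf (fragment) -- illustrative values only
SlurmdTimeout=300    # seconds without a slurmd response before the node is set DOWN
```

Raising the timeout only masks slow ping responses; the backtrace should show why slurmd is stalling in the first place.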
Comment 2 Ciaron Linstead 2017-01-26 03:14:22 MST
Created attachment 3985 [details]
slurmd log (excerpt)

Attaching the log file for an affected node, from startup to the end of the day (after the problem was resolved via a slurmd restart).
Comment 3 Ciaron Linstead 2017-01-26 03:15:42 MST
Hi Tim

Thanks for the update. I've attached more of the slurmd log from an affected node.

Thanks also for the gdb tip, I'll try this when we next see the issue.

Best regards

Ciaron
Comment 4 Tim Wickberg 2017-04-04 15:50:17 MDT
Hey Ciaron -

Has this recurred recently?

I still don't have a good lead on the underlying cause, and was hoping a backtrace would be able to highlight the issue quickly.

Although if the underlying problem has disappeared, that would be good to know as well.

- Tim
Comment 5 Ciaron Linstead 2017-04-05 01:58:20 MDT
We haven't seen a repeat of this, so I think we can go ahead and close the ticket. I'll report back if we ever see it again.

Thanks again.
Comment 6 Tim Wickberg 2017-04-18 21:50:08 MDT
Marking resolved/timedout to indicate we weren't able to get enough logs on this to catch the issue. Please reopen if this recurs, or file a new bug.

- Tim