Today we had 5 nodes in the "idle*" state - this means "idle and not responding", correct?

Jobs could be run on these nodes (tested with --nodelist in the sbatch command).

We saw this in the slurmctld log, repeated many times:

[2017-01-25T16:17:55.698] Node cs-f14c03b06 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b08 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b10 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b05 now responding
[2017-01-25T16:17:55.698] Node cs-f14c03b01 now responding
[2017-01-25T16:17:56.683] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased  # this is currently 300sec
[2017-01-25T16:18:39.767] error: Nodes cs-f14c03b[01,05-06,08,10] not responding
[2017-01-25T16:19:35.904] Node cs-f14c03b10 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b08 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b05 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b06 now responding
[2017-01-25T16:19:35.904] Node cs-f14c03b01 now responding

slurmd log for one of the affected nodes:

A) at "now responding":

[2017-01-25T16:17:48.697] [1588815.0] debug:  jag_common_poll_data: Task average frequency = 3211541 pid 22680 mem size 102532 827336 time 69.710000(69+0)
[2017-01-25T16:17:48.697] [1588815.0] debug2: energycounted = 1
[2017-01-25T16:17:48.697] [1588815.0] debug:  Step 1588815.0 memory used:1638908 limit:58720256 KB
[2017-01-25T16:17:55.683] debug3: in the service_connection
[2017-01-25T16:17:55.684] debug2: got this type of message 1008
[2017-01-25T16:17:55.684] [1588815.0] debug3: Called _msg_socket_accept
[2017-01-25T16:17:55.684] [1588815.0] debug3: Leaving _msg_socket_accept
[2017-01-25T16:17:55.684] [1588815.0] debug3: Called _msg_socket_readable
[2017-01-25T16:17:55.684] [1588815.0] debug3: Entering _handle_accept (new thread)
[2017-01-25T16:17:55.684] [1588815.0] debug3: Identity: uid=0, gid=0
[2017-01-25T16:17:55.684] [1588815.0] debug3: Entering _handle_request
[2017-01-25T16:17:55.684] debug:  Entering stepd_stat_jobacct for job 1588815.0

B) at other times, just the normal debug messages:

[2017-01-25T16:18:38.699] [1588815.0] debug2: cpu 0 freq= 3299875
[2017-01-25T16:18:38.699] [1588815.0] debug:  jag_common_poll_data: Task average frequency = 3180631 pid 22680 mem size 102532 827336 time 119.700000(119+0)
[2017-01-25T16:18:38.700] [1588815.0] debug2: energycounted = 1
[2017-01-25T16:18:38.700] [1588815.0] debug:  Step 1588815.0 memory used:1638928 limit:58720256 KB
[2017-01-25T16:18:48.698] [1588815.0] debug2: profile signalling type Task
[2017-01-25T16:18:48.699] [1588815.0] debug2: jag_common_poll_data: 22695 mem size 102564 827340 time 129.880000(129+0)
[2017-01-25T16:18:48.699] [1588815.0] debug2: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu10/cpufreq/cpuinfo_cur_freq
[2017-01-25T16:18:48.699] [1588815.0] debug2: cpu 10 freq= 3299875

"service slurmd restart" on these nodes took more than 1 minute to return, but afterward the nodes returned to the "idle" or "allocated" state (without the '*').

Do you have any ideas where we should look for more information? We'd like to track down why these nodes were marked as "not responding".

Thanks, and best regards
Ciaron
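As a side note, to see how long each node was actually unreachable, the "not responding" / "now responding" pairs in the slurmctld log can be correlated per node. The sketch below is a minimal, hypothetical parser based only on the log format visible in the excerpt above (the hostlist expansion is a simplified stand-in; in practice "scontrol show hostnames" does this properly):

```python
#!/usr/bin/env python3
"""Sketch: pair "not responding" / "now responding" slurmctld log events
per node and report how long each node was unreachable. Log format is
assumed from the excerpt in this ticket and may differ across versions."""
import re
from datetime import datetime

TS_FMT = "%Y-%m-%dT%H:%M:%S.%f"
NOT_RESP = re.compile(r"\[(?P<ts>[^\]]+)\] error: Nodes (?P<nodes>\S+) not responding")
NOW_RESP = re.compile(r"\[(?P<ts>[^\]]+)\] Node (?P<node>\S+) now responding")

def expand_hostlist(expr):
    """Very simplified expansion of a Slurm hostlist expression such as
    cs-f14c03b[01,05-06,08,10]; real tooling should use
    `scontrol show hostnames` instead."""
    m = re.match(r"(.+)\[(.+)\]$", expr)
    if not m:
        return [expr]
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. "01"
            names += [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]
        else:
            names.append(prefix + part)
    return names

def outages(lines):
    """Yield (node, down_since, seconds_down) for each recovery event."""
    down = {}  # node -> datetime it was first declared not responding
    for line in lines:
        m = NOT_RESP.search(line)
        if m:
            t = datetime.strptime(m["ts"], TS_FMT)
            for node in expand_hostlist(m["nodes"]):
                down.setdefault(node, t)
        m = NOW_RESP.search(line)
        if m and m["node"] in down:
            t0 = down.pop(m["node"])
            t1 = datetime.strptime(m["ts"], TS_FMT)
            yield m["node"], t0, (t1 - t0).total_seconds()
```

Feeding it the two relevant lines from the excerpt above would report cs-f14c03b10 as unreachable for roughly 56 seconds (16:18:39.767 to 16:19:35.904), which is comfortably under the 300-second SlurmdTimeout, consistent with the nodes flapping rather than being marked DOWN.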
Yes - Idle* indicates the node has checked in and is otherwise unoccupied, but hasn't responded to the most recent ping. The one-minute restart of an otherwise idle slurmd does sound suspicious, and the snippets from the log you've posted don't look unusual, although I'd like to see a longer chunk of that log if you get a chance.

It'd be nice to get a backtrace from one of these nodes to see what's happening before the restart:

gdb -p <pid of slurmd>
(gdb) thread apply all bt
(gdb) detach

then go ahead and restart it. I suspect that'll quickly point us at the underlying issue.
Created attachment 3985 [details]
slurmd log (excerpt)

Attaching the log file for an affected node, from startup to the end of the day (after the problem was resolved via slurmd restart).
Hi Tim

Thanks for the update. I've attached more of the slurmd log from an affected node. Thanks also for the gdb tip; I'll try this the next time we see the issue.

Best regards
Ciaron
Hey Ciaron -

Has this recurred recently? I still don't have a good lead on the underlying cause, and was hoping a backtrace would quickly highlight the issue. If the underlying problem has disappeared, that would also be good to know.

- Tim
We haven't seen a repeat of this, so I think we can go ahead and close the ticket. I'll report back if we ever see it again. Thanks again.
Marking resolved/timedout to indicate we weren't able to get enough logs on this to catch the issue. Please reopen if this recurs, or file a new bug.

- Tim