Hello,

A node on our test cluster is stuck in a completing state:

ctlnet1:~ # sinfo -p system
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
system    down  30:00         1 down* nid00042
system    down  30:00         1 comp  nid00393
system    down  30:00         1 drain nid00021
system    down  30:00       145 idle  nid00[023,028-030,032-041,043-054,056-063,200-203,208-255,384-392,394-435,440-447]
system    down  30:00         1 down  nid00022
ctlnet1:~ #

nid00393 was apparently "rebooted" to get a mode change:

slurmctld log (grep for nid00393):
[2018-01-05T14:27:50.570] sched: _slurm_rpc_allocate_resources JobId=864627 NodeList=nid00393 usec=58346
[2018-01-05T14:27:56.611] power_save: pid 6939 reboot nodes nid00393 features quad,flat
[2018-01-05T14:27:56.611] sched: _slurm_rpc_allocate_resources JobId=864628 NodeList=nid00393 usec=60176
[2018-01-05T14:27:57.599] debug: Still waiting for boot of node nid00393
[2018-01-05T14:27:58.683] update_node: node nid00393 state set to IDLE
[2018-01-05T14:28:23.038] Node nid00393 now responding

end of slurmd log for nid00393:
[2018-01-05T14:27:52.748] [864627.extern] done with job
[2018-01-05T14:27:57.776] job_container/cncu: no job for delete(864628)

no slurmstepd processes. Node never rebooted.

boot-gerty:~ # ssh nid00393 uptime
18:01pm up 19:26, 0 users, load average: 0.08, 0.03, 0.02
boot-gerty:~ #

The xtremoted log indicates that the mode change request was received, but not the node_reinit. The node reinits for the reboot from bug 4581 were running during this time.

It seems odd that it would attempt to reboot the node, then suddenly decide it was responding.

Thanks,
Doug
Hi Doug,

I need the full slurmctld logs, since I need to see the capmc strings.

It seems to me that there could be a collision with bug 4581. If a node_reinit was in progress while we were trying to change the mode (two capmc_resume invocations), the return code of capmc could be -1. That would make _update_all_nodes exit without requesting a node restart and then try to put the nodes online, which would cause the behavior you are seeing.

The node being stuck in the comp state could be explained by the fact that returning nodes to online is done by setting the state to NODE_STATE_POWER_UP; here you could have hit bug 4536.

On the other hand, in bug 4581 you commented that nodes should come up automatically. That is what happens in my environment, so it may be related to this issue.

Please try to send me the full logs.
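If the full log is too large to attach, a filtered extract may still help. The following is a minimal sketch in Python, not part of Slurm: the function name lines_for_node and the default keyword "capmc" are illustrative assumptions, and it only matches the node name when it is written out literally (it does not expand hostlist ranges such as nid00[384-392]). The sample lines are taken from the logs quoted in this report.

```python
import re


def lines_for_node(log_lines, node, extra_keywords=("capmc",)):
    """Return log lines that mention `node` literally or contain any keyword.

    Hypothetical helper for illustration only; hostlist ranges are not expanded.
    """
    # Word boundaries keep nid00393 from matching inside a longer token.
    node_re = re.compile(r"\b" + re.escape(node) + r"\b")
    picked = []
    for line in log_lines:
        lowered = line.lower()
        if node_re.search(line) or any(k in lowered for k in extra_keywords):
            picked.append(line.rstrip("\n"))
    return picked


# Sample lines copied from the slurmctld and slurmd excerpts in this report.
sample = [
    "[2018-01-05T14:27:56.611] power_save: pid 6939 reboot nodes nid00393 features quad,flat",
    "[2018-01-05T14:27:52.748] [864627.extern] done with job",
    "[2018-01-05T14:27:58.683] update_node: node nid00393 state set to IDLE",
]

for line in lines_for_node(sample, "nid00393"):
    print(line)
```

To run it against a real file, pass an open file object as log_lines; the path to slurmctld.log depends on the site configuration.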
Hi Doug, Have you experienced this situation again? I still think it was due to the two capmc_resume invocations and bug 4536, as I noted in my last comment. The log file would let me confirm this. Thank you
Doug, It has been 20 days since the last comment, so I am also resolving this as timed out. If it is still happening, or you need us to work more on this, just reopen the bug. Thanks for your understanding