Ticket 4583

Summary: knl mode changes too impatient
Product: Slurm
Reporter: Doug Jacobsen <dmjacobsen>
Component: KNL
Assignee: Felip Moll <felip.moll>
Status: RESOLVED TIMEDOUT
Severity: 4 - Minor Issue
CC: felip.moll
Version: 17.11.2
Hardware: Cray XC
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4536
          https://bugs.schedmd.com/show_bug.cgi?id=4581
Site: NERSC

Description Doug Jacobsen 2018-01-05 19:07:53 MST
Hello,

A node on our test cluster is stuck in a completing state:

ctlnet1:~ # sinfo -p system
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
system     down      30:00      1  down* nid00042
system     down      30:00      1   comp nid00393
system     down      30:00      1  drain nid00021
system     down      30:00    145   idle nid00[023,028-030,032-041,043-054,056-063,200-203,208-255,384-392,394-435,440-447]
system     down      30:00      1   down nid00022
ctlnet1:~ #

nid00393 was apparently "rebooted" to get a mode change:

slurmctld log (grep for nid00393):
[2018-01-05T14:27:50.570] sched: _slurm_rpc_allocate_resources JobId=864627 NodeList=nid00393 usec=58346
[2018-01-05T14:27:56.611] power_save: pid 6939 reboot nodes nid00393 features quad,flat
[2018-01-05T14:27:56.611] sched: _slurm_rpc_allocate_resources JobId=864628 NodeList=nid00393 usec=60176
[2018-01-05T14:27:57.599] debug:  Still waiting for boot of node nid00393
[2018-01-05T14:27:58.683] update_node: node nid00393 state set to IDLE
[2018-01-05T14:28:23.038] Node nid00393 now responding


end of slurmd log for nid00393:
[2018-01-05T14:27:52.748] [864627.extern] done with job
[2018-01-05T14:27:57.776] job_container/cncu: no job for delete(864628)

no slurmstepd processes.  Node never rebooted.

boot-gerty:~ # ssh nid00393 uptime
 18:01pm  up  19:26,  0 users,  load average: 0.08, 0.03, 0.02
boot-gerty:~ #
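
For completeness, the absence of a reboot can also be checked mechanically. A generic Linux sketch (not Slurm-specific; the request timestamp is the one from the slurmctld log above) compares the node's last boot time against the time of the power_save reboot request:

```shell
# Compare the node's boot time (btime from /proc/stat) against the time
# of the "power_save: ... reboot nodes" request in the slurmctld log.
boot_epoch=$(awk '/^btime/ {print $2}' /proc/stat)   # last boot, epoch secs
req_epoch=$(date -d '2018-01-05 14:27:56' +%s)       # reboot request time
if [ "$boot_epoch" -lt "$req_epoch" ]; then
  echo "node did NOT reboot after the request"
else
  echo "node booted after the request"
fi
```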


The xtremoted log indicates that the mode change request was received, but not the node_reinit.  The node reinits for the reboot from bug 4581 were running at the time.  It seems odd that slurmctld would attempt to reboot the node, then suddenly decide it was responding.

Thanks,
Doug
Comment 2 Felip Moll 2018-01-15 04:37:02 MST
Hi Doug,

I would need the full slurmctld logs, since I need the capmc-related messages.

It seems to me that there could be some collision with bug 4581.

If a node_reinit was in progress while we were also trying to change the mode (two concurrent capmc_resume invocations), the return code of capmc could be -1. This would make _update_all_nodes exit without requesting a node restart, and then try to bring the nodes back online.

This would explain the behavior you observed. At the same time, the node stuck in the completing state could be because returning nodes to online mode sets the state to NODE_STATE_POWER_UP; here you could have hit bug 4536.
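
The suspected race can be sketched with a toy shell model (nothing here is real capmc or Slurm code; the lock directory just stands in for the node_reinit that bug 4581 already had in flight):

```shell
# Toy model of the suspected collision: two concurrent "resume" requests
# for the same node, where the second fails because a reinit is already
# in progress (the hypothesized rc=-1 from capmc).
lock="/tmp/nid00393.reinit.$$"

resume() {
  if ! mkdir "$lock" 2>/dev/null; then
    echo "capmc_resume: reinit already in progress, rc=-1"
    return 1
  fi
  echo "capmc_resume: reboot requested"
}

resume                      # reinit from bug 4581: succeeds
resume || echo "_update_all_nodes gives up and just marks the node online"

rmdir "$lock"
```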

On the other hand, in bug 4581 you mention that nodes should come up automatically. That works for me in my environment, so the two issues may be related.

Please send me the full logs when you can.
Comment 4 Felip Moll 2018-01-23 08:16:51 MST
Hi Doug,

Have you experienced this situation again?

I still think it was due to two capmc_resume invocations and bug 4536, as I noted in my last comment. The log file would let me confirm that.

Thank you
Comment 5 Felip Moll 2018-01-26 02:40:13 MST
Doug,

It's been 20 days since the last comment, so I am resolving this as timed out.

If it is still happening, or you need us to work more on this, just reopen the bug.

Thanks for your understanding