| Summary: | knl mode changes too impatient | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | KNL | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | felip.moll |
| Version: | 17.11.2 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4536 https://bugs.schedmd.com/show_bug.cgi?id=4581 | ||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Doug Jacobsen
2018-01-05 19:07:53 MST
Hi Doug,

I need the full slurmctld logs, since I need the capmc strings.

It seems to me that there could be a collision with bug 4581. If a node_reinit was in progress while we were trying to change the mode (two capmc_resume invocations), the return code of capmc could be -1. That would make _update_all_nodes exit without requesting a node restart and then try to put the nodes online, which would cause the behavior you were seeing.

At the same time, the stuck comp state could be explained by the fact that returning nodes to online mode is done by setting their state to NODE_STATE_POWER_UP; here you could have hit bug 4536.

On the other hand, in 4581 you comment that nodes should come up automatically. This happens for me in my environment, so it may be related to this issue.

Please try to send me the full logs.

Hi Doug,

Have you experienced this situation again? I still think that it was due to two capmc_resume invocations and bug 4536, as I noted in my last comment. The log file could give me confirmation.

Thank you

Doug,

It's been 20 days since the last comment, so I am also resolving this as timed out. As you know, if it is still happening or you need us to work more on this, just reopen the bug.

Thanks for your understanding |
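The suspected failure path described in the first comment can be illustrated with a toy model. This is only a sketch of the hypothesis, not Slurm source code: `Node`, `run_capmc`, `update_all_nodes`, and the `capmc_busy` flag are hypothetical names invented for illustration, and the real `_update_all_nodes` may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a KNL node record (not a Slurm structure)."""
    name: str
    rebooted: bool = False       # True once a restart has been requested
    state: str = "POWER_DOWN"    # simplified state label


def run_capmc(capmc_busy: bool) -> int:
    """Stand-in for a capmc call; assumed to return -1 if another request is in flight."""
    return -1 if capmc_busy else 0


def update_all_nodes(nodes: list, capmc_busy: bool) -> int:
    """Toy model of the suspected flow: a failed capmc call skips the restart,
    but the nodes are still moved to the power-up state afterwards."""
    rc = run_capmc(capmc_busy)
    if rc == 0:
        for node in nodes:
            node.rebooted = True  # restart requested only when capmc succeeded
    # Suspected problem: this step runs regardless of the capmc return code.
    for node in nodes:
        node.state = "POWER_UP"
    return rc


# Two overlapping requests: the second one sees capmc as busy.
nodes = [Node("nid00001"), Node("nid00002")]
rc = update_all_nodes(nodes, capmc_busy=True)
for node in nodes:
    print(node.name, node.state, "rebooted" if node.rebooted else "not rebooted", rc)
```

In this model, a node ends up marked as powering up without ever having been restarted into the new mode, which is the symptom the comment attributes to the two overlapping capmc_resume invocations. Only the full slurmctld logs could confirm whether this is what actually happened.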