| Summary: | Nodes indicated State=idle* | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ciaron Linstead <linstead> |
| Component: | slurmd | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 16.05.8 | Hardware: | Linux |
| OS: | Linux | Site: | PIK |
| Attachments: | slurmd log (excerpt) | | |

Description
Ciaron Linstead
2017-01-25 09:41:14 MST
Yes - Idle* indicates the node has checked in and is otherwise unoccupied, but hasn't responded to the most recent ping.

The one-minute restart of an otherwise idle slurmd does sound suspicious, and the snippets from the log you've posted don't look unusual, although I'd like to have a longer chunk of that if you get a chance.

It'd be nice to get a backtrace from one of these to see what's happening on that node before the restart:

    gdb -p <pid of slurmd>
    (gdb) thread apply all bt
    (gdb) detach

then go ahead and restart it. I suspect that'll quickly point us at the underlying issue.

Created attachment 3985
slurmd log (excerpt)
Attaching log file for an affected node, from startup to end of the day (after the problem was resolved via a slurmd restart)
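
The gdb steps suggested in this thread can also be run non-interactively, which may help when the backtrace has to be grabbed quickly before slurmd is restarted. The following is only a sketch under assumptions, not something posted in the ticket: the helper name `capture_backtrace` and the output path in the usage comment are hypothetical, and it assumes gdb is installed and the caller is allowed to attach to the process (typically root).

```shell
#!/bin/sh
# Sketch: dump a backtrace of every thread in a running process, then detach.
# capture_backtrace is a hypothetical helper name, not part of Slurm or gdb.

# Usage: capture_backtrace PID OUTFILE
capture_backtrace() {
    local pid="$1" outfile="$2"
    if [ -z "$pid" ] || [ -z "$outfile" ]; then
        echo "usage: capture_backtrace PID OUTFILE" >&2
        return 2
    fi
    # -batch runs the -ex command and then exits; gdb detaches from the
    # attached process on exit, mirroring the manual attach/bt/detach steps.
    gdb -p "$pid" -batch -ex "thread apply all bt" > "$outfile" 2>&1
}

# Example (not run automatically; the output path is an assumption):
#   capture_backtrace "$(pgrep -o -x slurmd)" /tmp/slurmd-backtrace.txt
```

The resulting file could then be attached to the ticket before restarting slurmd.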
Hi Tim

Thanks for the update. I've attached more of the slurmd log from an affected node. Thanks also for the gdb tip, I'll try this when we next see the issue.

Best regards
Ciaron

Hey Ciaron -

Has this recurred recently? I still don't have a good lead on the underlying cause, and was hoping a backtrace would be able to highlight the issue quickly. Although if the underlying problem has disappeared, that would be good to know as well.

- Tim

We haven't seen a repeat of this, so I think we can go ahead and close the ticket. I'll report back if we ever see it again. Thanks again.

Marking resolved/timedout to indicate we weren't able to get enough logs on this to catch the issue. Please reopen if this recurs, or file a new bug.

- Tim
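
For anyone wanting to notice a recurrence, one possible approach is to list nodes whose state carries the `*` suffix, which marks a node that has not responded to the most recent ping. This is a minimal sketch, not something from the ticket: the helper name `list_unresponsive` is hypothetical, and it assumes `sinfo` is run with a format that prints the node name followed by its state.

```shell
#!/bin/sh
# Sketch: print the names of nodes whose state ends in "*" (not responding).
# list_unresponsive is a hypothetical helper name, not a Slurm command.

# Reads "NODENAME STATE" lines on stdin, one node per line.
list_unresponsive() {
    # Keep lines whose second field ends in "*"; sort -u drops the duplicates
    # that appear when a node belongs to more than one partition.
    awk '$2 ~ /\*$/ { print $1 }' | sort -u
}

# Example (not run automatically):
#   sinfo -N -h -o "%N %T" | list_unresponsive
```

An empty result means no node is currently flagged as non-responding.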