| Summary: | slurmctld nonresponsive from RPCs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Zach Weidner <zweidner> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart, colbykd |
| Version: | 19.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Purdue | Slinky Site: | --- |
| Attachments: | slurmctld logs, slurm.conf | ||
|
Description
Zach Weidner
2020-11-08 23:49:20 MST
Created attachment 16560 [details]
slurm.conf
Zach Weidner is a sysadmin on our team here at Purdue who has not posted here before. Please let me know if there is anything official required to list him with our staff in your systems. Thanks!

Hi,

You can try attaching gdb to the slurmctld process and nulling one of the elements in argv, e.g., in the _pack_default_job_details() frame:

set detail_ptr->argv[i+10]=0
set detail_ptr->argv[10]=0

Dominik

Thanks Dominik, that appears to have done the trick after a couple of rounds. My fuzzy recollection from last night is that pack_all_jobs() would need to run after a certain period of time had elapsed; if so, it makes sense that we needed to do it a couple of times. Either way, slurmctld appears to have caught up and is serving requests normally for us now.

Out of curiosity, how long does it take for a job like this to rotate out of the slurmctld saved state, and is there anything we can do to protect ourselves from a situation like this in the future?

Hi,

This should be permanent and destructive for the modified jobs. The best way to protect against this issue is to update to the current 20.02 version. Another option is to update to 19.05.7 and apply https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87 locally.

Dominik

Excellent... just noticed further down the thread in 8978 that you said exactly that, sorry. We're scheduling updates for our next regular downtime, and we can close this issue.

Thanks again,
Zach

Hi,

I'm glad this helped. Let us know if there's anything else we can do to help.

Dominik

*** This ticket has been marked as a duplicate of ticket 8978 *** |
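For readers wondering why nulling a single argv element in gdb shortens what gets packed, here is a minimal standalone sketch. It assumes (this is not confirmed anywhere in this ticket) that the packing code walks argv until the first NULL pointer. The helpers joined_argv_len() and truncate_argv() are hypothetical names for illustration only and are not Slurm functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Slurm code: number of bytes needed to join a
 * NULL-terminated argv into one string, counting one separator (or the
 * terminating NUL) after each argument. */
static size_t joined_argv_len(char *const *argv)
{
    size_t len = 0;

    for (size_t i = 0; argv[i] != NULL; i++)
        len += strlen(argv[i]) + 1;
    return len;
}

/* Hypothetical helper: overwrite argv[keep] with NULL so that any later
 * walk of the array stops after the first `keep` arguments. The caller
 * must ensure `keep` is a valid index into the array. The dropped
 * strings are not freed here, which mirrors poking the pointer from a
 * debugger. */
static void truncate_argv(char **argv, size_t keep)
{
    assert(argv != NULL);
    argv[keep] = NULL;
}
```

With an array such as {"a.out", "one", "two", "three", NULL}, calling truncate_argv(argv, 2) leaves only the first two arguments visible to joined_argv_len(). The arguments past the new terminator are lost for good, which is consistent with Dominik's warning that the change is permanent and destructive for the modified jobs.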