| Summary: | slurmctld repeatedly segfaulting | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kaylea Nelson <kaylea.nelson> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart, eric.peskin, jay.kubeck, jbooth, tyler.trafford |
| Version: | 18.08.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Yale | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Attachments: |
backtrace of corefile from one of the segfaults
new backtrace after patch from commit mentioned in comment 3 |
||
|
Description
Kaylea Nelson
2019-06-18 09:37:48 MDT
Hi

Do you have a core file from this crash? If yes, can you generate a backtrace? E.g.:

gdb -ex 't a a bt' -batch slurmctld <corefile>

Dominik

Created attachment 10631
backtrace of corefile from one of the segfaults

Hi

I think this is a duplicate of bug 6837. Check bug 6837 comment 5.
We added this fix, which prevents such a situation:
https://github.com/SchedMD/slurm/commit/70d12f070908c33

Dominik

I have installed patched RPMs on our controller node (still at 18.08.5), but we are still getting segfaults with nearly identical backtrace info.

-Tyler

Created attachment 10634
new backtrace after patch from commit mentioned in comment 3

Hi

Did you apply the patch from bug 6837 comment 5 or 70d12f070908c33?
To clear your slurmctld state you need to apply the patch from bug 6837 comment 5.

Could you also send me the output from:

gdb slurmctld <corefile>
t 1
f 0
p job_resrcs_ptr
p *job_resrcs_ptr

Dominik

Ah, I only applied the one from the GitHub commit. I assumed they were the same. I'm rebuilding with the patch from bug 6837 comment 5 now. I'll reply back when I have that installed.

-Tyler

Right, that seems to have fixed things like you predicted. Would you still like that last gdb info? We are back online now. Thank you very much for your help.

-Tyler

Tyler - I am dropping this down to a sev 3 since you are back online. Dominik's remarks about the gdb info are not needed anymore. Those were just to verify that the patches had been applied correctly.

Hi

Glad to hear that all is back to normal. With 70d12f070908c33 this bug shouldn't occur anymore.
Let me know if you have any additional questions/problems; otherwise I will close this bug.

Dominik

Hi

I'm going to go ahead and close the bug. If you have any questions, feel free to reopen the bug.

Dominik

*** This ticket has been marked as a duplicate of ticket 6837 *** |
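
The second gdb request in the thread (`t 1`, `f 0`, `p job_resrcs_ptr`, `p *job_resrcs_ptr`) is an interactive session. It can probably be run non-interactively in the same style as the first backtrace command, by passing each gdb command through its own `-ex` option. The sketch below only prints the resulting command line so it can be reviewed before running. The helper name `gdb_inspect_cmd` and the example paths are assumptions for illustration and do not come from the ticket.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or this ticket): print a batch-mode
# gdb command line that inspects job_resrcs_ptr in frame 0 of thread 1.
# Arguments: $1 = path to the slurmctld binary, $2 = path to the core file.
gdb_inspect_cmd() {
    binary=$1
    core=$2
    # Each -ex runs one gdb command, in order; -batch makes gdb exit afterwards.
    # thread/frame/print are the long forms of t/f/p used in the thread.
    printf '%s\n' "gdb -batch -ex 'thread 1' -ex 'frame 0' -ex 'print job_resrcs_ptr' -ex 'print *job_resrcs_ptr' $binary $core"
}

# Placeholder paths (assumed); substitute the real binary and core file.
gdb_inspect_cmd /usr/sbin/slurmctld /tmp/core.slurmctld
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without a core file; once the paths are correct, the printed line can be pasted into a shell on the controller node.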