Summary: | Slurmctld - crashes - slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed. | ||
---|---|---|---|
Product: | Slurm | Reporter: | Damien <damien.leong> |
Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | bart |
Version: | 18.08.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Monash University | ||
Attachments: |
slurm core dump
slurm core dump
workaround initial crash
slurmctld log |
Description
Damien
2019-10-28 02:13:25 MDT
Created attachment 12106 [details]
slurm core dump
Created attachment 12107 [details]
slurm core dump
Created attachment 12108 [details]
workaround initial crash

Hi

Can you apply this and restart the slurmctld? This should move where the crash happens if nothing else.
This commit should prevent this issue in the future:
https://github.com/SchedMD/slurm/commit/4c48a84a6edb

Dominik

Thanks for your reply.

We are running v18.08.6. Your patch is for v19.05... Can it work? Is the patch for this file only? src/common/gres.c

Hi

You need to apply attachment 12108 [details] and restart slurmctld.
Commit https://github.com/SchedMD/slurm/commit/4c48a84a6edb is included in 18.08.8.

Dominik

Created attachment 12111 [details]
slurmctld log
Hi

Is slurmctld still segfaulting? Was this log taken after applying the patch?

Dominik

Thanks, the patch seems to work. Our slurmctld is back...

Hi

I'm glad to hear that the slurmctld is working. Can we lower the severity of this ticket to 3?
Is there any reason why you use 18.08.6 rather than 18.08.8?

Dominik

Hi Dominik

Yes, please.

We are planning to upgrade to v19.05.X soon, but I am worried about existing users' scripts that use "--gres=gpu:V100:1". v19.05.X doesn't have GPU info in gres.conf anymore, and everything is moving towards TRES.

Cheers

Damien

Hi

Syntax like "--gres=gpu:V100:1" is supported in 19.05 and we have no plan to remove it in the future.
Slurm still takes gres info from gres.conf. To enable AutoDetect, you need to set it explicitly in gres.conf.

Dominik

Hi

Is the situation still stable? Did I answer your concerns?
If you have more doubts, please let me know here or open a separate ticket.

Dominik

Hi Dominik

Thanks for your reply.

Our slurmctld is running with the mentioned patch.

Current plan:
1) Prepare v18.08.8, just in case.
2) Gather clarity for v19.05.x upgrades:
- compatibility issues
- any deprecated features or commands
- testing

This should be a separate ticket if needed.

Once again, thanks for your help.

Cheers

Damien

Hi

If you can create a new ticket, this would be the best option.

Dominik

Thanks, I will do that.

Cheers

Damien

Hi

Closing as duplicate of 6739, please reopen if you have further questions.

Dominik

*** This ticket has been marked as a duplicate of ticket 6739 ***
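
For reference, the gres.conf behaviour discussed in this thread could look roughly like the fragment below. This is only a sketch: the node names, GPU counts, and device paths are hypothetical and not taken from the reporter's cluster, and the commented-out `AutoDetect=nvml` line assumes a Slurm 19.05 build with NVML support.

```
# gres.conf (illustrative only; node names and device files are hypothetical)

# Explicit, typed GPU definition. Keeping a Type such as V100 here is what
# lets existing job scripts keep requesting "--gres=gpu:V100:1".
NodeName=gpu[01-02] Name=gpu Type=V100 File=/dev/nvidia[0-1]

# AutoDetect is off unless it is set explicitly (assumes NVML support).
# AutoDetect=nvml
```

With a definition like this in place, a request such as `sbatch --gres=gpu:V100:1 job.sh` (where `job.sh` is a placeholder script name) would keep the same form after an upgrade to 19.05.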