Hi Slurm Support

We are using Slurm 18.08.6. Our slurmctld crashed, leaving multiple core dumps, and we couldn't bring it back up even after multiple restarts.

./slurmctld -D
slurmctld: debug: Log file re-opened
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: slurmctld version 18.08.6-2 started on cluster m3
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug: init: Gres GPU plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 20
slurmctld: preempt/qos loaded
slurmctld: debug: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug: AcctGatherEnergy NONE plugin loaded
slurmctld: debug: AcctGatherProfile NONE plugin loaded
slurmctld: debug: AcctGatherInterconnect NONE plugin loaded
slurmctld: debug: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug: Job accounting gather cgroup plugin loaded
slurmctld: job_submit.lua: initialized
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug: switch NONE plugin loaded
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug: Requesting control from backup controller m3-mgmt1
.....
.....
.....
slurmctld: gres_per_node:1 node_cnt:0
slurmctld: Recovered JobId=11534397 Assoc=2472
slurmctld: recovered JobId=11528146_11(11538958) StepId=Extern
slurmctld: Recovered JobId=11528146_11(11538958) Assoc=361
slurmctld: gres:gpu(7696487) type:P4(13392) job:11538967 state
slurmctld: gres_per_node:1 node_cnt:1
slurmctld: gres_bit_step_alloc:NULL
slurmctld: gres_bit_alloc[0]:1
slurmctld: gres_cnt_step_alloc[0]:0
slurmctld: recovered JobId=11538967 StepId=Extern
slurmctld: Recovered JobId=11538967 Assoc=3026
slurmctld: recovered JobId=11539461 StepId=Extern
slurmctld: Recovered JobId=11539461 Assoc=2220
slurmctld: recovered JobId=11539464 StepId=Extern
slurmctld: Recovered JobId=11539464 Assoc=2220
slurmctld: recovered JobId=11539468 StepId=Extern
slurmctld: Recovered JobId=11539468 Assoc=2220
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Aborted

Investigation leads to this:

# gdb /opt/slurm-18.08.6-2/sbin/slurmctld core.17463
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm-18.08.6-2/sbin/slurmctld...done.
[New LWP 17463]
[New LWP 17464]
[New LWP 17470]
[New LWP 17466]
[New LWP 17468]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm-18.08.6-2/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007ff67b1b6207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.3.x86_64 lua-5.1.4-15.el7.x86_64 sssd-client-1.16.2-13.el7_6.5.x86_64
(gdb) where
#0  0x00007ff67b1b6207 in raise () from /lib64/libc.so.6
#1  0x00007ff67b1b78f8 in abort () from /lib64/libc.so.6
#2  0x00007ff67b1af026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ff67b1af0d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ff67b9d2291 in bit_nclear (b=b@entry=0x18f7130, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007ff67b9d4790 in bit_unfmt_hexmask (bitmap=0x18f7130, str=<optimized out>) at bitstring.c:1397
#6  0x00007ff67b9ebbf5 in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffeb93ae7e0, buffer=buffer@entry=0x17518c0, job_id=11539588, protocol_version=protocol_version@entry=8448) at gres.c:4318
#7  0x000000000045b77e in _load_job_state (buffer=buffer@entry=0x17518c0, protocol_version=<optimized out>) at job_mgr.c:1519
#8  0x000000000045f21c in load_all_job_state () at job_mgr.c:988
#9  0x0000000000499583 in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1326
#10 0x000000000042b172 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:663
(gdb)

### The problematic job seems to be JobId=11539588

Further investigation:

# cd job.11539588/
[root@m3-mgmt2 job.11539588]# ls
environment  script
[root@m3-mgmt2 job.11539588]# cat script
#!/bin/bash
#SBATCH --job-name=labeling
#SBATCH --account=pd87
#SBATCH --time=10:00:00
#SBATCH --ntasks=1
#SBATCH --mem=64000
#SBATCH --cpus-per-task=26
#SBATCH --gpus-per-task=1
#SBATCH --partition=m3g
source activate MLenv
python labeling.py

### --gpus-per-task=1 is only available in v19.05, right? Could it have caused problems for our slurmctld v18.08.6?

Since our slurmctld is down, how can we recover from this? I have gone through most of our compute nodes, and most jobs are still running. slurmctld, sinfo, and squeue are dead. I don't want to scancel the other running jobs.

Kindly advise and help.

Thanks
Damien
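Our reading of the backtrace, which may be wrong: the "gres_per_node:1 node_cnt:0" lines in the recovery log suggest that one job's saved GRES state records an allocation spanning zero nodes, so gres_plugin_job_state_unpack() hands bit_unfmt_hexmask() a zero-bit bitmap, which then calls bit_nclear(bitmap, 0, -1) and trips the `(start) < ((b)[1])' assertion. A minimal standalone sketch of that failure mode (illustrative C only, not Slurm's actual bitstring.c; the layout here is simplified):

    /*
     * Sketch of the crash as we understand it. In Slurm's bitstr_t layout,
     * word 0 holds a magic value and word 1 holds the bit count, which is
     * why the assertion reads `(start) < ((b)[1])'.
     */
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef int64_t bitstr_t;

    static bitstr_t *bit_alloc(int64_t nbits)
    {
        /* two header words plus enough 64-bit words for the data bits */
        bitstr_t *b = calloc(2 + (nbits + 63) / 64, sizeof(bitstr_t));
        b[0] = 0x42;        /* magic (placeholder value) */
        b[1] = nbits;       /* number of valid bits */
        return b;
    }

    static int64_t bit_size(bitstr_t *b)
    {
        return b[1];
    }

    /* clear bits start..stop inclusive, like bitstring.c:292 */
    static void bit_nclear(bitstr_t *b, int64_t start, int64_t stop)
    {
        assert(start < b[1]);   /* fails here: start=0, b[1]=0 */
        (void) stop;            /* actual bit clearing elided */
    }

    int main(void)
    {
        bitstr_t *bitmap = bit_alloc(0);  /* node_cnt:0 => zero-bit bitmap */
        /* bit_unfmt_hexmask() clears 0 .. size-1, i.e. 0 .. -1, and aborts */
        bit_nclear(bitmap, 0, bit_size(bitmap) - 1);
        free(bitmap);
        return 0;
    }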
Created attachment 12106 [details]
slurm core dump
Created attachment 12107 [details]
slurm core dump
Created attachment 12108 [details]
workaround initial crash

Hi

Can you apply this and restart the slurmctld? This should move where the crash happens, if nothing else. This commit should prevent this issue in the future: https://github.com/SchedMD/slurm/commit/4c48a84a6edb

Dominik
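For reference, the idea of such a workaround is to skip restoring a GRES node bitmap that cannot be valid instead of asserting on it. Very roughly, it might take a shape like the following (all names and structure below are illustrative; the actual change is in the attachment and the commit above, not this sketch):

    /* Hypothetical shape of the guard -- NOT the real patch. Only unpack
     * the hex bitmap when the saved job state actually spans nodes;
     * otherwise log and skip, so state recovery can continue. */
    if (bit_alloc_hex && bit_alloc_hex[0] && node_cnt > 0) {
        gres_bit_alloc = bit_alloc(node_cnt);
        bit_unfmt_hexmask(gres_bit_alloc, bit_alloc_hex);
    } else {
        error("JobId=%u has an unusable GRES bitmap, ignoring it", job_id);
    }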
Thanks for your reply.

We are running v18.08.6, but your patch is for v19.05... Can it work?
Is the patch for this file only? src/common/gres.c
Hi

You need to apply attachment 12108 [details] and restart slurmctld. Commit https://github.com/SchedMD/slurm/commit/4c48a84a6edb is included in 18.08.8.

Dominik
Created attachment 12111 [details]
slurmctld log
Hi

Is slurmctld still segfaulting? Was this log taken after applying the patch?

Dominik
Thanks, the patch seems to work. Our slurmctld is back...
Hi

I'm glad to hear that slurmctld is working. Can we lower the severity of this ticket to 3? Is there any reason why you use 18.08.6 rather than 18.08.8?

Dominik
Hi Dominik

Yes, please.

We are planning to upgrade to v19.05.x soon, but I am worried about existing users' scripts with "--gres=gpu:V100:1": v19.05.x no longer seems to keep GPU info in gres.conf, and everything is moving towards TRES.

Cheers
Damien
Hi

Syntax like "--gres=gpu:V100:1" is supported in 19.05, and we have no plan to remove it in the future. Slurm still takes GRES info from gres.conf. To enable AutoDetect you need to set it explicitly in gres.conf.

Dominik
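For example, a gres.conf along these lines keeps working unchanged in 19.05 (the node names, device paths, and GPU type below are made up for illustration):

    # Static GRES definition, same style as in 18.08:
    NodeName=m3g[001-010] Name=gpu Type=V100 File=/dev/nvidia[0-1]

    # 19.05 can instead detect GPUs through NVML, but only if set explicitly:
    #AutoDetect=nvml

With that in place, "--gres=gpu:V100:1" keeps requesting one GPU of type V100, exactly as before.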
Hi

Is the situation still stable? Did I answer your concerns? If you have more doubts, please let me know here or open a separate ticket.

Dominik
Hi Dominik

Thanks for your reply. Our slurmctld is running with the mentioned patch applied.

Current plan:
1) Prepare v18.08.8, just in case...
2) Gather clarity for the v19.05.x upgrade:
   - compatibility issues
   - any deprecated features or commands
   - testing

This should be a separate ticket if needed. Once again, thanks for your help.

Cheers
Damien
Hi

If you can create a new ticket, that would be the best option.

Dominik
Thanks, I will do that.

Cheers
Damien
Hi

Closing as a duplicate of ticket 6739; please reopen if you have further questions.

Dominik

*** This ticket has been marked as a duplicate of ticket 6739 ***