Hi, this is a sev 1 issue. slurmctld crashed on Sherlock and we cannot restart it. We do have core dumps, I'll provide them ASAP. It seems to be this problem:

(gdb) bt
#0  0x00007f848135f207 in raise () from /lib64/libc.so.6
#1  0x00007f84813608f8 in abort () from /lib64/libc.so.6
#2  0x00007f8481358026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f84813580d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f8481b7d6c1 in bit_nclear (b=b@entry=0x7202880, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007f8481b7fc77 in bit_unfmt_hexmask (bitmap=0x7202880, str=<optimized out>) at bitstring.c:1397
#6  0x00007f8481b97f2d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffd30a59cd8, buffer=buffer@entry=0x32d6800, job_id=39458415, protocol_version=protocol_version@entry=8448) at gres.c:4318
#7  0x000000000045d079 in _load_job_state (buffer=buffer@entry=0x32d6800, protocol_version=<optimized out>) at job_mgr.c:1519
#8  0x0000000000460941 in load_all_job_state () at job_mgr.c:988
#9  0x000000000049c0cd in read_slurm_conf (recover=recover@entry=2, reconfig=reconfig@entry=false) at read_config.c:1334
#10 0x0000000000424e72 in run_backup (callbacks=callbacks@entry=0x7ffd30a5a860) at backup.c:257
#11 0x000000000042b985 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607

Thanks,
Stephane
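For context on frames #4 and #5: the abort comes from bit_nclear() being handed stop = -1 while the controller unpacks this job's saved GRES hex mask. Below is a minimal standalone sketch of that failure mode. It is not Slurm source; the toy_* names, the exact assertion wording, and the zero-size bitmap are assumptions made only to illustrate how an empty GRES index can turn into a "clear bits 0..-1" call.

/*
 * Minimal sketch of the suspected failure mode (NOT Slurm source; all
 * toy_* names and the assertion wording are hypothetical). A bitmap of
 * size zero makes "clear bits 0..nbits-1" become "clear bits 0..-1",
 * and the range assertion aborts the process, as at bitstring.c:292.
 */
#include <assert.h>
#include <stdlib.h>

struct toy_bitstr {
	int nbits;             /* number of bits in the map */
	unsigned char *bits;   /* backing storage, nbits/8 bytes rounded up */
};

static void toy_bit_nclear(struct toy_bitstr *b, int start, int stop)
{
	/* with stop == -1 this assertion fails and abort() is called */
	assert(start >= 0 && stop >= start && stop < b->nbits);
	for (int i = start; i <= stop; i++)
		b->bits[i / 8] &= (unsigned char) ~(1u << (i % 8));
}

static void toy_unfmt_hexmask(struct toy_bitstr *b, const char *hex)
{
	/* clear the whole map before setting bits from the hex string */
	toy_bit_nclear(b, 0, b->nbits - 1);   /* nbits == 0  ->  stop == -1 */
	(void) hex;  /* real code would now parse the hex digits into bits */
}

int main(void)
{
	/* a job whose GRES record carries no GPU indices at all */
	struct toy_bitstr empty = { .nbits = 0, .bits = NULL };
	toy_unfmt_hexmask(&empty, "");        /* aborts, mirroring the backtrace */
	return EXIT_SUCCESS;
}

Compiling and running this aborts by design; the "GRES_IDX=gpu(IDX:)" field in the scontrol output further down in this ticket is where the empty index shows up for the job in question.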
The slurmctld binary and core dumps are now available at https://stanford.box.com/s/o2thc6wx92igd9zmvp46b4anqd5qbaw7

Let me know if you need anything else. My colleague Kilian should be online shortly. Thanks for your assistance.
service logs show:

Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458398 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458398 Assoc=13174
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458399 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458399 Assoc=13174
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458402 Assoc=3796
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458403 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458403 StepId=0
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458403 Assoc=2932
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39448495_3(39459510) Assoc=7821
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_895(39458407) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_895(39458407) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_896(39458408) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_897(39458409) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_897(39458409) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_898(39458410) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_899(39459610) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_899(39459610) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458411 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458411 Assoc=16821
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458414 Assoc=11509
Mar 21 17:51:55 sh-sl01.int systemd[1]: slurmctld.service: main process exited, code=dumped, status=6/ABRT
Mar 21 17:51:55 sh-sl01.int systemd[1]: Unit slurmctld.service entered failed state.
Mar 21 17:51:55 sh-sl01.int systemd[1]: slurmctld.service failed.
Created attachment 9662 [details]
"thread apply all bt full" output
It looks like the crash happens in GRES functions, on job 39458415. The thing is that job didn't request any GPU, here's the submission script:

#!/bin/bash
#
#BATCH --job-name=test
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=2G

srun /usr/bin/python2.7/python.exe ./SDDLPY/main_in.py

Yet, in the "scontrol show job" output we recorded in our prolog, it seems like it has TresPerTask=gpu:1

JobId=39458415 JobName=my_bash
   UserId=gaddiel(326207) GroupId=rzia(324517) MCS_label=N/A
   Priority=11442 Nice=0 Account=rzia QOS=normal
   JobState=CONFIGURING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:04 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2019-03-21T16:15:29 EligibleTime=2019-03-21T16:15:29
   AccrueTime=2019-03-21T16:15:30
   StartTime=2019-03-21T17:09:37 EndTime=2019-03-21T17:19:37 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-21T17:09:37
   Partition=normal AllocNode:Sid=sh-ln07:53306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=sh-101-36
   BatchHost=sh-101-36
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=sh-101-36 CPU_IDs=4 Mem=2048 GRES_IDX=gpu(IDX:)
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/gaddiel/my_bash
   WorkDir=/home/users/gaddiel
   StdErr=/home/users/gaddiel/slurm-39458415.out
   StdIn=/dev/null
   StdOut=/home/users/gaddiel/slurm-39458415.out
   Power=
   TresPerTask=gpu:1

In any case, we're in urgent need of a way to restart the controller...

Thanks!
--
Kilian
> The thing is that job didn't request any GPU, here's the submission script:

It does, right here:

> #SBATCH --gpus-per-task=1

Which then matches up correctly with:

> Yet, in the "scontrol show job" output we recorded in our prolog, it seems
> like it has TresPerTask=gpu:1

We're looking into how to patch around this without dropping the queue.
(In reply to Tim Wickberg from comment #5)
> > The thing is that job didn't request any GPU, here's the submission script:
>
> It does, right here:
>
> > #SBATCH --gpus-per-task=1

Argh, sorry I missed that; I'm too used to seeing --cpus-per-task, and this option is pretty unusual for our users. Especially since the partition this was submitted to does not have any GRES, I wasn't expecting that.

> We're looking into how to patch around this without dropping the queue.

Thanks!!

Cheers,
--
Kilian
Created attachment 9663 [details]
workaround initial crash

Can you apply this and restart the slurmctld?

This should move where the crash happens if nothing else, although I'll caution that you may very well still see another crash. Logs from slurmctld would be nice to have.

And did you upgrade slurmctld or other parts of your system since that job was submitted? I have not been able to reproduce this yet, and I'm wondering if this job snuck in under an older slurmctld version.
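The attached patch is not reproduced in this ticket. The sketch below is only a rough, self-contained illustration of the general shape of such a workaround; the toy_* names are hypothetical and this is not the real gres.c code. The idea: if the state file carries an empty GRES bitmap for a job, skip restoring it and keep recovering the rest of the queue rather than letting the bitstring assertion take slurmctld down.

/*
 * Rough illustration only -- NOT attachment 9663 and not real gres.c
 * code; all toy_* names are hypothetical. Skip restoring a GRES bitmap
 * whose recorded mask is empty instead of asserting during recovery.
 */
#include <stdio.h>

struct toy_gres_state {
	int bitmap_nbits;     /* 0 for the broken job: GRES_IDX=gpu(IDX:) */
	const char *hexmask;  /* hex mask string read from the state file */
};

/* returns 0 on success, like the bare "return 0;" mentioned in the follow-up below */
static int toy_unpack_gres(const struct toy_gres_state *g, unsigned int job_id)
{
	if (!g->hexmask || g->hexmask[0] == '\0' || g->bitmap_nbits == 0) {
		/* nothing usable recorded: warn and skip instead of asserting */
		fprintf(stderr, "skipping empty GRES bitmap for JobId=%u\n", job_id);
		return 0;
	}
	/* ... a real unpack would parse the hex mask into the bitmap here ... */
	return 0;
}

int main(void)
{
	struct toy_gres_state broken = { .bitmap_nbits = 0, .hexmask = "" };
	return toy_unpack_gres(&broken, 39458415);  /* recovers instead of aborting */
}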
Hi Tim,

(In reply to Tim Wickberg from comment #9)
> Created attachment 9663 [details]
> workaround initial crash
>
> Can you apply this and restart the slurmctld?

Thanks, doing that now.

> This should move where the crash happens if nothing else, although I'll
> caution that you may very well still see another crash.
>
> Logs from slurmctld would be nice to have.

OK, will send those back if it crashes again.

> And did you upgrade slurmctld or other parts of your system since that job
> was submitted? I have not been able to reproduce this yet, and I'm wondering
> if this job snuck in under an older slurmctld version.

Nope, we moved to 18.08.6 pretty much on its release date. And I just realized: --gpus-per-task is not an 18.08.x option, is it? It's not in the sbatch/srun man pages, so I assume it's coming with 19.05? It's clearly a typo from the user, given the lack of documentation and the absence of GPUs in their partition, but the job went in anyway.

Anyway, patching now.

Cheers,
--
Kilian
All right, the patch is applied (with "return 0;", because SLURM_SUCCESS is apparently not defined there), and the controller was able to start! It logged this:

Mar 21 20:10:08 sh-sl01 slurmctld[170018]: _sync_nodes_to_comp_job: JobId=39458415 in completing state
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: error: job_resources_node_inx_to_cpu_inx: Invalid node_inx
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: error: job_update_tres_cnt: problem getting offset of JobId=39458415
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: cleanup_completing: JobId=39458415 completion process took 10825 seconds

And the same thing for a few other job ids. It looks like things are working now, thanks! Is there anything else we should look for? Should we cancel those jobs if they risk making the controller crash elsewhere?

Thanks thanks thanks!
--
Kilian
> Nope, we moved to 18.08.6 pretty much on its release date. And I just
> realized: --gpus-per-task is not an 18.08.x option, is it? It's not in the
> sbatch/srun man pages, so I assume it's coming with 19.05?

Yeah... it's used for cons_tres, which is coming in 19.05, but it was added into 18.08 at one point (and intentionally left undocumented). A typo between 'c' and 'g' makes sense there.

> It's clearly a typo from the user, given the lack of documentation and the
> absence of GPUs in their partition, but the job went in anyway.

Ah... no GPUs available in the partition might be a good lead; I did not test that when quickly trying to reproduce.

Someone will follow up with you further tomorrow during normal hours - slurmctld logs from right before the crash would likely be of use if you can upload them sometime.

- Tim
(In reply to Tim Wickberg from comment #12)
> Someone will follow up with you further tomorrow during normal hours -
> slurmctld logs from right before the crash would likely be of use if you can
> upload them sometime.

Great, thanks again Tim, you saved our day! I've dropped the severity and I'll send the logs shortly.

Cheers,
--
Kilian
Created attachment 9664 [details]
slurmctld logs

Here's how things happened:

1. A slurmctld restart was initiated around 17:32, to remove a few nodes from the configuration (the sh-04-xx nodes referenced later on in the log):

Mar 21 17:34:06 sh-sl01 slurmctld[162518]: slurmctld version 18.08.6-2 started on cluster sherlock

2. slurmctld stopped successfully, and the problematic job (39458415) was already in the queue: it was submitted about an hour earlier, at 16:15:29.

3. At 17:34:16, during the job recovery phase, slurmctld crashed with the backtrace provided earlier:

Mar 21 17:34:16 sh-sl01 slurmctld[162518]: recovered JobId=39458411 StepId=Extern
Mar 21 17:34:16 sh-sl01 slurmctld[162518]: Recovered JobId=39458411 Assoc=16821
Mar 21 17:34:16 sh-sl01 slurmctld[162518]: Recovered JobId=39458414 Assoc=11509

The next line in the log is a subsequent restart try:

Mar 21 17:48:51 sh-sl01 slurmctld[163005]: slurmctld version 18.08.6-2 started on cluster sherlock
Mar 21 17:48:51 sh-sl01 slurmctld[163005]: job_submit.lua: job_submit: initialized
Mar 21 17:48:51 sh-sl01 slurmctld[163005]: error: _shutdown_bu_thread:send/recv sh-sl02: Connection refused
Mar 21 17:48:56 sh-sl01 slurmctld[163005]: No memory enforcing mechanism configured.
Mar 21 17:48:56 sh-sl01 slurmctld[163005]: layouts: no layout to initialize

4. We tried a few more restarts after putting sh-04-xx back in the config, but the error was the same.
Hi,

Could you send me slurmd.log from sh-06-28?

Dominik
Created attachment 9669 [details]
sh-06-28 log

Hi Dominik,

> Could you send me slurmd.log from sh-06-28?

Sure, here it is!

Cheers,
--
Kilian
Hi,

I am able to reproduce this, and I will now work on understanding what is going on there.

Dominik
Awesome, thanks! Cheers,
Hi Dominik,

I'm wondering if you have any update on this bug, specifically whether your patch has been merged. We currently apply it locally on 18.08.6 and would like to know if we still need to carry this patch over to 18.08.7.

Thanks!
--
Kilian
Hi,

Sorry, but it hasn't been merged into 18.08.7. This commit also fixes this issue, but it is only in master:
https://github.com/SchedMD/slurm/commit/89fdeaede7c

Dominik
(In reply to Dominik Bartkiewicz from comment #25)
> Sorry, but it hasn't been merged into 18.08.7. This commit also fixes this
> issue, but it is only in master:
> https://github.com/SchedMD/slurm/commit/89fdeaede7c

No worries, thanks for the info! I'll keep your patch from this ticket on 18.08.7 for now, then.

Cheers,
--
Kilian
Hi,

This commit fixes this issue in 18.08:
https://github.com/SchedMD/slurm/commit/4c48a84a6edb

I'm closing this bug as resolved/fixed. Please reopen if you have additional bugs/problems.

Dominik
(In reply to Dominik Bartkiewicz from comment #28)
> This commit fixes this issue in 18.08:
> https://github.com/SchedMD/slurm/commit/4c48a84a6edb
> I'm closing this bug as resolved/fixed. Please reopen if you have additional
> bugs/problems.

Thank you!

Cheers,
--
Kilian
*** Ticket 6923 has been marked as a duplicate of this ticket. ***
*** Ticket 8006 has been marked as a duplicate of this ticket. ***
*** Ticket 8591 has been marked as a duplicate of this ticket. ***