Ticket 6739

Summary: slurmctld crashes
Product: Slurm
Reporter: Stephane Thiell <sthiell>
Component: slurmctld
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: bart, damien.leong, fordste5, kilian, nate, rc, sthiell
Version: 18.08.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7826
https://bugs.schedmd.com/show_bug.cgi?id=8285
https://bugs.schedmd.com/show_bug.cgi?id=8651
Site: Stanford
Version Fixed: 18.08.8
Attachments: "thread apply all bt full" output
workaround initial crash
slurmctld logs
sh-06-28 log

Description Stephane Thiell 2019-03-21 18:57:34 MDT
Hi, this is a sev 1 issue. slurmctld crashed on Sherlock and we cannot restart it. We do have core dumps; I'll provide them ASAP. It seems to be this problem:

(gdb) bt
#0  0x00007f848135f207 in raise () from /lib64/libc.so.6
#1  0x00007f84813608f8 in abort () from /lib64/libc.so.6
#2  0x00007f8481358026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f84813580d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f8481b7d6c1 in bit_nclear (b=b@entry=0x7202880, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007f8481b7fc77 in bit_unfmt_hexmask (bitmap=0x7202880, str=<optimized out>) at bitstring.c:1397
#6  0x00007f8481b97f2d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffd30a59cd8, buffer=buffer@entry=0x32d6800, job_id=39458415, protocol_version=protocol_version@entry=8448) at gres.c:4318
#7  0x000000000045d079 in _load_job_state (buffer=buffer@entry=0x32d6800, protocol_version=<optimized out>) at job_mgr.c:1519
#8  0x0000000000460941 in load_all_job_state () at job_mgr.c:988
#9  0x000000000049c0cd in read_slurm_conf (recover=recover@entry=2, reconfig=reconfig@entry=false) at read_config.c:1334
#10 0x0000000000424e72 in run_backup (callbacks=callbacks@entry=0x7ffd30a5a860) at backup.c:257
#11 0x000000000042b985 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
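
For context, a minimal standalone sketch of the failing pattern, under the assumption suggested by the arguments above (start=0, stop=-1): the job's GRES bitmap appears to have been sized for zero GRES, so the range clear in bit_unfmt_hexmask() amounts to bit_nclear(b, 0, -1) and the bounds assertion aborts the daemon. This is a simplified re-creation, not the actual Slurm bitstring code; all names below are made up for illustration:

/* Simplified re-creation of the abort path (not the actual Slurm code):
 * clearing bits [0, nbits-1] on a bitmap allocated for zero GRES means
 * stop == -1, so the bounds assertion fires and the process aborts. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int64_t nbits;
    unsigned char *bits;
} bitmap_t;

static bitmap_t *bitmap_alloc(int64_t nbits)
{
    bitmap_t *b = calloc(1, sizeof(*b));
    b->nbits = nbits;
    b->bits = calloc(nbits / 8 + 1, 1);
    return b;
}

static void bitmap_nclear(bitmap_t *b, int64_t start, int64_t stop)
{
    /* Roughly mirrors bit_nclear()'s argument checks; with nbits == 0
     * and stop == -1 this is the assertion that aborts slurmctld. */
    assert(start >= 0 && stop >= start && stop < b->nbits);
    for (int64_t i = start; i <= stop; i++)
        b->bits[i / 8] &= (unsigned char)~(1u << (i % 8));
}

int main(void)
{
    int64_t gres_cnt = 0;                  /* gpu:1 recorded, but no GPUs available */
    bitmap_t *b = bitmap_alloc(gres_cnt);  /* zero-length bitmap */
    bitmap_nclear(b, 0, gres_cnt - 1);     /* equivalent to bit_nclear(b, 0, -1) */
    return 0;
}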

Thanks
Stephane
Comment 1 Stephane Thiell 2019-03-21 19:03:59 MDT
slurmctld binary and core dumps made available at https://stanford.box.com/s/o2thc6wx92igd9zmvp46b4anqd5qbaw7
Let me know if you need anything else. My colleague Kilian should be online shortly. Thanks for your assistance.
Comment 2 Stephane Thiell 2019-03-21 19:09:57 MDT
The service logs show:

Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458398 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458398 Assoc=13174
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458399 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458399 Assoc=13174
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458402 Assoc=3796
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458403 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458403 StepId=0
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458403 Assoc=2932
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39448495_3(39459510) Assoc=7821
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_895(39458407) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_895(39458407) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_896(39458408) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_897(39458409) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_897(39458409) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_898(39458410) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_899(39459610) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_899(39459610) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458411 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458411 Assoc=16821
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458414 Assoc=11509
Mar 21 17:51:55 sh-sl01.int systemd[1]: slurmctld.service: main process exited, code=dumped, status=6/ABRT
Mar 21 17:51:55 sh-sl01.int systemd[1]: Unit slurmctld.service entered failed state.
Mar 21 17:51:55 sh-sl01.int systemd[1]: slurmctld.service failed.
Comment 3 Kilian Cavalotti 2019-03-21 19:21:17 MDT
Created attachment 9662 [details]
"thread apply all bt full" output
Comment 4 Kilian Cavalotti 2019-03-21 19:30:44 MDT
It looks like the crash happens in GRES functions, on job 39458415. 

The thing is that job didn't request any GPU, here's the submission script:

#!/bin/bash
#
#BATCH --job-name=test
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=2G

srun /usr/bin/python2.7/python.exe ./SDDLPY/main_in.py


Yet, in the "scontrol show job" output we recorded in our prolog, it seems like it has TresPerTask=gpu:1

JobId=39458415 JobName=my_bash
   UserId=gaddiel(326207) GroupId=rzia(324517) MCS_label=N/A
   Priority=11442 Nice=0 Account=rzia QOS=normal
   JobState=CONFIGURING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:04 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2019-03-21T16:15:29 EligibleTime=2019-03-21T16:15:29
   AccrueTime=2019-03-21T16:15:30
   StartTime=2019-03-21T17:09:37 EndTime=2019-03-21T17:19:37 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-21T17:09:37
   Partition=normal AllocNode:Sid=sh-ln07:53306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=sh-101-36
   BatchHost=sh-101-36
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=sh-101-36 CPU_IDs=4 Mem=2048 GRES_IDX=gpu(IDX:)
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/gaddiel/my_bash
   WorkDir=/home/users/gaddiel
   StdErr=/home/users/gaddiel/slurm-39458415.out
   StdIn=/dev/null
   StdOut=/home/users/gaddiel/slurm-39458415.out
   Power=
   TresPerTask=gpu:1


In any case, we're in urgent need of a way to restart the controller... 

Thanks!
-- 
Kilian
Comment 5 Tim Wickberg 2019-03-21 20:34:05 MDT
> The thing is that job didn't request any GPU, here's the submission script:

It does, right here:

> #SBATCH --gpus-per-task=1

Which then matches up correctly with:

> Yet, in the "scontrol show job" output we recorded in our prolog, it seems
> like it has TresPerTask=gpu:1

We're looking into how to patch around this without dropping the queue.
Comment 7 Kilian Cavalotti 2019-03-21 20:38:26 MDT
(In reply to Tim Wickberg from comment #5)
> > The thing is that job didn't request any GPU, here's the submission script:
> 
> It does, right here:
> 
> > #SBATCH --gpus-per-task=1

Argh, sorry I missed that. I'm too used to seeing --cpus-per-task; this option is pretty unusual for our users, especially considering that the partition this was submitted to does not have any GRES, so I wasn't expecting it.

> We're looking into how to patch around this without dropping the queue.

Thanks!!

Cheers,
--
Kilian
Comment 9 Tim Wickberg 2019-03-21 20:50:39 MDT
Created attachment 9663 [details]
workaround initial crash

Can you apply this and restart the slurmctld?

This should move where the crash happens if nothing else, although I'll caution that you may very well still see another crash.

Logs from slurmctld would be nice to have.

And did you upgrade slurmctld or other parts of your system since that job was submitted? I have not been able to reproduce this yet, and I'm wondering if this job snuck in under an older slurmctld version.
Comment 10 Kilian Cavalotti 2019-03-21 20:54:24 MDT
Hi Tim, 

(In reply to Tim Wickberg from comment #9)
> Created attachment 9663 [details]
> workaround initial crash
> 
> Can you apply this and restart the slurmctld?

Thanks, doing that now.

> This should move where the crash happens if nothing else, although I'll
> caution that you may very well still see another crash.
> 
> Logs from slurmctld would be nice to have.

Ok, will send those back if it crashes again.

> And did you upgrade slurmctld or other parts of your system since that job
> was submitted? I have not been able to reproduce this yet, and I'm wondering
> if this job snuck in under an older slurmctld version.

Nope, we've moved to 18.08.6 pretty much on its release date.
And I just realized: --gpus-per-task is not an 18.08.x option, is it? It's not in the sbatch/srun man page. I assume it's coming with 19.05?

It's clearly a typo from the user, given the lack of documentation and the absence of GPUs in their partition, but the job went in anyway.

Anyway, patching now.

Cheers,
-- 
Kilian
Comment 11 Kilian Cavalotti 2019-03-21 21:14:10 MDT
All right, patch applied (with "return 0;", because SLURM_SUCCESS is apparently not defined there), and the controller was able to start!
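
The workaround itself is attachment 9663 and is not reproduced in this ticket; purely as a hypothetical illustration, a guard of this general shape would match the "return 0;" substitution mentioned above (the function and type names below are made up, not the actual patch):

/* Hypothetical illustration only -- not the actual patch from attachment
 * 9663.  General idea: bail out before unpacking/clearing a zero-length
 * GRES bitmap instead of letting the bitstring assertion abort slurmctld.
 * unpack_gres_bitmap() and bitmap_t are made up for this sketch. */
#include <stddef.h>
#include <stdint.h>

typedef struct { int64_t nbits; unsigned char *bits; } bitmap_t;

static int unpack_gres_bitmap(bitmap_t *bitmap, const char *hex_str)
{
    if (bitmap == NULL || bitmap->nbits <= 0)
        return 0;   /* literal 0, since SLURM_SUCCESS is not in scope there */

    /* ... normal hex-mask decode into the bitmap would go here ... */
    (void) hex_str;
    return 0;
}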

It logged this:

Mar 21 20:10:08 sh-sl01 slurmctld[170018]: _sync_nodes_to_comp_job: JobId=39458415 in completing state
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: error: job_resources_node_inx_to_cpu_inx: Invalid node_inx
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: error: job_update_tres_cnt: problem getting offset of JobId=39458415
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: cleanup_completing: JobId=39458415 completion process took 10825 seconds

And the same thing for a few other job ids.

It looks like things are working now, thanks!

Anything else we should look for? Should we cancel those jobs if they risk making the controller crash in other places?

Thanks thanks thanks!
-- 
Kilian
Comment 12 Tim Wickberg 2019-03-21 21:46:22 MDT
> Nope, we've moved to 18.08.6 pretty much on its release date.
> And I just realized: --gpus-per-task is not an 18.08.x option, is it? It's
> not in the sbatch/srun man page. I assume it's coming with 19.05?

Yeah... it's used by cons_tres, which is coming in 19.05, but it had been added to 18.08 at one point (and was intentionally left undocumented).

A typo between 'c' and 'g' makes sense there.

> It's clearly a typo from the user, given the lack of documentation and the
> absence of GPUs in their partition, but the job went in anyway.

Ah... no GPUs available in the partition might be a good lead; I did not test that when quickly trying to reproduce this.

Someone will follow up with you further tomorrow during normal hours - slurmctld logs from right before the crash would likely be of use if you can upload them sometime.

- Tim
Comment 13 Kilian Cavalotti 2019-03-21 21:50:24 MDT
(In reply to Tim Wickberg from comment #12)
> Someone will follow up with you further tomorrow during normal hours -
> slurmctld logs from right before the crash would likely be of use if you can
> upload them sometime.

Great, thanks again Tim, you saved our day!

I've dropped the severity and I'll send the logs shortly.

Cheers,
-- 
Kilian
Comment 14 Kilian Cavalotti 2019-03-21 21:57:40 MDT
Created attachment 9664 [details]
slurmctld logs

Here's how things happened:

1. a slurmctld restart was initiated around 17:32, to remove a few nodes from the configuration (the sh-04-xx nodes referenced later in the log)

Mar 21 17:34:06 sh-sl01 slurmctld[162518]: slurmctld version 18.08.6-2 started on cluster sherlock

2. slurmctld stopped successfully, and the problematic job (39458415) was already in the queue: it had been submitted about an hour earlier, at 16:15:29

3. at 17:34:16, during the job recovery phase, slurmctld crashed with the backtrace provided earlier

Mar 21 17:34:16 sh-sl01 slurmctld[162518]: recovered JobId=39458411 StepId=Extern
Mar 21 17:34:16 sh-sl01 slurmctld[162518]: Recovered JobId=39458411 Assoc=16821
Mar 21 17:34:16 sh-sl01 slurmctld[162518]: Recovered JobId=39458414 Assoc=11509
Mar 21 17:48:51 sh-sl01 slurmctld[163005]: slurmctld version 18.08.6-2 started on cluster sherlock

The next line in the log is a subsequent restart attempt.

Mar 21 17:48:51 sh-sl01 slurmctld[163005]: job_submit.lua: job_submit: initialized
Mar 21 17:48:51 sh-sl01 slurmctld[163005]: error: _shutdown_bu_thread:send/recv sh-sl02: Connection refused
Mar 21 17:48:56 sh-sl01 slurmctld[163005]: No memory enforcing mechanism configured.
Mar 21 17:48:56 sh-sl01 slurmctld[163005]: layouts: no layout to initialize

4. we tried a few more restarts after putting sh-04-xx back in the config, but the error was the same.
Comment 15 Dominik Bartkiewicz 2019-03-22 04:01:12 MDT
Hi

Could you send me slurmd.log from sh-06-28?

Dominik
Comment 16 Kilian Cavalotti 2019-03-22 07:48:19 MDT
Created attachment 9669 [details]
sh-06-28 log

Hi Dominik,

> Could you send me slurmd.log from sh-06-28?

Sure, here it is!

Cheers,
-- 
Kilian
Comment 17 Dominik Bartkiewicz 2019-03-22 08:41:40 MDT
Hi

I am able to reproduce this and will now work on understanding what is going on.

Dominik
Comment 18 Kilian Cavalotti 2019-03-22 08:43:08 MDT
Awesome, thanks!

Cheers,
Comment 24 Kilian Cavalotti 2019-04-12 09:34:49 MDT
Hi Dominik, 

I'm wondering if you have any update on this bug, specifically whether your patch has been merged. We currently apply it locally on 18.08.6 and would like to know if we still need to carry this patch over to 18.08.7.

Thanks!
-- 
Kilian
Comment 25 Dominik Bartkiewicz 2019-04-12 10:22:28 MDT
Hi

Sorry, but it hasn't been merged into 18.08.7.
This commit also fixes this issue, but it is only in master:
https://github.com/SchedMD/slurm/commit/89fdeaede7c

Dominik
Comment 26 Kilian Cavalotti 2019-04-12 10:24:36 MDT
(In reply to Dominik Bartkiewicz from comment #25)
> Hi
> 
> Sorry, but it hasn't been merged into 18.08.7.
> This commit also fixes this issue, but it is only in master:
> https://github.com/SchedMD/slurm/commit/89fdeaede7c

No worries, thanks for the info! I'll keep your patch from this ticket on 18.08.7 for now, then.

Cheers,
-- 
Kilian
Comment 28 Dominik Bartkiewicz 2019-04-15 01:52:52 MDT
Hi

This commit fixes this issue in 18.08:
https://github.com/SchedMD/slurm/commit/4c48a84a6edb
I'm closing this bug as resolved/fixed. Please reopen if you have additional bugs/problems.

Dominik
Comment 29 Kilian Cavalotti 2019-04-15 11:19:02 MDT
(In reply to Dominik Bartkiewicz from comment #28)
> Hi
> 
> This commit fixes this issue in 18.08:
> https://github.com/SchedMD/slurm/commit/4c48a84a6edb
> I'm closing this bug as resolved/fixed. Please reopen if you have additional
> bugs/problems.

Thank you!

Cheers,
--
Kilian
Comment 30 Marshall Garey 2019-04-29 12:20:33 MDT
*** Ticket 6923 has been marked as a duplicate of this ticket. ***
Comment 31 Dominik Bartkiewicz 2019-11-01 03:01:25 MDT
*** Ticket 8006 has been marked as a duplicate of this ticket. ***
Comment 32 Dominik Bartkiewicz 2020-02-27 09:04:39 MST
*** Ticket 8591 has been marked as a duplicate of this ticket. ***