Ticket 5675 - slurmctld core dumps on job dealloc
Summary: slurmctld core dumps on job dealloc
Status: RESOLVED DUPLICATE of ticket 5452
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.02.10
Hardware: Linux
Severity: 2 - High Impact
Assignee: Marshall Garey
 
Reported: 2018-09-06 11:04 MDT by ifisk
Modified: 2018-09-12 09:05 MDT

Site: Simons Foundation & Flatiron Institute


Description ifisk 2018-09-06 11:04:59 MDT
We are seeing continuous crashing of the slurmctld daemon. It seems to be associated with one job, which we cannot cancel or remove. The crash happens during job completion, but we don't know of anything that changed on our cluster today.


[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/cm/shared/apps/slurm/17.02.10/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004ea7f1 in _step_dealloc_lps (step_ptr=0xb55980) at step_mgr.c:2043
2043    step_mgr.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-17.02.10-500_cm8.0.x86_64
(gdb) where 
#0  0x00000000004ea7f1 in _step_dealloc_lps (step_ptr=0xb55980) at step_mgr.c:2043
#1  0x00000000004f2551 in post_job_step (step_ptr=0xb55980) at step_mgr.c:4586
#2  0x00000000004e5a8e in _internal_step_complete (job_ptr=0xb550c0, step_ptr=0xb55980) at step_mgr.c:262
#3  0x00000000004e5b22 in delete_step_records (job_ptr=0xb550c0) at step_mgr.c:292
#4  0x0000000000491510 in cleanup_completing (job_ptr=0xb550c0) at job_scheduler.c:4393
#5  0x00000000004a01b7 in make_node_idle (node_ptr=0x7ffff7e967b8, job_ptr=0xb550c0) at node_mgr.c:3754
#6  0x000000000047e18b in job_epilog_complete (job_id=107575, node_name=0x7fffc80008d0 "worker1017", return_code=0) at job_mgr.c:13432
#7  0x00000000004bdd97 in _slurm_rpc_epilog_complete (msg=0x7fffef2d9e50, run_scheduler=0x7fffef2d9d60, running_composite=false) at proc_req.c:1842
#8  0x00000000004b8f3d in slurmctld_req (msg=0x7fffef2d9e50, arg=0x7fffe0000a20) at proc_req.c:349
#9  0x0000000000443870 in _service_connection (arg=0x7fffe0000a20) at controller.c:1133
#10 0x00007ffff77aedc5 in start_thread (arg=0x7fffef2da700) at pthread_create.c:308
#11 0x00007ffff74dd73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Comment 1 Marshall Garey 2018-09-06 11:10:32 MDT
Hi,

I think we've fixed this. The backtrace looks familiar. I'm double checking that right now.

- Marshall
Comment 2 Marshall Garey 2018-09-06 11:18:09 MDT
Yeah, this looks like a duplicate of bug 5452. It was fixed in this commit on 17.11:

https://github.com/SchedMD/slurm/commit/fef07a40972

You can cherry pick this into 17.02. The NEWS file will conflict, but you can just ignore that conflict and apply the change.

This patch will prevent future issues, and I *think* get the slurmctld up and running again.

Can you verify this gets you up and running again?

If it doesn't, here's a patch that will get the slurmctld running. You should still use the other patch to prevent future issues.

https://bugs.schedmd.com/attachment.cgi?id=7321&action=diff
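
The cherry-pick workflow Marshall describes, sketched in a throwaway repo (all file contents and branch names below are invented for the demo; the real fix commit is fef07a40972 on the SchedMD tree): take the fix commit, keep your own NEWS when it conflicts, and continue.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
base=$(git symbolic-ref --short HEAD)

# Fake 17.02 tree: a NEWS file and a source file.
echo "17.02 changelog" > NEWS
echo "buggy dealloc" > step_mgr.c
git add . && git commit -qm "17.02 base"

# Fake 17.11 branch carrying the fix (it touches NEWS too).
git checkout -qb upstream-17.11
echo "17.11 changelog" > NEWS
echo "fixed dealloc" > step_mgr.c
git add . && git commit -qm "Fix step dealloc crash"
fix=$(git rev-parse HEAD)

# Back on our 17.02 branch, where NEWS has diverged.
git checkout -q "$base"
echo "17.02 changelog (local)" > NEWS
git commit -qam "local NEWS edit"

# Cherry-pick the fix; only NEWS conflicts, as Marshall predicted.
git cherry-pick "$fix" || true
git checkout --ours NEWS      # ignore the NEWS hunk, keep ours
git add NEWS
GIT_EDITOR=true git cherry-pick --continue

cat step_mgr.c                # the code change still applied cleanly
```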
Comment 3 Marshall Garey 2018-09-06 14:05:39 MDT
Is slurmctld running again?
Comment 4 Marshall Garey 2018-09-06 15:22:55 MDT
Dropping severity to sev-2 since we've identified the problem and have a fix available. Please let us know if it doesn't work and you can move it back to sev-1.
Comment 5 ifisk 2018-09-06 15:30:52 MDT
Hi,

   Thanks.

   We are running Bright, so it's less easy to patch. I think we are going to add

EnforcePartLimits=ALL

to the configuration, which should be a temporary workaround.

Thanks, Ian

Comment 6 Marshall Garey 2018-09-06 15:31:59 MDT
You'll want to REMOVE EnforcePartLimits=ALL. (ALL is the problem.)

EnforcePartLimits=ANY shouldn't have that problem.
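
For reference, the change being described is a one-line slurm.conf edit (only the setting under discussion is shown; the rest of the file is untouched):

```
# slurm.conf
# EnforcePartLimits=ALL   <- the setting implicated in bug 5452
EnforcePartLimits=ANY
```

After editing, `scontrol reconfigure` (or a slurmctld restart) should pick up the change.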
Comment 7 Jason Booth 2018-09-06 15:39:15 MDT
Hi Ian,

 Bright should be able to re-roll the tarballs with just the patch applied, or with a more recent version of Slurm. Note that an updated tarball with a newer Slurm version may break some of the integration work they have done for the specific version, although they would not know for certain without testing.

Kind regards,
Jason
Comment 8 Dylan Simon 2018-09-06 15:56:48 MDT
For the record, we managed to work around the repeated crashes by removing the node running the affected job (worker1017 from the stack dump) from slurm.conf. It has been stable since then. (The problem originally started on Slurm 17.02.2; we upgraded to 17.02.10, but the crashes continued until we removed the node.) Only a small number of jobs were lost.

We have since changed to EnforcePartLimits=ANY and will see if Bright can provide a patch, or their patched source so we can build our own. Thanks.
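
Dylan's workaround, sketched as a hypothetical slurm.conf excerpt (every node and partition name here except worker1017 is invented): drop the affected node from the NodeName and partition lines so slurmctld stops replaying the stuck job's epilog on it.

```
# slurm.conf (hypothetical excerpt)
# Before: NodeName=worker[1001-1020] CPUs=28 State=UNKNOWN
NodeName=worker[1001-1016,1018-1020] CPUs=28 State=UNKNOWN
PartitionName=general Nodes=worker[1001-1016,1018-1020] Default=YES
```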
Comment 9 Marshall Garey 2018-09-06 16:09:08 MDT
Thanks for the update. Definitely see if you can get that patch in.

Since 18.08 was released, we've technically dropped support for 17.02. We're giving sites time to upgrade to either 17.11 or 18.08, so we encourage you to make a plan to upgrade sometime soon.

I'm keeping this ticket open for now.
Comment 10 Marshall Garey 2018-09-12 09:05:02 MDT
Since you've found a workaround and know the fix, I'm going to close this ticket. Please reopen it if you have more issues.

- Marshall

*** This ticket has been marked as a duplicate of ticket 5452 ***