Ticket 6976

Summary:	Slurmd crashes in bit_nclear
Product:	Slurm	Reporter:	Josko Plazonic <plazonic>
Component:	slurmd	Assignee:	Dominik Bartkiewicz <bart>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	bart, matthews
Version:	18.08.6
Hardware:	Linux
OS:	Linux
Site:	Princeton (PICSciE)	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	18.08.8
Target Release:	---	DevPrio:	---

Description Josko Plazonic 2019-05-07 20:58:27 MDT

Good evening,

we've had a couple of crashes like:

#0  0x00002ac6d5a22207 in raise () from /usr/lib64/libc.so.6
#1  0x00002ac6d5a238f8 in abort () from /usr/lib64/libc.so.6
#2  0x00002ac6d5a1b026 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00002ac6d5a1b0d2 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00002ac6d45e16c1 in bit_nclear (b=b@entry=0x2ac6f40047d0, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00002ac6d45e3c77 in bit_unfmt_hexmask (bitmap=0x2ac6f40047d0, str=<optimized out>) at bitstring.c:1397
#6  0x00002ac6d45fbf2d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x2ac6f40046b0, buffer=buffer@entry=0x2ac6f4016470, job_id=2943467, 
    protocol_version=protocol_version@entry=8448) at gres.c:4318
#7  0x00002ac6d4645579 in slurm_cred_unpack (buffer=buffer@entry=0x2ac6f4016470, protocol_version=protocol_version@entry=8448) at slurm_cred.c:1309
#8  0x00002ac6d4670f2c in _unpack_batch_job_launch_msg (msg=msg@entry=0x2ac6f40010c8, buffer=0x2ac6f4016470, protocol_version=<optimized out>) at slurm_protocol_pack.c:12814
#9  0x00002ac6d4688606 in unpack_msg (msg=msg@entry=0x2ac6f4001090, buffer=buffer@entry=0x2ac6f4016470) at slurm_protocol_pack.c:1974
#10 0x00002ac6d46519cb in slurm_receive_msg_and_forward (fd=9, orig_addr=<optimized out>, msg=msg@entry=0x2ac6f4001090, timeout=<optimized out>, timeout@entry=0)
    at slurm_protocol_api.c:3821
#11 0x000000000040cac6 in _service_connection (arg=<optimized out>) at slurmd.c:537
#12 0x00002ac6d57d7dd5 in start_thread () from /usr/lib64/libpthread.so.0
#13 0x00002ac6d5ae9ead in clone () from /usr/lib64/libc.so.6

These are very similar to what you recently fixed in #6739, but here slurmd is crashing, not slurmctld, and in gres_plugin_job_state_unpack rather than in the node state unpack. I looked at the patch fixing slurmctld, and it is not clear to me that it will fix this failure as well.
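
For illustration only: frame #4 shows bit_nclear being called with start=0 and stop=-1. One plausible way to arrive at such a range (an assumption, not something confirmed in this ticket) is computing "size - 1" on a bitmap that has zero bits. The sketch below uses a hypothetical toy_bitmap_t and toy_* helpers, not Slurm's real bitstr_t or bitstring.c API, to show how a range check and a zero-size guard turn that case into an error return or a no-op instead of an assertion failure.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bitmap; NOT Slurm's bitstr_t layout. */
typedef struct {
    int64_t nbits;   /* number of valid bits, may be 0 */
    uint8_t *bits;   /* backing storage */
} toy_bitmap_t;

/* Allocate a zeroed bitmap; one spare byte keeps calloc(0) out of play. */
toy_bitmap_t *toy_alloc(int64_t nbits)
{
    toy_bitmap_t *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->nbits = nbits;
    b->bits = calloc((size_t)((nbits + 7) / 8) + 1, 1);
    if (!b->bits) {
        free(b);
        return NULL;
    }
    return b;
}

/* Clear bits in the inclusive range [start, stop].
 * Returns 0 on success, -1 if the range is invalid (e.g. stop == -1). */
int toy_nclear_checked(toy_bitmap_t *b, int64_t start, int64_t stop)
{
    if (!b || start < 0 || stop < start || stop >= b->nbits)
        return -1;
    for (int64_t i = start; i <= stop; i++)
        b->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
    return 0;
}

/* Clear every bit; a zero-size bitmap is treated as a no-op
 * instead of producing the range [0, -1]. */
int toy_clear_all(toy_bitmap_t *b)
{
    if (!b)
        return -1;
    if (b->nbits == 0)
        return 0;
    return toy_nclear_checked(b, 0, b->nbits - 1);
}
```

With a zero-bit bitmap, toy_nclear_checked(b, 0, b->nbits - 1) reports an error rather than aborting, and toy_clear_all(b) simply returns success.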

Unfortunately, we don't have much more detail: the sbatch file was already deleted, and all I can offer is the core file and any sacct details (where we have reqgres and allocgres = gpu:0). I also can't reproduce it. I'll check more carefully tomorrow, but this ran on a CPU-only node (so no gres.conf at all), even though any job that specifies --gres=gpu... (whether 0 or not) should end up on GPU nodes (we force the gpu partition if gres is non-empty and starts with gpu, otherwise we abort job submission). So, weird.
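
The submission rule described above can be sketched roughly as follows. This is only a reading of the sentence, not the site's actual job submit code: the enum, the function name classify_gres, and the choice to leave jobs with an empty gres untouched are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of the submit-time check; names are illustrative. */
typedef enum {
    SUBMIT_UNCHANGED,  /* no gres requested: leave the job as submitted (assumed) */
    SUBMIT_FORCE_GPU,  /* gres starts with "gpu": force the gpu partition */
    SUBMIT_REJECT      /* any other non-empty gres: abort the submission */
} submit_action_t;

/* Classify a job's gres string according to the rule described above. */
submit_action_t classify_gres(const char *gres)
{
    if (gres == NULL || gres[0] == '\0')
        return SUBMIT_UNCHANGED;
    if (strncmp(gres, "gpu", 3) == 0)
        return SUBMIT_FORCE_GPU;
    return SUBMIT_REJECT;
}
```

Under this reading, a request such as gpu:0 is still routed to the gpu partition, which is why a job with reqgres = gpu:0 landing on a CPU-only node is unexpected.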

Thanks!

Comment 1 Dominik Bartkiewicz 2019-05-09 03:39:39 MDT

Hi

I think the patch from bug 6739 should protect against this in the usual cases.
I can still see some problems in the code related to this issue.
I am working on this now, and I will let you know when we fix it.

Dominik

Comment 14 Jason Booth 2019-09-05 12:02:15 MDT

*** Ticket 7696 has been marked as a duplicate of this ticket. ***

Comment 18 Dominik Bartkiewicz 2019-09-09 04:28:54 MDT

Hi


We have added a series of patches that fully protect against such issues.
All of these changes will be included in the 20.02 release.

https://github.com/SchedMD/slurm/commit/0104ec74c
https://github.com/SchedMD/slurm/commit/6e94ef316
https://github.com/SchedMD/slurm/commit/8a49e5e21
https://github.com/SchedMD/slurm/commit/68fd88539

I'll go ahead and close out this ticket.
Dominik