Ticket 6976

Summary: Slurmd crashes in bit_nclear
Product: Slurm Reporter: Josko Plazonic <plazonic>
Component: slurmd Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart, matthews
Version: 18.08.6   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE) Alineos Sites: ---
Machine Name: CLE Version:
Version Fixed: 18.08.8 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Josko Plazonic 2019-05-07 20:58:27 MDT
Good evening,

we've had a couple of crashes like:

#0  0x00002ac6d5a22207 in raise () from /usr/lib64/libc.so.6
#1  0x00002ac6d5a238f8 in abort () from /usr/lib64/libc.so.6
#2  0x00002ac6d5a1b026 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00002ac6d5a1b0d2 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00002ac6d45e16c1 in bit_nclear (b=b@entry=0x2ac6f40047d0, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00002ac6d45e3c77 in bit_unfmt_hexmask (bitmap=0x2ac6f40047d0, str=<optimized out>) at bitstring.c:1397
#6  0x00002ac6d45fbf2d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x2ac6f40046b0, buffer=buffer@entry=0x2ac6f4016470, job_id=2943467, 
    protocol_version=protocol_version@entry=8448) at gres.c:4318
#7  0x00002ac6d4645579 in slurm_cred_unpack (buffer=buffer@entry=0x2ac6f4016470, protocol_version=protocol_version@entry=8448) at slurm_cred.c:1309
#8  0x00002ac6d4670f2c in _unpack_batch_job_launch_msg (msg=msg@entry=0x2ac6f40010c8, buffer=0x2ac6f4016470, protocol_version=<optimized out>) at slurm_protocol_pack.c:12814
#9  0x00002ac6d4688606 in unpack_msg (msg=msg@entry=0x2ac6f4001090, buffer=buffer@entry=0x2ac6f4016470) at slurm_protocol_pack.c:1974
#10 0x00002ac6d46519cb in slurm_receive_msg_and_forward (fd=9, orig_addr=<optimized out>, msg=msg@entry=0x2ac6f4001090, timeout=<optimized out>, timeout@entry=0)
    at slurm_protocol_api.c:3821
#11 0x000000000040cac6 in _service_connection (arg=<optimized out>) at slurmd.c:537
#12 0x00002ac6d57d7dd5 in start_thread () from /usr/lib64/libpthread.so.0
#13 0x00002ac6d5ae9ead in clone () from /usr/lib64/libc.so.6

that are very similar to what you recently fixed in #6739, but here it is slurmd crashing, not slurmctld, and in gres_plugin_job_state_unpack rather than in the node state unpack. I looked at the patch fixing slurmctld and it is not clear to me that it will fix this failure as well.
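
For readers following the backtrace: frame #4 shows bit_nclear called with start=0 and stop=-1. That is the range a "clear everything" call of the form [0, size - 1] would produce for a bitmap with zero bits, although this is an inference from the frame arguments and not something confirmed in the ticket. The sketch below reproduces that pattern with a hypothetical toy bitmap. toy_bitmap_t and all toy_* functions are made up for illustration and are not Slurm's bitstr_t API, and the guard shown is only one possible way to avoid the assertion, not necessarily what the eventual patches do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy bitmap for illustration only; NOT Slurm's bitstr_t. */
typedef struct {
    int64_t nbits;   /* number of valid bits, may be 0 */
    uint8_t *bits;   /* backing storage, at least one byte */
} toy_bitmap_t;

static toy_bitmap_t *toy_alloc(int64_t nbits)
{
    toy_bitmap_t *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->nbits = nbits;
    /* +1 so that a zero-bit bitmap still owns a valid allocation */
    b->bits = calloc((size_t)((nbits + 7) / 8) + 1, 1);
    if (!b->bits) {
        free(b);
        return NULL;
    }
    return b;
}

static void toy_free(toy_bitmap_t *b)
{
    if (b) {
        free(b->bits);
        free(b);
    }
}

/* The kind of range check an assertion inside a clear routine enforces. */
static int toy_range_valid(const toy_bitmap_t *b, int64_t start, int64_t stop)
{
    return start >= 0 && stop >= start && stop < b->nbits;
}

/* Unguarded range clear: aborts via assert() on an invalid range. */
static void toy_nclear(toy_bitmap_t *b, int64_t start, int64_t stop)
{
    assert(toy_range_valid(b, start, stop));
    for (int64_t i = start; i <= stop; i++)
        b->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/*
 * Guarded "clear all". With nbits == 0 the naive call would be
 * toy_nclear(b, 0, -1), matching the start/stop pair seen in frame #4.
 * Returns 0 on success, -1 if there is nothing to clear.
 */
static int toy_clear_all(toy_bitmap_t *b)
{
    if (!b || b->nbits <= 0)
        return -1;
    toy_nclear(b, 0, b->nbits - 1);
    return 0;
}
```

Calling toy_clear_all() on a bitmap from toy_alloc(0) returns -1 instead of reaching the assertion, while a non-empty bitmap is cleared as usual. Whether a zero-bit bitmap is really what arrived in the job credential here (for example because of the gpu:0 request) would need to be checked against the core file.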

Unfortunately we don't have much more detail: the sbatch file was already deleted, and all I can offer is the core file and any sacct details (where we have reqgres and allocgres = gpu:0). I also can't reproduce it. I'll check more carefully tomorrow, but this ran on a CPU-only node (so no gres.conf at all), whereas any job that specifies --gres=gpu... (whether 0 or not) should end up on GPU nodes (we force the gpu partition if gres is non-empty and starts with gpu, otherwise we abort job submission). So - weird.

Thanks!
Comment 1 Dominik Bartkiewicz 2019-05-09 03:39:39 MDT
Hi

I think the patch from bug 6739 should protect against this in the usual cases.
I can still see some problems in the code related to this issue.
I am working on this now and will let you know when we have fixed it.

Dominik
Comment 14 Jason Booth 2019-09-05 12:02:15 MDT
*** Ticket 7696 has been marked as a duplicate of this ticket. ***
Comment 18 Dominik Bartkiewicz 2019-09-09 04:28:54 MDT
Hi


We added a series of patches to fully protect against such issues.
All of these changes will be included in the 20.02 release.

https://github.com/SchedMD/slurm/commit/0104ec74c
https://github.com/SchedMD/slurm/commit/6e94ef316
https://github.com/SchedMD/slurm/commit/8a49e5e21
https://github.com/SchedMD/slurm/commit/68fd88539

I'll go ahead and close out this ticket.
Dominik