| Summary: | Slurmd crashes in bit_nclear | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Josko Plazonic <plazonic> |
| Component: | slurmd | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart, matthews |
| Version: | 18.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Princeton (PICSciE) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 18.08.8 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi,

I think the patch from bug 6739 should protect against this in the usual cases. I can still see some problems in the code related to this issue. I am working on this now and will let you know when we fix it.

Dominik

*** Ticket 7696 has been marked as a duplicate of this ticket. ***

Hi,

We have added a series of patches to fully protect against such issues. All of these changes will be included in the 20.02 release.

https://github.com/SchedMD/slurm/commit/0104ec74c
https://github.com/SchedMD/slurm/commit/6e94ef316
https://github.com/SchedMD/slurm/commit/8a49e5e21
https://github.com/SchedMD/slurm/commit/68fd88539

I'll go ahead and close out this ticket.

Dominik

Good evening,

We've had a couple of crashes like:

    #0 0x00002ac6d5a22207 in raise () from /usr/lib64/libc.so.6
    #1 0x00002ac6d5a238f8 in abort () from /usr/lib64/libc.so.6
    #2 0x00002ac6d5a1b026 in __assert_fail_base () from /usr/lib64/libc.so.6
    #3 0x00002ac6d5a1b0d2 in __assert_fail () from /usr/lib64/libc.so.6
    #4 0x00002ac6d45e16c1 in bit_nclear (b=b@entry=0x2ac6f40047d0, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
    #5 0x00002ac6d45e3c77 in bit_unfmt_hexmask (bitmap=0x2ac6f40047d0, str=<optimized out>) at bitstring.c:1397
    #6 0x00002ac6d45fbf2d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x2ac6f40046b0, buffer=buffer@entry=0x2ac6f4016470, job_id=2943467, protocol_version=protocol_version@entry=8448) at gres.c:4318
    #7 0x00002ac6d4645579 in slurm_cred_unpack (buffer=buffer@entry=0x2ac6f4016470, protocol_version=protocol_version@entry=8448) at slurm_cred.c:1309
    #8 0x00002ac6d4670f2c in _unpack_batch_job_launch_msg (msg=msg@entry=0x2ac6f40010c8, buffer=0x2ac6f4016470, protocol_version=<optimized out>) at slurm_protocol_pack.c:12814
    #9 0x00002ac6d4688606 in unpack_msg (msg=msg@entry=0x2ac6f4001090, buffer=buffer@entry=0x2ac6f4016470) at slurm_protocol_pack.c:1974
    #10 0x00002ac6d46519cb in slurm_receive_msg_and_forward (fd=9, orig_addr=<optimized out>, msg=msg@entry=0x2ac6f4001090, timeout=<optimized out>, timeout@entry=0) at slurm_protocol_api.c:3821
    #11 0x000000000040cac6 in _service_connection (arg=<optimized out>) at slurmd.c:537
    #12 0x00002ac6d57d7dd5 in start_thread () from /usr/lib64/libpthread.so.0
    #13 0x00002ac6d5ae9ead in clone () from /usr/lib64/libc.so.6

These are very similar to what you recently fixed in #6739, but here slurmd is crashing, not slurmctld, and the crash is in gres_plugin_job_state_unpack rather than in the node state unpack. I looked at the patch fixing slurmctld, and it is not clear to me that it will fix this failure as well.

Unfortunately, we don't have much more detail: the sbatch file was already deleted, and all I can offer is the core file and any sacct details (where we have reqgres and allocgres = gpu:0). I also can't reproduce it. I'll check more carefully tomorrow, but this job ran on a CPU-only node (so no gres.conf at all), even though any job that specifies --gres=gpu... (whether 0 or not) should end up on GPU nodes (we force the gpu partition if gres is non-empty and starts with gpu, otherwise abort job submission). So, weird.

Thanks!
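
For readers following the backtrace: frame #4 shows bit_nclear being called with start=0 and stop=-1. One plausible reading (an assumption, not confirmed in this ticket) is that the caller clears the whole bitmap as 0 through size - 1, and the bitmap had zero bits, which would match a gpu:0 allocation. The sketch below reproduces that arithmetic on a toy bitmap and shows one way to guard against it. All names here (toy_bitmap_t, toy_alloc, toy_nclear, toy_clear_all, toy_free) are hypothetical stand-ins, not Slurm's bitstring API, and this is not the fix that was committed upstream.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a bitmap; this is NOT Slurm's bitstr_t. */
typedef struct {
    long nbits;            /* number of valid bits, may be 0 */
    unsigned char *bytes;  /* backing storage, always at least 1 byte */
} toy_bitmap_t;

/* Allocate a zeroed bitmap; nbits == 0 is allowed and yields an empty map. */
static toy_bitmap_t *toy_alloc(long nbits)
{
    if (nbits < 0)
        return NULL;
    toy_bitmap_t *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    size_t nbytes = (size_t)(nbits + 7) / 8;
    b->bytes = calloc(nbytes ? nbytes : 1, 1);
    if (!b->bytes) {
        free(b);
        return NULL;
    }
    b->nbits = nbits;
    return b;
}

static void toy_free(toy_bitmap_t *b)
{
    if (b) {
        free(b->bytes);
        free(b);
    }
}

/*
 * Clear bits start..stop inclusive.
 * Returns -1 on an invalid range instead of asserting, so a caller that
 * passes stop == -1 gets an error rather than aborting the process.
 */
static int toy_nclear(toy_bitmap_t *b, long start, long stop)
{
    if (!b || start < 0 || stop < start || stop >= b->nbits)
        return -1;
    for (long i = start; i <= stop; i++)
        b->bytes[i / 8] &= (unsigned char)~(1u << (i % 8));
    return 0;
}

/* Clear every bit; an empty bitmap is treated as a successful no-op. */
static int toy_clear_all(toy_bitmap_t *b)
{
    if (!b)
        return -1;
    if (b->nbits == 0)
        return 0;  /* avoids computing the range 0..-1 */
    return toy_nclear(b, 0, b->nbits - 1);
}
```

With a zero-bit bitmap, toy_nclear(b, 0, b->nbits - 1) is called with stop = -1 and is rejected, while toy_clear_all(b) returns 0 without touching the range. Whether an empty bitmap should be an error or a no-op is a design choice; the sketch picks the no-op only for illustration.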