Ticket 8360

Summary: slurmctld crashed
Product: Slurm    Reporter: ARC Admins <arc-slurm-admins>
Component: slurmctld    Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: ---    CC: bart, gavin
Version: 18.08.8
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8285
          https://bugs.schedmd.com/show_bug.cgi?id=8651
Site: University of Michigan
Linux Distro: CentOS
Version Fixed: 18.08
Attachments: slurmctld log
slurm.conf
gdb outputs
bandaid - should make slurmctld start

Description ARC Admins 2020-01-21 08:10:37 MST
Created attachment 12785
slurmctld log

Hello,

We pushed a change out to our slurm.conf this morning, and then slurmctld crashed and core dumped. I'm not seeing anything wrong in our slurm.conf file, and nothing in the logs points to slurm.conf as the cause.

When attempting to start slurmctld with -Dvvv, I see:

```
slurmctld: debug3: create_mmap_buf: loaded file `/var/spool/slurm.state/job_state` as Buf
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 3446858
slurmctld: recovered JobId=3201741 StepId=Extern
slurmctld: debug3: found correct association
slurmctld: Recovered JobId=3201741 Assoc=8609
slurmctld: debug3: found correct qos
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Aborted (core dumped)
```

Attached is the slurmctld.log

Thanks

David
Comment 1 ARC Admins 2020-01-21 08:11:55 MST
Created attachment 12786
slurm.conf
Comment 2 ARC Admins 2020-01-21 08:40:32 MST
Hello,

We attempted to go to version 18.08.9 but ran into the following issue with the slurmdbd:

```
[2020-01-21T10:15:53.253] error: mysql_query failed: 1280 Incorrect index name 'type'
alter table tres_table modify `creation_time` bigint unsigned not null, modify `deleted` tinyint default 0 not null, modify `id` int not null auto_increment, modify `type` tinytext not null, modify `name` tinytext not null default '', drop primary key, add primary key (id), drop index type, add unique index (type(20), name(20));
[2020-01-21T10:15:53.253] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2020-01-21T10:15:53.253] error: cannot create accounting_storage context for accounting_storage/mysql
[2020-01-21T10:15:53.253] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
```

We have since reverted to 18.08.8.

David
Comment 3 Marshall Garey 2020-01-21 08:44:25 MST
Could you please get the backtrace from the coredump?

In gdb, run:

bt
bt full
thread apply all bt
thread apply all bt full


and upload the results as an attachment.
Comment 4 Gavin D. Howard 2020-01-21 08:47:04 MST
Can you also tell us what change you made to your slurm.conf? Either a patch or a prose description will do.
Comment 6 ARC Admins 2020-01-21 08:52:39 MST
Created attachment 12788
gdb outputs
Comment 7 ARC Admins 2020-01-21 08:55:05 MST
Gavin,

The change we made today was adding AllocNodes to the standard-oc partition:

```
PartitionName=standard-oc AllowAccounts=ALL QoS=standard-oc AllocNodes=gl-campus-login,gl-build,gl[3002-3010] Nodes=gl[3002-3010] TRESBillingWeights=cpu=43.03240741,mem=6.147486772G
```

Users needed to be able to submit to this partition from its compute nodes as part of their workflow.

David
Comment 8 Marshall Garey 2020-01-21 08:56:49 MST
Created attachment 12789
bandaid - should make slurmctld start

This looks like something we've seen before (bug 8285). Here's the patch from that bug.

Can you please apply this and restart slurmctld? Let us know if it starts up again or not.

This bug and bug 8285 look like duplicates of bug 6739. However, you're already on 18.08.8, and the fix for bug 6739 (commit 4c48a84a6edb) is in 18.08.8, so it makes me wonder if there's another, very similar bug.

Anyway, this patch should get the slurmctld running.
Comment 9 ARC Admins 2020-01-21 09:26:24 MST
Marshall,

Thanks for the quick turn around.

We're looking at the source files for 18.08.8 and 18.08.9 and don't see the patched code in `src/common/bitstring.c` (lines 1397-1399); in other words, the new lines the patch would add aren't there. So we're not sure whether this patch is just a stopgap or whether SchedMD meant it to land in the source officially, but we wanted to point that out. According to my colleague, it doesn't appear to be in 19.05 either.

David
Comment 10 ARC Admins 2020-01-21 09:31:22 MST
Marshall et al,

When applying the provided patch to the 18.08.8 source and compiling, we see:

```
bitstring.c: In function 'bit_unfmt_hexmask':
bitstring.c:1398:10: error: 'SLURM_SUCCESS' undeclared (first use in this function)
   return SLURM_SUCCESS;
          ^
bitstring.c:1398:10: note: each undeclared identifier is reported only once for each function it appears in
```

David
Comment 11 Marshall Garey 2020-01-21 09:42:30 MST
Sorry, my mistake.

Replace SLURM_SUCCESS with 0 in the patch, and it will compile.
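
For context, here is a small standalone sketch of why the assertion fires and what the bandaid does. This is not the actual attachment 12789 hunk or Slurm source: the bitmap size, the clear_range_guarded() wrapper, and the exact guard condition are illustrative assumptions, inferred from the assertion in the description and the hunk location discussed above.

```c
#include <assert.h>
#include <stdio.h>

#define BITMAP_BITS 64	/* illustrative size; Slurm sizes bitmaps per job */

/* Mimics Slurm's bit_nclear(): clear bits [start, stop]. Like the real
 * function, it asserts the range is in bounds -- this is the
 * "(start) < ((b)[1])" assertion that aborted slurmctld when it
 * recovered a corrupted job_state entry. */
static void bit_nclear(unsigned char *bits, int start, int stop)
{
	assert(start < BITMAP_BITS);
	assert(stop < BITMAP_BITS);
	for (int i = start; i <= stop; i++)
		bits[i / 8] &= (unsigned char) ~(1u << (i % 8));
}

/* Hypothetical guarded caller, in the spirit of the bandaid: if the
 * recovered state asks for bits past the end of the bitmap, return
 * success (the patch's "return SLURM_SUCCESS;", i.e. "return 0;" on
 * 18.08) rather than tripping the assertion. */
static int clear_range_guarded(unsigned char *bits, int start, int stop)
{
	if (start >= BITMAP_BITS || stop >= BITMAP_BITS)
		return 0;
	bit_nclear(bits, start, stop);
	return 0;
}

int main(void)
{
	unsigned char bits[BITMAP_BITS / 8];

	for (int i = 0; i < BITMAP_BITS / 8; i++)
		bits[i] = 0xff;

	clear_range_guarded(bits, 0, 7);	/* in range: bits cleared */
	clear_range_guarded(bits, 100, 120);	/* out of range: skipped, no abort */
	/* bit_nclear(bits, 100, 120); -- would abort like slurmctld did */
	printf("bits[0] = 0x%02x, no abort\n", bits[0]);
	return 0;
}
```

Note this only papers over the symptom so slurmctld can start; the corrupted job data that produced the out-of-range request still needs a real fix.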
Comment 12 ARC Admins 2020-01-21 09:58:59 MST
Marshall et al,

The patch worked. After applying it, we saw the following in slurmctld.log:

```
[2020-01-21T11:44:41.089] fatal: Invalid node names in partition debug
```

Once we fixed the issue with the node not being found in the partition, things worked.

David
Comment 13 Marshall Garey 2020-01-21 10:53:32 MST
Great! I'm changing this to a sev-3.

Two questions:

* When did you upgrade to 18.08.8?
* When was the offending job (3222565) submitted?
  (sacct -j 3222565 --format=jobid,submit)

18.08.8 had a fix (commit 4c48a84a6edb) to prevent corrupted job structures, but if the job was submitted before you upgraded to 18.08.8, it's possible the job struct was already corrupted. If the job was submitted after the upgrade, then we need to fix something else.
Comment 14 ARC Admins 2020-01-21 11:09:44 MST
Marshall,

Thanks. I'm working on tracking down when we went to 18.08.8, but I don't think it was within the past couple of weeks. The job info:

```
[root@glctld ~]# sacct -j 3222565 --format=jobid,submit
       JobID              Submit
------------ -------------------
3222565      2020-01-10T12:34:33
3222565.ext+ 2020-01-10T12:36:57
```

David
Comment 15 ARC Admins 2020-01-21 11:13:42 MST
Marshall,

From our git commits, it looks like we made the move to 18.08.8 in late September or early October.

Best,

David
Comment 16 Marshall Garey 2020-01-21 13:35:30 MST
Thanks for the additional information. I'll keep looking into this. Can you keep the core file and slurmctld binary in case I ask for additional gdb output?
Comment 17 ARC Admins 2020-01-21 13:46:23 MST
Absolutely!

There were a few core dumps as we tried to restart things to understand them better. So, there's a treasure trove of stuff to be mined if needed.

Thanks again!!

David
Comment 18 ARC Admins 2020-03-10 13:38:51 MDT
This can be closed.
We found a node in slurm.conf that had been removed from the cluster. Once slurm.conf was updated to remove that node, everything started working.
We have also now upgraded to 19.05.5.

Thanks
Vasile
Comment 19 ARC Admins 2020-03-17 10:00:49 MDT
We discovered the cause of the problem.