Created attachment 12785 [details]
slurmctld log

Hello,

We pushed a change out to our slurm.conf this morning, and then slurmctld crashed and core dumped. I'm not seeing anything wrong in our slurm.conf file, and nothing in the logs points to slurm.conf as the problem. When attempting to start slurmctld with -Dvvv I see:

```
slurmctld: debug3: create_mmap_buf: loaded file `/var/spool/slurm.state/job_state` as Buf
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 3446858
slurmctld: recovered JobId=3201741 StepId=Extern
slurmctld: debug3: found correct association
slurmctld: Recovered JobId=3201741 Assoc=8609
slurmctld: debug3: found correct qos
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Aborted (core dumped)
```

Attached is the slurmctld.log.

Thanks,
David
Created attachment 12786 [details]
slurm.conf
Hello,

We attempted to go to version 18.08.9 but ran into the following issue with slurmdbd:

```
[2020-01-21T10:15:53.253] error: mysql_query failed: 1280 Incorrect index name 'type'
alter table tres_table modify `creation_time` bigint unsigned not null, modify `deleted` tinyint default 0 not null, modify `id` int not null auto_increment, modify `type` tinytext not null, modify `name` tinytext not null default '', drop primary key, add primary key (id), drop index type, add unique index (type(20), name(20));
[2020-01-21T10:15:53.253] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2020-01-21T10:15:53.253] error: cannot create accounting_storage context for accounting_storage/mysql
[2020-01-21T10:15:53.253] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
```

We have since reverted to 18.08.8.

David
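For anyone hitting the same upgrade error, a quick way to see which index names already exist on tres_table is to ask MySQL directly. This is a minimal sketch, assuming the default accounting database name slurm_acct_db and local root access to MySQL; adjust the database name and credentials to your setup:

```
mysql -u root -p slurm_acct_db -e "SHOW INDEX FROM tres_table; SHOW CREATE TABLE tres_table\G"
```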
Could you please get the backtrace from the core dump? In gdb, run:

```
bt
bt full
thread apply all bt
thread apply all bt full
```

and upload the results as an attachment.
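If it helps, here is a minimal sketch of capturing all of that in one gdb session with the output logged to a file. The binary and core file paths are placeholders; point gdb at your actual slurmctld binary and the core it dumped:

```
gdb /usr/sbin/slurmctld /path/to/core
(gdb) set logging file slurmctld-backtrace.txt
(gdb) set logging on
(gdb) bt
(gdb) bt full
(gdb) thread apply all bt
(gdb) thread apply all bt full
(gdb) set logging off
(gdb) quit
```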
Can you also tell us what change you made to your slurm.conf? Either a patch or prose description will do.
Created attachment 12788 [details]
gdb outputs
Gavin,

The change we made today was adding AllocNodes to the standard-oc partition:

```
PartitionName=standard-oc AllowAccounts=ALL QoS=standard-oc AllocNodes=gl-campus-login,gl-build,gl[3002-3010] Nodes=gl[3002-3010] TRESBillingWeights=cpu=43.03240741,mem=6.147486772G
```

Users needed to be able to submit to this partition from the compute nodes in it as part of their workflow.

David
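Once slurmctld is back up, a quick way to confirm the running controller picked up a change like this is to reread the config and inspect the partition (standard scontrol commands; the partition name is just taken from the config above):

```
scontrol reconfigure
scontrol show partition standard-oc | grep -E 'AllocNodes|Nodes'
```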
Created attachment 12789 [details]
bandaid - should make slurmctld start

This looks like something we've seen before (bug 8285). Here's the patch from that bug. Can you please apply it and restart slurmctld? Let us know whether it starts up again.

This bug and 8285 look like duplicates of 6739. However, you're already on 18.08.8, and the fix for 6739 (commit 4c48a84a6edb) is in 18.08.8, so it makes me wonder if there's another very similar bug. In any case, this patch should get the slurmctld running.
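For reference, a rough outline of applying a patch like this to an 18.08.8 source tree and rebuilding. The patch filename and paths are illustrative, and you should reuse whatever configure options your existing build used:

```
cd slurm-18.08.8
patch -p1 < /path/to/bandaid.patch    # use -p0/-p1 to match the paths inside the patch
./configure --prefix=/usr             # same options as your current build
make -j && make install
```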
Marshall,

Thanks for the quick turnaround. We looked at the source for 18.08.8 and 18.08.9 and don't see the patched code in `src/common/bitstring.c` (lines 1397-1399); in other words, the lines the patch adds aren't there. So I'm not sure whether this patch is just a stopgap or whether SchedMD meant it to go into the source officially, but I wanted to point that out. According to my colleague, it doesn't appear to be in 19.05 either.

David
Marshall et al.,

When applying the provided patch to the 18.08.8 source and compiling, we're seeing:

```
bitstring.c: In function 'bit_unfmt_hexmask':
bitstring.c:1398:10: error: 'SLURM_SUCCESS' undeclared (first use in this function)
   return SLURM_SUCCESS;
          ^
bitstring.c:1398:10: note: each undeclared identifier is reported only once for each function it appears in
```

David
Sorry, my mistake. Replace SLURM_SUCCESS with 0 in the patch, and it will compile
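That matches the compile error above: SLURM_SUCCESS is defined as 0 in Slurm's slurm_errno.h, which bitstring.c evidently doesn't pull in on 18.08. A one-liner to make the substitution in the already-patched file (assuming the patched return is the only such occurrence in bitstring.c, which the compile error suggests), or just edit line 1398 by hand:

```
sed -i 's/return SLURM_SUCCESS;/return 0;/' src/common/bitstring.c
```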
Marshall et al.,

The patch worked. With it applied, slurmctld got far enough to log:

```
[2020-01-21T11:44:41.089] fatal: Invalid node names in partition debug
```

Once we fixed the node that could not be found in that partition, things worked.

David
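For what it's worth, a quick sanity check for that kind of mismatch is to compare the node lists referenced by partitions against the defined NodeName lines. A minimal sketch, assuming slurm.conf lives at /etc/slurm/slurm.conf:

```
# Nodes= fields on partition lines (the leading space keeps AllocNodes= out of the match)
grep -E '^PartitionName=' /etc/slurm/slurm.conf | grep -oE ' Nodes=[^ ]+'
# node definitions
grep -E '^NodeName=' /etc/slurm/slurm.conf
# expand a hostlist range for easier comparison, e.g.:
scontrol show hostnames 'gl[3002-3010]'
```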
Great! I'm changing this to a sev-3.

Two questions:
* When did you upgrade to 18.08.8?
* When was the offending job (3222565) submitted? (sacct -j 3222565 --format=jobid,submit)

18.08.8 had a fix (commit 4c48a84a6edb) to prevent corrupted job structures, but if the job was submitted before 18.08.8 then it's possible the job struct was corrupted. If the job was submitted after 18.08.8, then we need to fix something else.
Marshall,

Thanks. I'm working to track down when we went to 18.08.8, but I don't think it was within the past couple of weeks. The job info:

```
[root@glctld ~]# sacct -j 3222565 --format=jobid,submit
       JobID              Submit
------------ -------------------
3222565      2020-01-10T12:34:33
3222565.ext+ 2020-01-10T12:36:57
```

David
Marshall,

From our git commits, it looks like we made the move to 18.08.8 in late September or early October.

Best,
David
Thanks for the additional information. I'll keep looking into this. Can you keep the core file and slurmctld binary in case I ask for additional gdb output?
Absolutely! There were a few core dumps as we tried to restart things to understand them better, so there's a treasure trove of stuff to be mined if needed. Thanks again!!

David
This can be closed. We found a node in slurm.conf that had been removed from the cluster; once we updated slurm.conf to remove that node, everything started to work. We have also now upgraded to 19.05.5.

Thanks,
Vasile
We discovered the cause of the problem.