Ticket 3891

Summary: slurmdbd agent queue is full
Product: Slurm Reporter: Josko Plazonic <plazonic>
Component: slurmdbd Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: --- CC: da, dwg, fullop, mcoyne, mej, peltzpl, sts
Version: 17.02.4   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE)
Version Fixed: 17.02.6
Attachments: slurmdbd log
slurmctld log
Possibly corrupted dbd.messages

Description Josko Plazonic 2017-06-13 10:45:40 MDT
We are in the midst of a maintenance and we are having trouble with our central slurmdbd.  Or rather, the slurmctlds are deeply unhappy with slurmdbd.  We have a lot of errors like:

[2017-06-13T12:44:19.349] error: slurmdbd: agent queue is full, discarding request

When I turn on debugging on slurmdbd, I see a LOT of entries like:

[2017-06-13T12:44:37.493] debug2: DBD_NODE_STATE: NODE:perseus-r2c3n7 STATE:DOWN REASON:(null) UID:123 TIME:1497371615
[2017-06-13T12:44:37.496] debug2: inserting perseus-r2c3n7(perseus) with tres of '1=28,2=128000'
[2017-06-13T12:44:37.496] debug2: 9(as_mysql_cluster.c:1352) query
update "perseus_event_table" set time_end=1497371615 where time_end=0 and node_name='perseus-r2c3n7';insert into "perseus_event_table" (node_name, state, tres, time_start, reason, reason_uid) values ('perseus-r2c3n7', 100385, '1=28,2=128000', 1497371615, '(null)', 123) on duplicate key update time_end=0;
[2017-06-13T12:44:37.499] debug4: got 0 commits

constantly.

We tried truncating the event tables, but it does not help; the complaints return very quickly.  Suggestions?
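The discard behavior in the error above can be sketched as a bounded FIFO that drops new requests once full. This is a toy model for illustration only, not the actual Slurm source; all names are hypothetical:

```python
from collections import deque

class DBDAgentQueue:
    """Toy model of slurmctld's slurmdbd agent queue: a bounded FIFO
    that discards new requests once the limit is reached."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.queue = deque()
        self.discarded = 0

    def enqueue(self, request):
        if len(self.queue) >= self.max_size:
            self.discarded += 1  # "agent queue is full, discarding request"
            return False
        self.queue.append(request)
        return True

    def drain(self, n):
        """Simulate slurmdbd committing up to n queued requests."""
        done = 0
        while self.queue and done < n:
            self.queue.popleft()
            done += 1
        return done
```

If producers enqueue faster than `drain` runs, `discarded` grows without bound, which matches the symptom reported here.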
Comment 1 Tim Wickberg 2017-06-13 10:59:41 MDT
That's not good.

A few obvious questions first:

- Is there still space available for MySQL to write out additional records?

- Have you restarted slurmdbd?

- Is there any note of potential problems in the MySQL logs?
Comment 2 Tim Wickberg 2017-06-13 11:04:32 MDT
Can you walk me through the sequence of events leading to this?

There should have been earlier signs of problems with the slurmdbd connection in the slurmctld log, well before it started to discard updates. Are you able to attach some of those?

Why did you truncate the event table?
Comment 3 Josko Plazonic 2017-06-13 11:15:22 MDT
We had errors like:

update "perseus_event_table" set time_end=1497354340 where time_end=0 and node_name='perseus-r4c2n11';insert into "perseus_event_table" (node_name, state, tres, time_start, reason, reason_uid) values ('perseus-r4c2n11', 100385, '1=28,2=128000', 1497354340, '(null)', 123) on duplicate key update time_end=0;

this morning.  And when I raised the debug level, that's all I see - a constant stream of CLUSTER_event_table updates.  It only stops if I stop the slurmctld on that cluster, and it continues even if I stop all of the slurmds.

I guess I thought the duplicate key was bad and that it was failing to update entries, but I guess that's not the case.

We tried downgrading Slurm back to 17.02.2, but it does not seem to help.
Comment 4 Josko Plazonic 2017-06-13 11:21:23 MDT
Forgot to say that the errors started shortly after our maintenance began; the first one is:

Jun 13 06:02:04 tiger-pbs slurmctld: [2017-06-13T06:01:58.397] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jun 13 06:04:03 tiger-pbs slurmctld: *** RESTART SLURMDBD NOW ***
and then it continues with constant stream of:
Jun 13 06:04:04 tiger-pbs slurmctld: [2017-06-13T06:03:58.717] error: slurmdbd: agent queue is full, discarding request
....

We only upgraded slurmdbd at 9:36am. There is nothing there around the above time; we have this instead:

[2017-06-13T06:00:00.841] error: id_assoc 1121 doesn't have any tres
[2017-06-13T07:00:00.230] error: id_assoc 1121 doesn't have any tres
[2017-06-13T07:54:50.711] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
update "perseus_event_table" set time_end=1497354340 where time_end=0 and node_name='perseus-r4c2n11';insert into "perseus_event_table" (node_name, state, tres, time_start, reason, reason_uid) values ('perseus-r4c2n11', 100385, '1=28,2=128000', 1497354340, '(null)', 123) on duplicate key update time_end=0;
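Error 1213 is MySQL/MariaDB's deadlock indicator, and the server's standard advice is to retry the transaction. A generic retry sketch is shown below; the helper names are hypothetical and this is not how slurmdbd itself handles the error:

```python
import time

DEADLOCK_ERRNO = 1213  # MySQL/MariaDB "Deadlock found when trying to get lock"

class DeadlockError(Exception):
    """Stand-in for a database driver error carrying errno 1213."""
    errno = DEADLOCK_ERRNO

def run_with_retry(txn, attempts=3, base_delay=0.0):
    """Run a transaction callable, retrying when the server reports a
    deadlock. `txn` is a hypothetical callable standing in for the
    event-table update; backoff grows exponentially between attempts."""
    for attempt in range(attempts):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

A deadlock here is transient contention, not corruption, so a bounded retry is usually sufficient.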
Comment 5 Tim Wickberg 2017-06-13 11:28:44 MDT
(In reply to Josko Plazonic from comment #4)
> Forgot to say that errors started shortly after our maintenance began, first
> one is:
> 
> Jun 13 06:02:04 tiger-pbs slurmctld: [2017-06-13T06:01:58.397] error:
> slurmdbd: agent queue filling, RESTART SLURMDBD NOW
> Jun 13 06:04:03 tiger-pbs slurmctld: *** RESTART SLURMDBD NOW ***
> and then it continues with constant stream of:
> Jun 13 06:04:04 tiger-pbs slurmctld: [2017-06-13T06:03:58.717] error:
> slurmdbd: agent queue is full, discarding request
> ....
> 

I need to see what happened before that; the original failure would have been much earlier than this. Can you attach logs for that time period? If you'd prefer to keep them private feel free to send them direct to me instead.
Comment 6 Tim Wickberg 2017-06-13 11:30:27 MDT
(In reply to Josko Plazonic from comment #3)
> We had errors like:
> 
> update "perseus_event_table" set time_end=1497354340 where time_end=0 and
> node_name='perseus-r4c2n11';insert into "perseus_event_table" (node_name,
> state, tres, time_start, reason, reason_uid) values ('perseus-r4c2n11',
> 100385, '1=28,2=128000', 1497354340, '(null)', 123) on duplicate key update
> time_end=0;
> 
> this morning.  And when I lifted debugging that's all I see - constant
> stream of CLUSTER_event_table updates.  It only stops if I stop slurmctld
> from that cluster and it continues even if I stop all of the slurmds.
> 
> I guess I thought duplicate key is bad and that it is failing to update
> entries but I guess that's not the case.
> 
> We tried downgrading slurm back down to 17.02.2 but does not seem to help.

Removing the event table has likely caused some further damage; reverting the slurmdbd at this point is unlikely to help.

Did you take backup of the database before truncating the event table?
Comment 7 Josko Plazonic 2017-06-13 11:33:49 MDT
Sorry, no.  I have backups from last night (midnight or so) but why should event table matter anyway? Can't we just throw away all events?
Comment 8 Josko Plazonic 2017-06-13 11:35:22 MDT
Also - nothing relevant in slurmdbd logs.  Just complaints about large processing times and that 

[2017-06-12T12:00:00.765] error: id_assoc 1121 doesn't have any tres
Comment 9 Tim Wickberg 2017-06-13 11:48:22 MDT
(In reply to Josko Plazonic from comment #7)
> Sorry, no.  I have backups from last night (midnight or so) but why should
> event table matter anyway? Can't we just throw away all events?

Not safely, no. I don't believe we've ever told anyone to truncate that (or any table), and especially not that it would be safe to do.

To back up a step - can you restart slurmctld on the afflicted cluster, and provide us all the logs from it restarting forwards, along with the corresponding slurmdbd logs?
Comment 10 Josko Plazonic 2017-06-13 11:53:18 MDT
I am going to need more detail here - restarting slurmctld with debug level 10 does not give me much of anything about slurmdbd, just a ton of noise about jobs.  So I probably need to set something specific?  Best if you tell me what to set on both ends.
Comment 11 Tim Wickberg 2017-06-13 11:58:52 MDT
DebugLevel of "debug" should be sufficient for both. Anything further and you'll be seeing quite a lot of noise.
Comment 12 Josko Plazonic 2017-06-13 12:08:34 MDT
Created attachment 4761 [details]
slurmdbd log
Comment 13 Josko Plazonic 2017-06-13 12:08:50 MDT
Created attachment 4763 [details]
slurmctld log
Comment 14 Josko Plazonic 2017-06-13 12:10:16 MDT
[2017-06-13T14:04:07.925] 8(as_mysql_cluster.c:1593) We have the same tres as before for tiger, no need to update the database.
[2017-06-13T14:04:07.925] 8(as_mysql_cluster.c:1615) we have the same nodes in the cluster as before no need to update the database.
[2017-06-13T14:04:16.687] debug:  REQUEST_PERSIST_INIT: CLUSTER:perseus VERSION:7936 UID:922 IP:10.33.11.130 CONN:10
[2017-06-13T14:04:16.726] 10(as_mysql_jobacct_process.c:488) query
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str, t1.cpus_req, t1.derived_ec, t1.derived_es, t1.exit_code, t1.id_array_job, t1.id_array_task, t1.id_assoc, t1.id_block, t1.id_group, t1.id_job, t1.id_qos, t1.id_resv, t3.resv_name, t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.mem_req, t1.node_inx, t1.nodelist, t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.time_eligible, t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey, t1.gres_alloc, t1.gres_req, t1.gres_used, t1.tres_alloc, t1.tres_req, t2.acct, t2.lft, t2.user from "perseus_job_table" as t1 left join "perseus_assoc_table" as t2 on t1.id_assoc=t2.id_assoc left join "perseus_resv_table" as t3 on t1.id_resv=t3.id_resv && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0))) || ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0)) || (t3.time_start > t1.time_submit))) where ((t1.time_start && ((!t1.time_end && t1.state=1) || (1497369856 between t1.time_start and t1.time_end)))) group by id_job, time_submit desc
[2017-06-13T14:04:18.092] error: CONN:8 Failed to unpack DBD_NODE_STATE message
[2017-06-13T14:04:28.157] error: CONN:8 Failed to unpack DBD_NODE_STATE message
[2017-06-13T14:04:38.220] error: CONN:8 Failed to unpack DBD_NODE_STATE message
[2017-06-13T14:04:48.284] error: CONN:8 Failed to unpack DBD_NODE_STATE message
[2017-06-13T14:04:58.121] debug:  REQUEST_PERSIST_INIT: CLUSTER:tiger VERSION:7936 UID:497 IP:172.23.10.10 CONN:12

Maybe this is relevant?
Comment 15 Tim Wickberg 2017-06-13 12:36:40 MDT
[2017-06-13T14:04:06.288] debug:  slurmdbd: PERSIST_RC is -1 from DBD_NODE_STATE(1432): Failed to unpack DBD_NODE_STATE message

Yes, that'd be the source of the problems. Once slurmctld gets stuck trying to send that message across, it blocks all subsequent messages. We need to get that fixed before anything can move on.
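In other words, the flush is strictly ordered, so one unprocessable message at the head stalls everything behind it. A toy sketch of that head-of-line blocking (hypothetical names, not the actual Slurm source):

```python
def flush_pending(pending, send):
    """Toy model of flushing queued DBD messages in order: if the head
    message is rejected (e.g. fails to unpack on the far end), nothing
    behind it can move until the head is fixed or skipped."""
    sent = 0
    while pending:
        if not send(pending[0]):  # head rejected -> stop, retry later
            break
        pending.pop(0)
        sent += 1
    return sent
```

This is why one corrupted DBD_NODE_STATE record is enough to back up the entire agent queue.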
Comment 16 Tim Wickberg 2017-06-13 13:10:16 MDT
Are you able to attach the dbd.messages file from the StateSaveLocation, as well as a much longer chunk of the slurmctld logs?

There's no diagnostic info I can readily have you gather; it'd be easier for us to step through and see how that one message was corrupted.

We're expecting that something earlier in the slurmctld log would pinpoint how that corrupted message was created, but if we have the state file we should be able to get you a workaround and get things mostly back to normal. (Albeit with a missing events table, which may continue to generate some nuisance warning messages.)

- Tim
Comment 17 Tim Wickberg 2017-06-13 13:30:33 MDT
As an alternative path (although we'd still like the info so we can see how this happened, and ensure it cannot recur): if you remove the dbd.messages file and restart slurmctld on that cluster, things should go back to normal, although with some additional loss of accounting info.

But you're actively losing accounting info right now, so the trade-off is whether you'd prefer to wait for us to find a way to skip that errant message and get the info out of that dbd.messages file sent over correctly (while continuing to lose current updates), or throw that file away in favor of getting the current cluster status recorded.

- Tim
Comment 18 Josko Plazonic 2017-06-13 13:55:35 MDT
We were experimenting with exactly this as well. We managed to skip the unpacking error (precisely by moving the dbd file out of the way), but rather quickly we got back to the agent queue being full and messages being discarded.

To make a long story short: we think the problem is that this particular time, on this maintenance, ALL of our nodes were offline.  We rebooted them with reboot_nodes and they never came back because NHC was failing on a filesystem check, and they ended up in states similar to:

none                 root      Unknown             della-r2c4n8,della-r4c4n13

Somehow that made all the slurmctlds bombard slurmdbd with event table updates like the one I mentioned at the beginning of the ticket ("comment 0").  Once we resumed the nodes on 2 of our biggest clusters, things settled and the complaints about the agent queue stopped.

So it seems to us that all of these status/event table updates were serialized, with slurmdbd and MariaDB unable to catch up - i.e. the clusters were sending updates faster than slurmdbd could commit them, so a backlog accumulated.  MySQL was constantly pegged at 100% (i.e. taking 1 CPU).

BTW, nothing is being lost - we are still down, so we don't care about accounting.
Comment 19 Tim Wickberg 2017-06-13 13:58:48 MDT
That unpack error should still not happen - if we can get a copy of the file that gets stuck on that I'd like to trace that out.

In the meantime, increasing MaxJobs will also increase the permitted size of the dbd.messages file, which should help smooth over buffering issues like this. I'd suggest MaxJobs=100000 at a minimum, and if the slurmctld has sufficient memory you can go all the way to a million, at least for now.
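If "MaxJobs" here corresponds to the slurm.conf MaxJobCount parameter (an assumption on our part; the queue and spool limits being derived from the job count cap is also an assumption), the suggested change might look like:

```
# slurm.conf on each cluster's slurmctld (hypothetical value, per the
# suggestion above; the parameter name MaxJobCount is an assumption)
MaxJobCount=100000
```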
Comment 20 Josko Plazonic 2017-06-13 15:13:56 MDT
I can see if I can find which file that was, but I am not too sure it is that significant.  That error did not start these problems - they started right at the beginning of the downtime.  We probably managed to corrupt the file with repeated restarts of slurmctld or something like it.

Regarding the suggested workaround - it sounds to me like that would only increase how much we are caching, and if the remote end still can't catch up it would only postpone the problem. Do you know why so many *_event_table updates are sent, and why so frequently?
Comment 21 Tim Wickberg 2017-06-21 15:19:12 MDT
(In reply to Josko Plazonic from comment #20)
> I can see if I can find which file that was but I am not too sure it is that
> significant.  That error did not begin these problems - they started right
> at the beginning of the downtime.  We probably managed to corrupt the file
> with repeat restarts of slurmctld or something like it.
> 
> Regarding suggested workaround - it sounds to me that that would only
> increase what we are caching and if the remote end still can't catch up it
> would only postpone the problem. Do you know why so many *_event_table
> updates are sent and why so frequently?

We have one open private bug that's similar to this - there does seem to be one odd edge case where all nodes are offline and an unreasonable number of node update messages are created and serialized.

None of that should have led to any invalid message being generated, though. I'm guessing at this point that the state file is long gone, and we won't have a chance to pick through it and sort out how that happened?
Comment 23 Tim Wickberg 2017-06-21 15:21:25 MDT
Lowering the severity here as well, as I believe this was worked around a week ago.
Comment 24 Josko Plazonic 2017-06-22 07:23:37 MDT
Created attachment 4807 [details]
Possibly corrupted dbd.messages
Comment 25 Josko Plazonic 2017-06-22 07:24:57 MDT
I am pretty sure the dbd.messages I just attached is the corrupted one.  

I may well also have some debugging logs for the flood of event updates.  Do you need any of that, i.e. should I look for them? It would definitely be nice to get this fixed...
Comment 30 Danny Auble 2017-06-29 17:04:50 MDT
*** Ticket 3939 has been marked as a duplicate of this ticket. ***
Comment 36 Dominik Bartkiewicz 2017-07-03 05:25:31 MDT
Hi

At the end of last week we found code which could result in a message with a packed msg_type but nothing else.
This commit fixes it and also adds some additional validation during RPC packing:
https://github.com/SchedMD/slurm/commit/3e00ede593d16250



Dominik
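The class of corruption described above (a buffer holding only a msg_type and nothing else) can be illustrated with a minimal pack/validate sketch. The functions are hypothetical and not the actual Slurm RPC code; 1432 is the DBD_NODE_STATE id from the PERSIST_RC error earlier in the ticket:

```python
import struct

DBD_NODE_STATE = 1432  # message id seen in the PERSIST_RC error above

def pack_msg(msg_type, payload):
    """Toy pack: a 2-byte message type followed by the payload bytes."""
    return struct.pack("!H", msg_type) + payload

def validate_packed(buf):
    """Reject a buffer that carries a msg_type but no body - the shape
    of the corruption the fix guards against (per comment 36)."""
    return len(buf) > struct.calcsize("!H")
```

Validating at pack time, as the fix does, keeps a truncated message from ever entering the dbd.messages spool, where it would otherwise block the whole queue.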