Ticket 5875 - error: slurmdbd: agent queue is full (8197776), discarding DBD_NODE_STATE:1432 request
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 18.08.1
Hardware: Linux
Severity: 2 - High Impact
Assignee: Alejandro Sanchez
Reported: 2018-10-17 22:43 MDT by Greg Wickham
Modified: 2018-10-25 03:11 MDT

Site: KAUST
Version Fixed: 17.11.12 18.08.3 19.05.0pre2


Attachments
Show reservation (2.90 KB, application/x-gzip)
2018-10-18 05:25 MDT, Greg Wickham
Output of sacctmgr show res (525 bytes, application/x-gzip)
2018-10-18 05:29 MDT, Greg Wickham

Description Greg Wickham 2018-10-17 22:43:36 MDT
Checking slurmdbd, we see the following continuous error stream (with verbose logging):

slurmdbd: error: There is no reservation by id 171, time_start 1539754865, and cluster 'dragon'
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_MODIFY_RESV: called
slurmdbd: error: There is no reservation by id 171, time_start 1539754865, and cluster 'dragon'
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_MODIFY_RESV: called
slurmdbd: error: There is no reservation by id 171, time_start 1539754865, and cluster 'dragon'
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
Comment 1 Tim Wickberg 2018-10-17 22:45:40 MDT
Comment #11 on bug 2741 describes how to quickly fix this.

Please run through those directions immediately; that error message indicates, unfortunately, that you are already losing some accounting records.
Comment 2 Greg Wickham 2018-10-17 22:51:36 MDT
Hi Tim,

Instead of inserting a reservation, I amended the start time of an existing reservation:

MariaDB [slurm_acct_db]> select id_resv,deleted,resv_name,time_start,time_end from dragon_resv_table where id_resv = 171 order by time_start;
+---------+---------+-------------------+------------+------------+
| id_resv | deleted | resv_name         | time_start | time_end   |
+---------+---------+-------------------+------------+------------+
|     171 |       0 | MAINTENANCE_OCT18 | 1539345600 | 1539347156 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539347156 | 1539347159 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539347159 | 1539500607 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539500607 | 1539623740 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539623740 | 1539623741 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539623741 | 1539623805 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539623805 | 1539626391 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539626391 | 1539665777 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539665777 | 1539692244 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539692244 | 1539701584 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539701584 | 1539703720 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539703720 | 1539703721 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539703721 | 1539703747 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539703747 | 1541019600 |
+---------+---------+-------------------+------------+------------+
14 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> update dragon_resv_table set time_start = 1539754865 where time_start = 1539703747 and id_resv = 171;
Query OK, 1 row affected (0.03 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [slurm_acct_db]> 

. . . 

Once this was done, entries were being added to the database again.

The DBD agent queue size is dropping.

Was: 
DBD Agent queue size: 8197776

A couple of minutes later:
DBD Agent queue size: 8154080


As we are in a maintenance session, it seems that virtually all of the pending inserts were node up/down messages.
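A drain like this can be watched without eyeballing sdiag repeatedly; here is a minimal sketch that parses the `DBD Agent queue size: N` line shown above (the warning threshold is arbitrary, and in practice the sample text would come from running `sdiag`):

```shell
#!/bin/sh
# Extract the DBD agent queue size from sdiag-style output and flag a backlog.
queue_size() {
    # $1: text containing a "DBD Agent queue size: N" line
    printf '%s\n' "$1" | awk -F': *' '/DBD Agent queue size/ {print $2}'
}

sample='DBD Agent queue size: 8154080'   # value taken from this ticket
size=$(queue_size "$sample")

# Warn once the backlog passes an (arbitrary) threshold.
if [ "$size" -gt 1000000 ]; then
    echo "WARN: DBD agent queue backlog: $size"
fi
```

Wired up to `sdiag | grep DBD` in a cron job or monitoring check, this would catch a stuck agent queue before records start being discarded.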
Comment 3 Tim Wickberg 2018-10-17 22:54:21 MDT
Do you know how long ago that maintenance reservation was created, and what version of slurmdbd/slurmctld was running at that point?
Comment 5 Greg Wickham 2018-10-18 01:26:38 MDT
The maintenance reservation was created several months ago (we have to plan these long in advance). The Slurm version at that time was 17.

It then failed again (messages started queuing up), so I implemented the fix you suggested on the other ticket:

MariaDB [slurm_acct_db]> insert into dragon_resv_table (id_resv, deleted, time_start, resv_name) values (171 , 0, 1539794255, 'xxxbugxxx');
Query OK, 1 row affected (0.06 sec)


This seems to have fixed the issue (again).

$ sdiag  | grep DBD
DBD Agent queue size: 8182880

$ sdiag  | grep DBD
DBD Agent queue size: 8179716


   -Greg
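The workaround above can be parameterized for the next occurrence; a hedged sketch, using the table name, reservation id, placeholder name, and timestamp exactly as they appear in this ticket (all of which would need adjusting for another cluster or reservation):

```shell
#!/bin/sh
# Build the workaround INSERT (from bug 2741 comment 11) for a given
# missing (id_resv, time_start) pair; the output can be piped into mysql.
workaround_sql() {
    # $1: reservation id   $2: time_start from the error message
    printf "insert into dragon_resv_table (id_resv, deleted, time_start, resv_name) values (%d, 0, %d, 'xxxbugxxx');\n" "$1" "$2"
}

workaround_sql 171 1539794255
```

Note this only papers over the missing row so slurmdbd can resume flushing its agent queue; the underlying bug is fixed in the releases listed under "Version Fixed".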
Comment 6 Greg Wickham 2018-10-18 01:38:48 MDT
$ sdiag | grep DBD
DBD Agent queue size: 8000060
Comment 7 Alejandro Sanchez 2018-10-18 05:02:54 MDT
Hi Greg. Did the error stream start appearing after someone requested a reservation update?

If so, do you have the exact scontrol command that was used?

If the reservation was created 17 days ago, I find it odd that the first time_start for that reservation is 1539345600 (6 days ago). Can you attach the output of 'sacctmgr show resv'?
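For reference, the epoch values under discussion decode as follows; a quick sketch using GNU date syntax (on BSD/macOS, `date -u -r "$ts"` would be needed instead), with the timestamps taken from the comments above:

```shell
#!/bin/sh
# Decode the reservation epoch timestamps from this ticket into UTC dates.
# 1539345600: first time_start in dragon_resv_table
# 1539703747: time_start amended in comment 2
# 1539754865: time_start from the slurmdbd error message
for ts in 1539345600 1539703747 1539754865; do
    date -u -d "@$ts" '+%s = %Y-%m-%d %H:%M:%S UTC'
done
```

The first value lands on 12 October, which lines up with the full-shutdown maintenance start mentioned in comment 10.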
Comment 8 Alejandro Sanchez 2018-10-18 05:09:14 MDT
Can you also show 'scontrol show resv'? I'd like to know as much info as possible about the affected reservation, including its flags.
Comment 9 Greg Wickham 2018-10-18 05:25:35 MDT
Created attachment 8056 [details]
Show reservation
Comment 10 Greg Wickham 2018-10-18 05:29:25 MDT
(In reply to Alejandro Sanchez from comment #7)
> Hi Greg. Did the error stream start appearing after someone requested a
> reservation update?

Our system went into maintenance (full shutdown) on Friday 12th October. Over the next couple of days Slurm was upgraded (among other things). Initially slurmdbd was working fine.

The only changes to reservations were:

 - MAINTENANCE_OCT18 - was extended (with foresight, perhaps - we ran over the initial maintenance window)

 - VCC_WONKA - Had more nodes added to it.

> 
> If so, do you have the exact scontrol command that was used?

No.

> 
> If the reservation was created 17 days ago, I find it odd that the first
> time_start for that reservation is 1539345600 (6 days ago). Can you attach
> the output of 'sacctmgr show resv'?

will do.
Comment 11 Greg Wickham 2018-10-18 05:29:56 MDT
Created attachment 8057 [details]
Output of sacctmgr show res
Comment 12 Greg Wickham 2018-10-18 05:30:16 MDT
slurmdbd queue is empty:

DBD Agent queue size: 0
Comment 13 Alejandro Sanchez 2018-10-18 06:14:55 MDT
Thanks for the information. I can reproduce this; we'll get back to you.
Comment 35 Greg Wickham 2018-10-24 23:12:08 MDT
I see 18.08.3 is out.

Should we upgrade?

Will upgrading from 18.08.1 to 18.08.3 impact running jobs?

(I expect slurmdbd and slurmctld will be simple).

(Will a 18.08.1 slurmd talk to an 18.08.3 slurmctld?)

If we restart slurmd on the nodes, will an 18.08.1 slurmstepd talk to the 18.08.3 daemons?

thanks,

   -g
Comment 37 Alejandro Sanchez 2018-10-25 02:41:54 MDT
(In reply to Greg Wickham from comment #35)
> I see 18.08.3 is out.

Right. We decided to tag earlier than expected due to a regression discovered while working on this bug. This is the fix:

https://github.com/SchedMD/slurm/commit/ea71e10d3acc2ffff06e9ede10848a09b
 
> Should we upgrade?

Yes, as soon as possible.
 
> Will upgrading from 18.08.1 to 18.08.3 impact running jobs?

It shouldn't.
 
> (I expect slurmdbd and slurmctld will be simple).
> 
> (Will a 18.08.1 slurmd talk to an 18.08.3 slurmctld?)

Yes. Please follow the upgrade guidelines on the website:

https://slurm.schedmd.com/quickstart_admin.html#upgrade
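Per those guidelines, the order matters: slurmdbd is upgraded first, then slurmctld, then the slurmds on the compute nodes. A trivial sketch of the sequence (the actual stop/upgrade/start commands depend on your packaging and service management):

```shell
#!/bin/sh
# Rolling-upgrade order from the Slurm upgrade guide: database daemon first,
# then the controller, then the compute-node daemons.
for daemon in slurmdbd slurmctld slurmd; do
    # Placeholder: stop $daemon, install the 18.08.3 packages, restart it,
    # and verify (logs, `scontrol version`) before moving to the next daemon.
    echo "upgrade step: $daemon"
done
```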
 
> If we restart slurmd on the nodes will an 18.08.1 slurmstepd talk to 18.08.3
> daemons?

Yes.
 
> thanks,
> 
>    -g

Thanks for reporting the issue.