Checking slurmdbd we see the following continuous error stream (verbose):

slurmdbd: error: There is no reservation by id 171, time_start 1539754865, and cluster 'dragon'
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_MODIFY_RESV: called
slurmdbd: error: There is no reservation by id 171, time_start 1539754865, and cluster 'dragon'
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: debug2: DBD_MODIFY_RESV: called
slurmdbd: error: There is no reservation by id 171, time_start 1539754865, and cluster 'dragon'
slurmdbd: debug2: DBD_JOB_START: ELIGIBLE CALL ID:13137121 NAME:BBCPU
Comment #11 on bug 2741 describes how to quickly fix this. Please run through those directions immediately; unfortunately, that error message indicates you are already losing some accounting records.
Hi Tim,

Instead of inserting a reservation, I amended the start time of an existing reservation:

MariaDB [slurm_acct_db]> select id_resv,deleted,resv_name,time_start,time_end from dragon_resv_table where id_resv = 171 order by time_start;
+---------+---------+-------------------+------------+------------+
| id_resv | deleted | resv_name         | time_start | time_end   |
+---------+---------+-------------------+------------+------------+
|     171 |       0 | MAINTENANCE_OCT18 | 1539345600 | 1539347156 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539347156 | 1539347159 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539347159 | 1539500607 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539500607 | 1539623740 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539623740 | 1539623741 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539623741 | 1539623805 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539623805 | 1539626391 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539626391 | 1539665777 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539665777 | 1539692244 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539692244 | 1539701584 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539701584 | 1539703720 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539703720 | 1539703721 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539703721 | 1539703747 |
|     171 |       0 | MAINTENANCE_OCT18 | 1539703747 | 1541019600 |
+---------+---------+-------------------+------------+------------+
14 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> update dragon_resv_table set time_start = 1539754865 where time_start = 1539703747 and id_resv = 171;
Query OK, 1 row affected (0.03 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [slurm_acct_db]> . . .

Once this was done, entries were being added to the database and the DBD agent queue size is dropping.

Was:

DBD Agent queue size: 8197776

A couple of minutes later:

DBD Agent queue size: 8154080

As we are in a maintenance session, it seems that virtually all of the pending inserts were node up/down messages.
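As a side note, the two queue samples above give a rough idea of how long the backlog will take to clear. A sketch, assuming roughly 120 seconds between the two sdiag readings ("a couple of minutes later" — the exact interval was not recorded):

```shell
# Rough DBD agent queue drain estimate from the two sdiag samples above.
# The 120 s sampling interval is an assumption.
q1=8197776    # first "DBD Agent queue size" reading
q2=8154080    # second reading, a couple of minutes later
interval=120  # assumed seconds between the two readings
rate=$(( (q1 - q2) / interval ))   # records drained per second
eta_min=$(( q2 / rate / 60 ))      # minutes until the backlog clears
echo "draining ~${rate} records/s, backlog empty in ~${eta_min} min"
```

At that pace the queue would need several more hours to empty, which matches the incremental drops reported below.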
Do you know how long ago that maintenance reservation was created, and what versions of slurmdbd/slurmctld were running at that point?
The maintenance reservation was created several months ago (we have to plan these long in advance). The Slurm version was 17.

It also failed again (messages started queuing up). I implemented the fix you suggested on the other ticket:

MariaDB [slurm_acct_db]> insert into dragon_resv_table (id_resv, deleted, time_start, resv_name) values (171, 0, 1539794255, 'xxxbugxxx');
Query OK, 1 row affected (0.06 sec)

This seems to have fixed the issue (again):

$ sdiag | grep DBD
DBD Agent queue size: 8182880
$ sdiag | grep DBD
DBD Agent queue size: 8179716

-Greg
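For anyone hitting the same error, the INSERT above can be derived mechanically from the slurmdbd error line. A hedged sketch: the `build_resv_insert` helper and the 'xxxbugxxx' placeholder reservation name are my own, and the table name assumes a cluster called 'dragon' (each cluster gets its own `<cluster>_resv_table`).

```shell
# Build the workaround INSERT from the id and time_start reported in
# "error: There is no reservation by id X, time_start Y, and cluster ...".
# build_resv_insert is a hypothetical helper; review its output, then
# feed it to mysql against slurm_acct_db if it looks right.
build_resv_insert() {
    id_resv=$1
    time_start=$2
    printf "insert into dragon_resv_table (id_resv, deleted, time_start, resv_name) values (%s, 0, %s, 'xxxbugxxx');\n" \
        "$id_resv" "$time_start"
}

build_resv_insert 171 1539794255
```

This only prints the statement; it deliberately does not touch the database itself.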
$ sdiag | grep DBD
DBD Agent queue size: 8000060
Hi Greg. Did the error stream start appearing after someone requested a reservation update? If so, do you have the scontrol command from that request? If the reservation was created 17 days ago, I find it odd that the first time_start for that reservation is 1539345600 (6 days ago). Can you attach the output of 'sacctmgr show resv'?
Can you also show 'scontrol show resv'? I'd like to know as much info as possible about the afflicted reservation, including flags.
Created attachment 8056 [details] Show reservation
(In reply to Alejandro Sanchez from comment #7)
> Hi Greg. Did the error stream start appearing since anyone requested a
> reservation update?

Our system went into maintenance (full shutdown) on Friday 12th October. Then over the next couple of days Slurm was upgraded (among other things). Initially slurmdbd was working fine. The only changes to reservations were:

- MAINTENANCE_OCT18 - was extended (foresight perhaps - we ran over the initial maintenance window)
- VCC_WONKA - had more nodes added to it

> If so, do you have such scontrol request command?

No.

> If the reservation was created 17 days ago, I find it odd that the first
> time_start for such reservation is 1539345600 (6 days ago). Can you attach
> the output of 'sacctmgr show resv'?

Will do.
Created attachment 8057 [details] Output of sacctmgr show res
The slurmdbd queue is empty:

DBD Agent queue size: 0
Thanks for the information. I can reproduce this; we'll get back to you.
I see 18.08.3 is out. Should we upgrade?

Will upgrading from 18.08.1 to 18.08.3 impact running jobs? (I expect slurmdbd and slurmctld will be simple.)

Will an 18.08.1 slurmd talk to an 18.08.3 slurmctld? If we restart slurmd on the nodes, will an 18.08.1 slurmstepd talk to 18.08.3 daemons?

thanks,

-g
(In reply to Greg Wickham from comment #35)
> I see 18.08.3 is out.

Right. We decided to tag earlier than expected due to a regression discovered while working on this bug. This is the fix:

https://github.com/SchedMD/slurm/commit/ea71e10d3acc2ffff06e9ede10848a09b

> Should we upgrade?

Yes, as soon as possible.

> Will upgrading from 18.08.1 to 18.08.3 impact running jobs?

It shouldn't.

> (I expect slurmdbd and slurmctld will be simple).
>
> (Will a 18.08.1 slurmd talk to an 18.08.3 slurmctld?)

Yes. Please follow the web guidelines:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

> If we restart slurmd on the nodes will an 18.08.1 slurmstepd talk to 18.08.3
> daemons?

Yes.

> thanks,
>
> -g

Thanks for reporting the issue.
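The upgrade guidelines linked above boil down to a fixed ordering: back up the accounting database, upgrade slurmdbd first, then slurmctld, then the slurmd daemons. A dry-run sketch of that plan; the `print_upgrade_plan` name is my own, and the actual package/service commands vary per site:

```shell
# Recommended Slurm upgrade ordering per the quickstart_admin guide:
# DB dump first, then slurmdbd, then slurmctld, then slurmd on nodes.
# This only prints the plan; replace each step with your site's real
# package-manager and service commands.
print_upgrade_plan() {
    echo "1. dump slurm_acct_db (e.g. mysqldump) before touching anything"
    echo "2. stop slurmdbd, upgrade it, start slurmdbd (schema migrates here)"
    echo "3. stop slurmctld, upgrade it, start slurmctld"
    echo "4. rolling restart: upgrade and restart slurmd on each node"
}

print_upgrade_plan
```

The ordering matters because newer daemons can talk to older ones within the supported version window, but not the other way around, which is also why an 18.08.1 slurmd can keep talking to an 18.08.3 slurmctld during the roll-out.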