Ticket 4847

Summary:	packmem: Buffer size limit exceeded
Product:	Slurm	Reporter:	Sophie Créno <sophie.creno>
Component:	Database	Assignee:	Felip Moll <felip.moll>
Status:	RESOLVED FIXED	QA Contact:
Severity:	2 - High Impact
Priority:	---	CC:	felip.moll, vlegrand
Version:	- Unsupported Older Versions
Hardware:	Linux
OS:	Linux
Site:	Institut Pasteur	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	17.11, 18.08
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	Excerpt of slurmdbd logs with most of duplicated messages expunged Excerpt of slurmctld when the initial slurmdbd crash occured runaway jobs sdiag output

Description Sophie Créno 2018-03-01 07:30:32 MST

Hello,

We are currently in 16.05.9 and are going to upgrade to 17.11.3 next week.
Given our slurmdbd configuration:

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes

PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=1months
PurgeSuspendAfter=12months

slurmdbd tried to create its archive and purge the database at midnight.
After a deluge of

error: packmem: Buffer size limit exceeded (4294914810 > 4294901760)

it finally crashed and the base was corrupted. We add to restore it, but
since it was containing the same data to archive and purge, slurmdbd
behaved the same way when we restarted it. To be able to restore
the service, we had to postpone the archive/purge in the config file:

PurgeEventAfter=14months
PurgeJobAfter=14months
PurgeResvAfter=14months
PurgeStepAfter=1months
PurgeSuspendAfter=14months

Could you confirm that it comes from the purge step and tell us what we
should do to avoid such problem in the future? Is it a bug?
* If yes, does the upgrade to 17.11.3 correct it and can we let it try
to purge the database during the upgrade?
* if it's not, I guess we would have to purge manually. Do you have recommendations for that?

I imagine that we should change our slurmdbd config to avoid the problem
next time. If it's the size of the data to purge (1 month of data) that
is responsible for the crash as it seems, what parameters should we put?
Creating an archive for every single day doesn't seem reasonable.

Thanks in advance,

Comment 1 Felip Moll 2018-03-01 08:15:37 MST

Hi Sophie,

Could you attach slurmctld and slurmdbd logs in order to investigate the problem and give you a proper answer?

Is the system currently working? If it is I will set this bug to sev-2 since sev-1 means the system is not usable and the site stopped production.

Thanks

Comment 2 Sophie Créno 2018-03-01 08:20:53 MST

With our work around, we were able to restart the service so I downgrade to 
severity -2.

Comment 3 Sophie Créno 2018-03-01 08:55:51 MST

Created attachment 6263 [details]
Excerpt of slurmdbd logs with most of duplicated messages expunged

Comment 4 Sophie Créno 2018-03-01 08:57:03 MST

Created attachment 6264 [details]
Excerpt of slurmctld when the initial slurmdbd crash occured

Comment 5 Felip Moll 2018-03-01 13:06:11 MST

Sophie,

I need some more info.

First)

I would need to know the tune parameters of the database.

I see many ER_LOCK_WAIT_TIMEOUT in the logs coming from MySQL. This error could happen if some other queries are being performed while doing the purge/archive/rollup.

I am specially interested in:

innodb_buffer_pool_size
innodb_log_file_size
innodb_lock_wait_timeout

There are some basic tuning guidelines documented here: http://slurm.schedmd.com/accounting.html

Second)

Running low on disk space can also cause similar issues, which is the status of your database storage? And which is the underlying hardware that supports it (ssd, mechanical disks, lun?).

Third)

I see "We have more allocated time than is possible" errors, I suppose it happened only when the purge made everything crash, but just to be sure, can you run and send me the output of:

sacctmgr show runaway

Fourth)

How many records does your database have? How big it is? And specially how many records for last month?

select count(*) from tars_step_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');

The archives and purges goes in the order: event, suspend, step, job, resv. As you guessed it seems from your logs that it failed in tars_step_table, so this table has the numbers to be the cause of the issue.

Probably doing consecutive purges would do the trick:

Set:
PurgeStepAfter=12months

and force the rollup.

Then:
PurgeStepAfter=9months

Then:
PurgeStepAfter=6months

Then:
PurgeStepAfter=3months

and so on...

But first let me see your numbers.

Comment 6 Sophie Créno 2018-03-02 07:17:53 MST

Created attachment 6280 [details]
runaway jobs

Hello,

> First)
> 
> I would need to know the tune parameters of the database.
> 
> I see many ER_LOCK_WAIT_TIMEOUT in the logs coming from MySQL. 
> This error could happen if some other queries are being performed
> while doing the purge/archive/rollup.
> 
> I am specially interested in:
> 
> innodb_buffer_pool_size
> innodb_log_file_size
> innodb_lock_wait_timeout
> 
> There are some basic tuning guidelines documented here:
> http://slurm.schedmd.com/accounting.html
Yes sorry, I should have expected that question.

innodb_buffer_pool_size = 24075M
long_query_time = 100
innodb_log_file_size = 128M
innodb-lock-wait-timeout = 600


> Second)
> 
> Running low on disk space can also cause similar issues, which is 
> the status of your database storage? And which is the underlying
> hardware that supports it (ssd, mechanical disks, lun?).
$ df -h  /var/lib/mysql/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         79G   40G   35G  54% /

The machine is a VM and the disk is a SAN (EMC Unity). The downtime
next week is among other things to replace the VM by a bare metal machine.


> Third)
> 
> I see "We have more allocated time than is possible" errors, I suppose it
> happened only when the purge made everything crash, but just to be sure, 
> can you run and send me the output of:
> 
>   sacctmgr show runaway
We were not surprised of these messages given the downtime due to the crash.
So yes they are related to the runaway jobs. I did a 
sacctmgr show runaway
just after the daemons restarted. There were only few of them. I hadn't 
cleaned them just in case it could be of any use to you. Now there are
a lot. See the attachment.


> Fourth)
> 
> How many records does your database have? How big it is? And specially
> how many records for last month?
> 
> select count(*) from tars_step_table where time_start <=
> UNIX_TIMESTAMP('2018-02-01 00:00:00');
> 
> The archives and purges goes in the order: event, suspend, step, job, 
> resv.  As you guessed it seems from your logs that it failed in 
> tars_step_table, so this table has the numbers to be the cause of
> the issue.
I would have expected a bigger figure:

mysql> select count(*) from tars_step_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
|      128 |
+----------+

Thanks,

Comment 7 Felip Moll 2018-03-05 09:52:27 MST

(In reply to Sophie Créno from comment #6)
> innodb_buffer_pool_size = 24075M
> long_query_time = 100
> innodb_log_file_size = 128M
> innodb-lock-wait-timeout = 600
> 

With the info I see your parameters looks good to me.

> > Second)
> The machine is a VM and the disk is a SAN (EMC Unity). The downtime
> next week is among other things to replace the VM by a bare metal machine.

Understood.
 
> > Third)
>   sacctmgr show runaway
> We were not surprised of these messages given the downtime due to the crash.
> So yes they are related to the runaway jobs. I did a 
> sacctmgr show runaway
> just after the daemons restarted. There were only few of them. I hadn't 
> cleaned them just in case it could be of any use to you. Now there are
> a lot. See the attachment.

I need some more time to analyze how to deal with this jobs, please don't delete them for now.

> > Fourth)
> I would have expected a bigger figure:

Yes, me too.

Can you show the results for the other tables?: event, suspend, step, job, resv
 

Thanks

Comment 8 Felip Moll 2018-03-06 05:07:41 MST

> (In reply to Sophie Créno from comment #6)
> >   sacctmgr show runaway
> > We were not surprised of these messages given the downtime due to the crash.
> > So yes they are related to the runaway jobs. I did a 
> > sacctmgr show runaway
> > just after the daemons restarted. There were only few of them. I hadn't 
> > cleaned them just in case it could be of any use to you. Now there are
> > a lot. See the attachment.

Well, the first few of them, after looking in deep to logs and code seems to be related to steps that couldn't be completed while the database was crashing, i.e:

2018-03-01T10:09:14+01:00 tars-acct.it.pasteur.fr slurmdbd[4974]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012update "tars_step_table" set time_end=1519874593, state=3, kill_requid=-1, exit_code=0, user_sec=0, user_usec=0, sys_sec=0, sys_usec=3999, max_disk_read=0.000000, max_disk_read_task=65534, max_disk_read_node=4294967294, ave_disk_read=0.000000, max_disk_write=0.000000, max_disk_write_task=65534, max_disk_write_node=4294967294, ave_disk_write=0.000000, max_vsize=224856, max_vsize_task=0, max_vsize_node=0, ave_vs
2018-03-01T10:09:14+01:00 tars-acct.it.pasteur.fr slurmdbd[4974]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

This query is done in "as_mysql_step_complete" and if it fails it can cause the runaway jobs.

I don't care about this first ones, as you said.

--

In what regards to the other thousands of runaway jobs, they seems to be the job steps that couldn't be successfully purged. Note that there are at least 51173 entries what would tell us there were probably too much job steps to be purged at a time.


About the database, it wasn't clear to me... from here:

> it finally crashed and the base was corrupted. We add to restore it, but
> since it was containing the same data to archive and purge, slurmdbd 
> behaved the same way when we restarted it. To be able to restore 
> the service, we had to postpone the archive/purge in the config file:

is this the sequence you followed?:

1. DB Crashed
2. Restore database mysqldump
3. DB Crash again
4. Modify config file
5. Start DBD

If so, between 3 and 4 there should have been a mysql restore. Is it possible you're running with a corrupted DB?

And more, did you run a *fix* of runaway jobs or just shown them? It's odd that time start is equal to time end, this would mean you tried a fix.

> Can you show the results for the other tables?: event, suspend, step, job,resv

I guess that your low select count from tars_step_table is due because you are running with the 'crashed' database that incompletely purged most of the step entries?


Before upgrading we need to fix this database lost jobs, then see if everything is working fine and then purge it in smaller steps.

Comment 9 Felip Moll 2018-03-06 05:18:17 MST

> And more, did you run a *fix* of runaway jobs or just shown them? It's odd
> that time start is equal to time end, this would mean you tried a fix.

I misspoke this, state seen is 'pending' so time_start = time_end is expected and you did not fix it.

Please, answer my previous questions and then we will try to fix this runaway jobs.

Comment 10 Sophie Créno 2018-03-06 08:13:08 MST

> In what regards to the other thousands of runaway jobs, they seems to be the
> job steps that couldn't be successfully purged. Note that there are at least
> 51173 entries what would tell us there were probably too much job steps to 
> be purged at a time.
> 
> About the database, it wasn't clear to me... from here:
> 
>> it finally crashed and the base was corrupted. We add to restore it, but
>> since it was containing the same data to archive and purge, slurmdbd 
>> behaved the same way when we restarted it. To be able to restore 
>> the service, we had to postpone the archive/purge in the config file:
> 
> is this the sequence you followed?:
> 
> 1. DB Crashed
> 2. Restore database mysqldump
> 3. DB Crash again
Stopped before crashing

> 4. Modify config file
> 5. Start DBD
Yes.

> If so, between 3 and 4 there should have been a mysql restore. Is it possible
> you're running with a corrupted DB?
Maybe. We stopped it before a new crash when we saw mysql reaching almost 100%
of CPU and the 

error: packmem: Buffer size limit exceeded

in slurmdbd logs. But you are probably right. Even if we hadn't 

InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-innodb-recovery.html
InnoDB: for more information.

but only :

180301 12:31:17 [Note] /usr/libexec/mysqld: Normal shutdown

180301 12:31:17 [Note] Event Scheduler: Purging the queue. 0 events
180301 12:31:17  InnoDB: Starting shutdown...
180301 12:31:21  InnoDB: Shutdown completed; log sequence number 5 334554841
180301 12:31:21 [Note] /usr/libexec/mysqld: Shutdown complete

180301 12:31:21 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
180301 12:31:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
180301 12:31:23  InnoDB: Initializing buffer pool, size = 23.5G
180301 12:31:25  InnoDB: Completed initialization of buffer pool
180301 12:31:25  InnoDB: Started; log sequence number 5 334554841
180301 12:31:25 [Note] Event Scheduler: Loaded 0 events
180301 12:31:25 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution

in mysqld logs, we should have have set

innodb_force_recovery=1

/etc/my.cnf nevertheless.


>> Can you show the results for the other tables?: event, suspend, step, job,resv

mysql> select count(*) from tars_event_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
|     9950 |
+----------+

mysql> select count(*) from tars_job_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
| 18761513 |
+----------+

mysql> select count(*) from tars_resv_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
|       18 |
+----------+

We don't use suspension.

 
> I guess that your low select count from tars_step_table is due because you are
> running with the 'crashed' database that incompletely purged most of the step
> entries?
> 
> Before upgrading we need to fix this database lost jobs, then see if 
> everything is working fine and then purge it in smaller steps.
  Ok. We are repopulating a clone database with the dump of last night. We will
apply your procedure on it and when everything seems fine, we will reproduce
the procedure on the production on. Maybe we could wait until the downtime
(no job submitted nor running) to apply the procedure on the production db?

Thanks,

Comment 11 Felip Moll 2018-03-06 10:53:13 MST

> > 3. DB Crash again
> Stopped before crashing

At this point it would probably have been better to restore everything again and started fresh with modified configuration.

 > InnoDB: Your database may be corrupt or you may have copied the InnoDB

This is not good, but I hope MySQL have solved the problem automatically.

> mysql> select count(*) from tars_job_table where time_start <=
> UNIX_TIMESTAMP('2018-02-01 00:00:00');
> +----------+
> | count(*) |
> +----------+
> | 18761513 |
> +----------+

This is a big number and must probably been the cause to this issue, aside also from step table.

> Ok. We are repopulating a clone database with the dump of last night. We
> will apply your procedure on it and when everything seems fine, we will reproduce
> the procedure on the production on. Maybe we could wait until the downtime
> (no job submitted nor running) to apply the procedure on the production db?

Good, I think it is the safest plan.

Waiting until the downtime is something I was gonna suggest also. Reducing the possible calls and interaction with the database is worth to do.
There's another option but less safe and is to change the slurmdbd port to avoid communication while it is archiving.

In order to force the archive on your test bed, you can use 'sacctmgr archive dump' or set the last_ran_table to now-1:

update <yourcluster>_last_ran_table SET
    hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'), 
    daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');

Tell me how it looks like.

Comment 12 Felip Moll 2018-03-06 12:06:14 MST

(In reply to Felip Moll from comment #11)
> > > 3. DB Crash again
> > Stopped before crashing
> 
> At this point it would probably have been better to restore everything again
> and started fresh with modified configuration.
> 
>  > InnoDB: Your database may be corrupt or you may have copied the InnoDB
> 
> This is not good, but I hope MySQL have solved the problem automatically.
> 
> > mysql> select count(*) from tars_job_table where time_start <=
> > UNIX_TIMESTAMP('2018-02-01 00:00:00');
> > +----------+
> > | count(*) |
> > +----------+
> > | 18761513 |
> > +----------+
> 
> This is a big number and must probably been the cause to this issue, aside
> also from step table.
> 
> > Ok. We are repopulating a clone database with the dump of last night. We
> > will apply your procedure on it and when everything seems fine, we will reproduce
> > the procedure on the production on. Maybe we could wait until the downtime
> > (no job submitted nor running) to apply the procedure on the production db?
> 
> Good, I think it is the safest plan.
> 
> Waiting until the downtime is something I was gonna suggest also. Reducing
> the possible calls and interaction with the database is worth to do.
> There's another option but less safe and is to change the slurmdbd port to
> avoid communication while it is archiving.
> 
> In order to force the archive on your test bed, you can use 'sacctmgr
> archive dump' or set the last_ran_table to now-1:
> 
> update <yourcluster>_last_ran_table SET
>     hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'), 
>     daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
>     monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');
> 
> Tell me how it looks like.

Sophie, there's a possibility that you have hit a bug and runaway jobs aren't really runaway ones. Will inform you when I've more info.

In the meantime it is still worth to continue with the commented plan, so keep going.

Comment 13 Sophie Créno 2018-03-07 04:07:02 MST

Hello,

I tried sacctmgr archive dump on the test database and whatever I do 
I always have the same message: Would you like to commit changes?

Example:
sacctmgr archive dump jobs PurgejobAfter=12months directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives
sacctmgr: SUCCESS
Would you like to commit changes? (You have 30 seconds to decide)

It does nothing and the directory remains empty.


When I started the slurmdbd, it didn't tried to archive/purge even if I had
restored the initial configuration:

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes

PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=1months
PurgeSuspendAfter=12months


We still have the jobs in the db:

MariaDB [slurm_acct_db]> select count(*) from tars_job_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');                         
+----------+
| count(*) |
+----------+
| 18761519 |
+----------+


The number of runaway jobs has increased compare to last request and correspond
to the figure obtain from the production db.

Comment 14 Sophie Créno 2018-03-07 07:08:48 MST

I've also done

> update <yourcluster>_last_ran_table SET
>     hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'), 
>     daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
>     monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');

but even after a restart of slurmdbd, the archive/purge isn't triggered.
Did I miss something?

About the runaway jobs, I forgot to mention that we have the following
message:
sacctmgr: slurmdbd: reopening connection
sacctmgr: slurmdbd: reopening connection
sacctmgr: slurmdbd: reopening connection
sacctmgr: slurmdbd: reopening connection
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

Should we adjust something? 

Thanks,

Comment 15 Felip Moll 2018-03-07 09:35:14 MST

(In reply to Sophie Créno from comment #14)
> I've also done
> 
> > update <yourcluster>_last_ran_table SET
> >     hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'), 
> >     daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
> >     monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');
> 
> but even after a restart of slurmdbd, the archive/purge isn't triggered.
> Did I miss something?
> 
> About the runaway jobs, I forgot to mention that we have the following
> message:
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
> sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
> 
> Should we adjust something? 
> 
> Thanks,

Can you start slurmdbd in debug level (slurmdbd -Dvvv), try to purge jobs, and send me back slurmdbd.log?

I assume you are confirming purge when using sacctmgr (responding yes).

Comment 16 Sophie Créno 2018-03-07 10:23:22 MST

Here is what I have obtained. When I saw that there was nothing to roll up, I changed
from 12 months to 10 bit I still have the same output from slurmdbd:

slurmdbd: debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: It appears the table conversions have already taken place, hooray!
slurmdbd: Accounting storage MYSQL plugin loaded
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug2: ArchiveDir        = /pasteur/services/policy01/slurm/tmp/test_archive/archives
slurmdbd: debug2: ArchiveScript     = (null)
slurmdbd: debug2: AuthInfo          = (null)
slurmdbd: debug2: AuthType          = auth/munge
slurmdbd: debug2: CommitDelay       = 0
slurmdbd: debug2: DbdAddr           = tars-pacct
slurmdbd: debug2: DbdBackupHost     = (null)
slurmdbd: debug2: DbdHost           = tars-pacct
slurmdbd: debug2: DbdPort           = 6819
slurmdbd: debug2: DebugFlags        = (null)
slurmdbd: debug2: DebugLevel        = 6
slurmdbd: debug2: DefaultQOS        = (null)
slurmdbd: debug2: LogFile           = (null)
slurmdbd: debug2: MessageTimeout    = 300
slurmdbd: debug2: PidFile           = /var/run/slurmdbd.pid
slurmdbd: debug2: PluginDir         = /pasteur/services/policy01/slurm/tmp/test_archive/install/slurm-16.05/lib/slurm
slurmdbd: debug2: PrivateData       = none
slurmdbd: debug2: PurgeEventAfter   = 10 months*
slurmdbd: debug2: PurgeJobAfter     = 10 months*
slurmdbd: debug2: PurgeResvAfter    = 10 months*
slurmdbd: debug2: PurgeStepAfter    = 1 months*
slurmdbd: debug2: PurgeSuspendAfter = 10 months*
slurmdbd: debug2: SlurmUser         = slurm(497)
slurmdbd: debug2: StorageBackupHost = (null)
slurmdbd: debug2: StorageHost       = localhost
slurmdbd: debug2: StorageLoc        = slurm_acct_db
slurmdbd: debug2: StoragePort       = 3306
slurmdbd: debug2: StorageType       = accounting_storage/mysql
slurmdbd: debug2: StorageUser       = slurmdbd
slurmdbd: debug2: TCPTimeout        = 2
slurmdbd: debug2: TrackWCKey        = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug:  post user: couldn't get a uid for user fklein
slurmdbd: slurmdbd version 16.05.9 started
slurmdbd: debug2: running rollup at Wed Mar 07 18:05:57 2018
slurmdbd: debug2: No need to roll cluster tars this hour 1520442000 <= 1520442000
slurmdbd: debug2: No need to roll cluster tars this day 1520377200 <= 1520377200
slurmdbd: debug2: No need to roll cluster tars this month 1519858800 <= 1519858800
slurmdbd: debug2: Got 1 of 1 rolled up
slurmdbd: debug2: Everything rolled up


so the rollup is not triggered. Maybe it's normal and I didn't catch something
regarding the date to put in monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00').

But contrary to previously, this time, I have a file being written (still ongoing)
when I launch

sacctmgr archive dump jobs PurgejobAfter=10months directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives  

$ ll
total 724085
-rw------- 1 slurm  slurm 652171962 Mar  7 18:09 tars_job_archive_2017-03-01T00:00:00_2017-04-01T00:00:00

It's March of last year so it is what we wanted to archive and purge.

> I assume you are confirming purge when using sacctmgr (responding yes).
Yes, sorry, I forgot to say that.

Comment 17 Felip Moll 2018-03-08 05:31:45 MST

(In reply to Sophie Créno from comment #16)
> slurmdbd: debug2: No need to roll cluster tars this hour 1520442000 <=
> 1520442000
> slurmdbd: debug2: No need to roll cluster tars this day 1520377200 <=
> 1520377200
> slurmdbd: debug2: No need to roll cluster tars this month 1519858800 <=
> 1519858800

This is strange, if you modified the last_ran_table and restarted slurmdbd we should see your values in the log, and I see old ones.
The unix timestamp that is shown in the logs correspond to (https://www.epochconverter.com/):

hour: GMT: Wednesday 7 March de 2018 @ 17:00:00
day:  GMT: Tuesday 6 March de 2018 @ 23:00:00
month: GMT: Wednesday 28 February de 2018 @ 23:00:00

So it seems the values were not been correctly modified.

I tried it and it works for me:

a) stop slurmdbd
b) in the mysql database, run

UPDATE <cluster>_last_ran_table 
SET 
    hourly_rollup  = UNIX_TIMESTAMP('2017-01-01 00:00:00'),
    daily_rollup   = UNIX_TIMESTAMP('2017-01-01 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2017-01-01 00:00:00');

c) restart slurmdbd, and it'll automatically regenerate the rollup data. The slurmdbd.log should have some messages related to it progressing through.


But anyway, if you are getting the data with sacctmgr it is fine.


> But contrary to previously, this time, I have a file being written (still
> ongoing) when I launch
> 
> sacctmgr archive dump jobs PurgejobAfter=10months
> directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives  


Good, have you seen the job entries being reduced?

MariaDB [slurm_acct_db]> select count(*) from tars_job_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');                         
+----------+
| count(*) |
+----------+
| 18761519 |
+----------+

I commented internally and this are a lot of records, reduce as much as you can before upgrading.



Regarding the runaway jobs, don't worry about them now. They are not really runaway jobs, they have an end_time set so they are fine. They appear here probably because the corruption caused to the database during the crash. I am working on a patch for avoid this false runaway jobs situation.

Comment 22 Sophie Créno 2018-03-09 06:18:02 MST

Hello,

>> But contrary to previously, this time, I have a file being written (still
> ongoing) when I launch > > sacctmgr archive dump jobs PurgejobAfter=10months
> directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives
> 
> 
> Good, have you seen the job entries being reduced?

Yes. We have proceeded as you suggested in Comment5 and it worked smoothly:
+----------+
| count(*) |
+----------+
|  3026821 |
+----------+


and we were able to do the conversion and upgrade without problems regarding
the DB:
slurmdbd: Conversion done: success!

Thanks a lot for your help.


> Regarding the runaway jobs, don't worry about them now. They are not really
> runaway jobs, they have an end_time set so they are fine. They appear here
> probably because the corruption caused to the database during the crash. 
> I am working on a patch for avoid this false runaway jobs situation.
What should we do regarding the runaway jobs now that we are in 17.11.4?
Wait for the patch or try to clean them?


Thanks,

Comment 24 Felip Moll 2018-03-09 07:20:36 MST

(In reply to Sophie Créno from comment #22)
> and we were able to do the conversion and upgrade without problems regarding
> the DB:
> slurmdbd: Conversion done: success!
> 
> Thanks a lot for your help.


This is awesome Sophie.

Does this mean that you have everything in production and with the new version already?

I suppose the runaway list is not increasing, is it? Ignore all of them for now, I will come back when I have the exact steps for solving it.



One other question just for curiosity, this purge settings that you shown in comment 0, was it a new config or were already there?

Comment 25 Sophie Créno 2018-03-09 08:00:15 MST

> Does this mean that you have everything in production and with the new version
> already?
Yes. We had to complete the upgrade before this morning.


> I suppose the runaway list is not increasing, is it? Ignore all of them for
> now, I will come back when I have the exact steps for solving it.
Yes, they are but, as you said, they are not real runaway jobs since they are
not considered as running/pending anymore, contrary to real ones. Example
of job reported as a runaway one:

          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- 
29274643           eQTL  dedicated       tars    PENDING 2018-03-09T14:46:37 2018-03-09T14:46:37 

but 

$ sacct -j 29274643
       JobID    JobName      User  Partition        QOS      State               Start    Elapsed                 End ExitCode      NCPUS   NNodes        NodeList 
------------ ---------- --------- ---------- ---------- ---------- ------------------- ---------- ------------------- -------- ---------- -------- --------------- 
29274643_2         eQTL     user1  dedicated       fast  COMPLETED 2018-03-09T14:51:00   00:30:33 2018-03-09T15:21:33      0:0          1        1        tars-492 
29274643_2.+      batch                                  COMPLETED 2018-03-09T14:51:00   00:30:33 2018-03-09T15:21:33      0:0          1        1        tars-492 
29274643_1         eQTL     user1  dedicated       fast  COMPLETED 2018-03-09T14:27:22   00:30:36 2018-03-09T14:57:58      0:0          1        1        tars-490 
29274643_1.+      batch                                  COMPLETED 2018-03-09T14:27:22   00:30:36 2018-03-09T14:57:58      0:0          1        1        tars-490 

 
> One other question just for curiosity, this purge settings that you shown in
> comment 0, was it a new config or were already there?
They are the ones we have from the very beginning.

Comment 26 Felip Moll 2018-03-09 09:39:36 MST

(In reply to Sophie Créno from comment #25)

> > I suppose the runaway list is not increasing, is it? Ignore all of them for
> > now, I will come back when I have the exact steps for solving it.
> Yes, they are but, as you said, they are not real runaway jobs since they are
> not considered as running/pending anymore, contrary to real ones. 

If they are increasing it could be a sign of small jobs running and database overload.
I still have to investigate which causes this symptom.
I have the patch almost ready that will avoid this but some slowness can still be there.

Can you show me an sdiag and the load of the mysql server?


> > One other question just for curiosity, this purge settings that you shown in
> > comment 0, was it a new config or were already there?
> They are the ones we have from the very beginning.

I am wondering then why it failed this particular day and not before.
These millions of entries should have been purged before, but seems that this purge wasn't working until you have the problem.
Does it make sense or I am missing something?

Comment 27 Sophie Créno 2018-03-09 10:41:48 MST

Created attachment 6342 [details]
sdiag output

> Can you show me an sdiag and the load of the mysql server?
Find the sdiag output in attachment. 

top - 18:33:47 up 1 day, 43 min,  3 users,  load average: 0.55, 0.40, 0.38
Tasks: 830 total,   1 running, 829 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264416644k total, 34098476k used, 230318168k free,   111752k buffers
Swap: 16777212k total,        0k used, 16777212k free, 10996644k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                
 4492 mysql     20   0  140g  19g  13m S  2.0  7.9  18:46.73 /usr/sbin/mysqld --basedir=/usr --datadir=/opt/data/mariadb/data --plugin-dir=/usr/lib64/mysql/plugin --user=mysql


Do you want something more specific?


> I am wondering then why it failed this particular day and not before.
We did have crash before (see bug4295).

> These millions of entries should have been purged before, 
I don't think so given the configuration in comment0. I'm not surprised
to have such a big figure with 1 year of data retention.
But I wonder if I'm not missing something too since, even after the cited bug,
we still have generated 1 file per month as archive. So the configuration 
remained the same whereas it shouldn't have I guess.

Comment 30 Felip Moll 2018-03-12 12:13:52 MDT

(In reply to Sophie Créno from comment #27)
> Do you want something more specific?

It's enough thanks, I don't see your server is under high load.
 
> > I am wondering then why it failed this particular day and not before.
> We did have crash before (see bug4295).
> 
> > These millions of entries should have been purged before, 
> I don't think so given the configuration in comment0. I'm not surprised
> to have such a big figure with 1 year of data retention.

Yes, you're right. I talked internally and the suggestion is to purge in smaller steps, maybe month by month.

> But I wonder if I'm not missing something too since, even after the cited
> bug,we still have generated 1 file per month as archive. So the configuration 
> remained the same whereas it shouldn't have I guess.

The purge takes place at the start of the each purge interval, so if the unit is Months it will purge every month, so you will get 1 file per month.

For example, if the purge time is 2 months, the purge would happen at the beginning of each month. 

See: https://bugs.schedmd.com/show_bug.cgi?id=4295#c4

Does it make sense?

One more thing: we recently discovered an issue with MySQL 5.1 that causes the upgrade process to be slow or to never end. This probably was the cause for the problem about 'very long upgrading time' that affected many people. Since you've successfully migrated, I assume you upgraded also to a newer MySQL version. If not it is recommended to do so.


Finally, I commited two patches for the runaway jobs problem that will be available in next 17.11.5 release. This is planned for this week if nothing changes:

Commits are:
6d4518feb2578c21fab675f42b639eb02844c17f - sacctmgr not show already completed jobs as runaway ones
6bc5dc305ffbb92345fb721a08c8a095564f4dc8 - Incorrectly set the job status on mysql

Comment 31 Sophie Créno 2018-03-12 12:37:48 MDT

Hello,

> Yes, you're right. I talked internally and the suggestion is to purge in
> smaller steps, maybe month by month.
Sorry, I'm probably missing something because we already purge/archive
every month the data older than a year. Do you mean that we should keep
only the last month data in the db?


> For example, if the purge time is 2 months, the purge would happen at the
> beginning of each month.
> 
> See:https://bugs.schedmd.com/show_bug.cgi?id=4295#c4
> 
> Does it make sense?
Yes, I perfectly understand. I just wonder why we haven't changed that in 
the configuration after that bug since we already experienced the problem
of not being able to purge correctly.


> One more thing: we recently discovered an issue with MySQL 5.1 that causes the
> upgrade process to be slow or to never end. This probably was the cause for the
> problem about 'very long upgrading time' that affected many people. Since
> you've successfully migrated, I assume you upgraded also to a newer MySQL
> version. If not it is recommended to do so.
We run 10.2.13-MariaDB, so I guess that this problem has already been fixed
since it is more or less MySQL 5.6/5.7.


> Finally, I commited two patches for the runaway jobs problem that will be
> available in next 17.11.5 release. This is planned for this week if nothing
> changes.
Ok, we will retrieve them at the same time them.

Thanks for everything.

Comment 32 Felip Moll 2018-03-12 13:47:58 MDT

(In reply to Sophie Créno from comment #31)
> Hello,
> 
> > Yes, you're right. I talked internally and the suggestion is to purge in
> > smaller steps, maybe month by month.
> Sorry, I'm probably missing something because we already purge/archive
> every month the data older than a year. Do you mean that we should keep
> only the last month data in the db?

Yes, I mean keep data only for the last month... if you can't purge so often you can try other estimations like keeping only 5M jobs: in your case that you have roughly 19M jobs / year (comment 10), this would mean to keep about 3 months of records.

In the end it is a matter of the quantity of storage you have for the database, the performance on the server, and so on.

Tell me if you need something more about this issue, otherwise I will proceed to close it.

Thanks.

Comment 34 Felip Moll 2018-03-15 01:34:56 MDT

Hi Sophie,

Finally patches have been commited for:

17.11.5 release
6d4518feb2578c
6bc5dc305ffbb9

18.08 release
037bb3cbf52365

Given you have upgraded successfully and everything seems to be clear I close this bug, but please feel free to reopen if you are still missing something.

Thank you,
Felip

Comment 35 Sophie Créno 2018-03-15 03:19:59 MDT

Hello,

Sorry, I was away from the office.

> Yes, I mean keep data only for the last month... if you can't purge so often
> you can try other estimations like keeping only 5M jobs: in your case that you
> have roughly 19M jobs / year (comment 10), this would mean to keep about 3
> months of records.
That's what we have chosen: keep 3 months in the db.


> Finally patches have been commited for:
> 
> 17.11.5 release
> 6d4518feb2578c
> 6bc5dc305ffbb9
> 
> 18.08 release
> 037bb3cbf52365
Excellent, I'll install them today with the 17.11.5.


> Given you have upgraded successfully and everything seems to be clear I close
> this bug, but please feel free to reopen if you are still missing something.
Yes, sure. Thanks again.