| Summary: | packmem: Buffer size limit exceeded | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Sophie Créno <sophie.creno> |
| Component: | Database | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | felip.moll, vlegrand |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Institut Pasteur | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 17.11, 18.08 |
| Attachments: | Excerpt of slurmdbd logs with most of the duplicated messages expunged |
| | Excerpt of slurmctld logs from when the initial slurmdbd crash occurred |
| | runaway jobs |
| | sdiag output |
Description
Sophie Créno
2018-03-01 07:30:32 MST
Hi Sophie, could you attach the slurmctld and slurmdbd logs so that we can investigate the problem and give you a proper answer? Is the system currently working? If it is, I will set this bug to sev-2, since sev-1 means the system is not usable and the site has stopped production. Thanks

With our workaround we were able to restart the service, so I am downgrading to severity 2.

Created attachment 6263 [details]
Excerpt of slurmdbd logs with most of duplicated messages expunged
Created attachment 6264 [details]
Excerpt of slurmctld logs from when the initial slurmdbd crash occurred
Sophie, I need some more info.

First) I need to know the tuning parameters of the database. I see many ER_LOCK_WAIT_TIMEOUT errors in the logs, coming from MySQL. This error can happen if other queries are being performed while the purge/archive/rollup is running. I am especially interested in:

innodb_buffer_pool_size
innodb_log_file_size
innodb_lock_wait_timeout

There are some basic tuning guidelines documented here: http://slurm.schedmd.com/accounting.html

Second) Running low on disk space can also cause similar issues. What is the status of your database storage, and what is the underlying hardware that supports it (SSD, mechanical disks, a LUN)?

Third) I see "We have more allocated time than is possible" errors. I suppose this happened only when the purge made everything crash, but just to be sure, can you run and send me the output of:

sacctmgr show runaway

Fourth) How many records does your database have? How big is it? And especially, how many records are there for last month?

select count(*) from tars_step_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');

The archives and purges go in the order: event, suspend, step, job, resv. As you guessed, it seems from your logs that it failed in tars_step_table, so that table is the likely cause of the issue. Doing consecutive purges would probably do the trick. Set PurgeStepAfter=12months and force the rollup; then PurgeStepAfter=9months, then 6months, then 3months, and so on. But first let me see your numbers.

Created attachment 6280 [details]
runaway jobs

Hello,

> First)
>
> I would need to know the tune parameters of the database.
>
> I see many ER_LOCK_WAIT_TIMEOUT in the logs coming from MySQL.
> This error could happen if some other queries are being performed
> while doing the purge/archive/rollup.
> I am especially interested in:
>
> innodb_buffer_pool_size
> innodb_log_file_size
> innodb_lock_wait_timeout
>
> There are some basic tuning guidelines documented here:
> http://slurm.schedmd.com/accounting.html

Yes, sorry, I should have expected that question.

innodb_buffer_pool_size = 24075M
long_query_time = 100
innodb_log_file_size = 128M
innodb-lock-wait-timeout = 600

> Second)
>
> Running low on disk space can also cause similar issues. What is the
> status of your database storage, and what is the underlying hardware
> that supports it (SSD, mechanical disks, a LUN)?

$ df -h /var/lib/mysql/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         79G   40G   35G  54% /

The machine is a VM and the disk is on a SAN (EMC Unity). The downtime next week is, among other things, to replace the VM with a bare-metal machine.

> Third)
>
> I see "We have more allocated time than is possible" errors. I suppose this
> happened only when the purge made everything crash, but just to be sure,
> can you run and send me the output of:
>
> sacctmgr show runaway

We were not surprised by these messages given the downtime due to the crash, so yes, they are related to the runaway jobs. I ran sacctmgr show runaway just after the daemons restarted; there were only a few of them. I hadn't cleaned them up, in case they could be of any use to you. Now there are a lot. See the attachment.

> Fourth)
>
> How many records does your database have? How big is it? And especially,
> how many records are there for last month?
>
> select count(*) from tars_step_table where time_start <=
> UNIX_TIMESTAMP('2018-02-01 00:00:00');
>
> The archives and purges go in the order: event, suspend, step, job,
> resv. As you guessed, it seems from your logs that it failed in
> tars_step_table, so that table is the likely cause of the issue.
I would have expected a bigger figure:

mysql> select count(*) from tars_step_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
|      128 |
+----------+

Thanks,

(In reply to Sophie Créno from comment #6)
> innodb_buffer_pool_size = 24075M
> long_query_time = 100
> innodb_log_file_size = 128M
> innodb-lock-wait-timeout = 600

With this info, your parameters look good to me.

> Second)
> The machine is a VM and the disk is on a SAN (EMC Unity). The downtime
> next week is, among other things, to replace the VM with a bare-metal machine.

Understood.

> Third)
> sacctmgr show runaway
> We were not surprised by these messages given the downtime due to the crash,
> so yes, they are related to the runaway jobs. I ran sacctmgr show runaway
> just after the daemons restarted; there were only a few of them. I hadn't
> cleaned them up, in case they could be of any use to you. Now there are
> a lot. See the attachment.

I need some more time to analyze how to deal with these jobs; please don't delete them for now.

> Fourth)
> I would have expected a bigger figure:

Yes, me too. Can you show the results for the other tables: event, suspend, step, job, resv?

Thanks

> (In reply to Sophie Créno from comment #6)
> > sacctmgr show runaway
> > We were not surprised by these messages given the downtime due to the crash,
> > so yes, they are related to the runaway jobs. I ran sacctmgr show runaway
> > just after the daemons restarted; there were only a few of them. I hadn't
> > cleaned them up, in case they could be of any use to you. Now there are
> > a lot. See the attachment.
Well, after looking in depth at the logs and the code, the first few of them seem to be related to steps that couldn't be completed while the database was crashing, i.e.:

2018-03-01T10:09:14+01:00 tars-acct.it.pasteur.fr slurmdbd[4974]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012update "tars_step_table" set time_end=1519874593, state=3, kill_requid=-1, exit_code=0, user_sec=0, user_usec=0, sys_sec=0, sys_usec=3999, max_disk_read=0.000000, max_disk_read_task=65534, max_disk_read_node=4294967294, ave_disk_read=0.000000, max_disk_write=0.000000, max_disk_write_task=65534, max_disk_write_node=4294967294, ave_disk_write=0.000000, max_vsize=224856, max_vsize_task=0, max_vsize_node=0, ave_vs
2018-03-01T10:09:14+01:00 tars-acct.it.pasteur.fr slurmdbd[4974]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

This query is done in "as_mysql_step_complete", and if it fails it can cause runaway jobs. I don't care about these first ones, as you said.

--

As for the other thousands of runaway jobs, they seem to be the job steps that couldn't be successfully purged. Note that there are at least 51173 entries, which suggests there were probably too many job steps to purge at one time.

About the database, it wasn't clear to me... from here:

> it finally crashed and the base was corrupted. We had to restore it, but
> since it was containing the same data to archive and purge, slurmdbd
> behaved the same way when we restarted it. To be able to restore
> the service, we had to postpone the archive/purge in the config file:

is this the sequence you followed?

1. DB crashed
2. Restored the database from a mysqldump
3. DB crashed again
4. Modified the config file
5. Started the DBD

If so, between 3 and 4 there should have been a mysql restore. Is it possible you're running with a corrupted DB? Also, did you run a *fix* of the runaway jobs, or just show them?
It's odd that time_start is equal to time_end; this would mean you tried a fix.

> Can you show the results for the other tables?: event, suspend, step, job, resv

I guess your low select count from tars_step_table is because you are running with the 'crashed' database, which had incompletely purged most of the step entries? Before upgrading we need to fix the lost jobs in this database, then see that everything is working fine, and then purge it in smaller steps.

> Also, did you run a *fix* of the runaway jobs, or just show them? It's odd
> that time_start is equal to time_end; this would mean you tried a fix.

I misspoke here: the state seen is 'pending', so time_start = time_end is expected, and you did not run a fix. Please answer my previous questions and then we will try to fix these runaway jobs.
> As for the other thousands of runaway jobs, they seem to be the job
> steps that couldn't be successfully purged. Note that there are at least
> 51173 entries, which suggests there were probably too many job steps to
> purge at one time.
>
> About the database, it wasn't clear to me... from here:
>
>> it finally crashed and the base was corrupted. We had to restore it, but
>> since it was containing the same data to archive and purge, slurmdbd
>> behaved the same way when we restarted it. To be able to restore
>> the service, we had to postpone the archive/purge in the config file:
>
> is this the sequence you followed?
>
> 1. DB crashed
> 2. Restored the database from a mysqldump
> 3. DB crashed again

Stopped before crashing.

> 4. Modified the config file
> 5. Started the DBD

Yes.

> If so, between 3 and 4 there should have been a mysql restore. Is it possible
> you're running with a corrupted DB?

Maybe. We stopped it before a new crash when we saw mysql reaching almost 100% CPU and the error "packmem: Buffer size limit exceeded" in the slurmdbd logs. But you are probably right. Even though we didn't have

InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-innodb-recovery.html
InnoDB: for more information.

but only:

180301 12:31:17 [Note] /usr/libexec/mysqld: Normal shutdown
180301 12:31:17 [Note] Event Scheduler: Purging the queue. 0 events
180301 12:31:17 InnoDB: Starting shutdown...
180301 12:31:21 InnoDB: Shutdown completed; log sequence number 5 334554841
180301 12:31:21 [Note] /usr/libexec/mysqld: Shutdown complete
180301 12:31:21 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
180301 12:31:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
180301 12:31:23 InnoDB: Initializing buffer pool, size = 23.5G
180301 12:31:25 InnoDB: Completed initialization of buffer pool
180301 12:31:25 InnoDB: Started; log sequence number 5 334554841
180301 12:31:25 [Note] Event Scheduler: Loaded 0 events
180301 12:31:25 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution

in the mysqld logs, we should have set innodb_force_recovery=1 in /etc/my.cnf nevertheless.

>> Can you show the results for the other tables?: event, suspend, step, job, resv

mysql> select count(*) from tars_event_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
|     9950 |
+----------+

mysql> select count(*) from tars_job_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
| 18761513 |
+----------+

mysql> select count(*) from tars_resv_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
|       18 |
+----------+

We don't use suspension.

> I guess your low select count from tars_step_table is because you are
> running with the 'crashed' database, which had incompletely purged most
> of the step entries?
>
> Before upgrading we need to fix the lost jobs in this database, then see
> that everything is working fine, and then purge it in smaller steps.

Ok. We are repopulating a clone database with the dump from last night. We will apply your procedure on it and, when everything seems fine, we will reproduce the procedure on the production one.
Maybe we could wait until the downtime (no jobs submitted or running) to apply the procedure on the production db?

Thanks,

> > 3. DB crashed again
> Stopped before crashing.

At this point it would probably have been better to restore everything again and start fresh with the modified configuration.

> InnoDB: Your database may be corrupt or you may have copied the InnoDB

This is not good, but I hope MySQL solved the problem automatically.

> mysql> select count(*) from tars_job_table where time_start <=
> UNIX_TIMESTAMP('2018-02-01 00:00:00');
> +----------+
> | count(*) |
> +----------+
> | 18761513 |
> +----------+

This is a big number and is probably the cause of this issue, aside from the step table.

> Ok. We are repopulating a clone database with the dump from last night. We
> will apply your procedure on it and, when everything seems fine, we will
> reproduce the procedure on the production one. Maybe we could wait until
> the downtime (no jobs submitted or running) to apply the procedure on the
> production db?

Good, I think it is the safest plan. Waiting until the downtime is something I was going to suggest as well; reducing the possible calls to and interaction with the database is worth doing. Another, less safe, option is to change the slurmdbd port to avoid communication while it is archiving.

In order to force the archive on your test bed, you can use 'sacctmgr archive dump' or set the last_ran_table to now-1:

update <yourcluster>_last_ran_table SET
  hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
  daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
  monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');

Tell me how it looks.

(In reply to Felip Moll from comment #11)
> > > 3. DB crashed again
> > Stopped before crashing.
>
> At this point it would probably have been better to restore everything again
> and start fresh with the modified configuration.
> > > InnoDB: Your database may be corrupt or you may have copied the InnoDB
> >
> > This is not good, but I hope MySQL solved the problem automatically.
> >
> > > mysql> select count(*) from tars_job_table where time_start <=
> > > UNIX_TIMESTAMP('2018-02-01 00:00:00');
> > > +----------+
> > > | count(*) |
> > > +----------+
> > > | 18761513 |
> > > +----------+
> >
> > This is a big number and is probably the cause of this issue, aside from
> > the step table.
> >
> > > Ok. We are repopulating a clone database with the dump from last night. We
> > > will apply your procedure on it and, when everything seems fine, we will
> > > reproduce the procedure on the production one. Maybe we could wait until
> > > the downtime (no jobs submitted or running) to apply the procedure on the
> > > production db?
> >
> > Good, I think it is the safest plan.
> >
> > Waiting until the downtime is something I was going to suggest as well;
> > reducing the possible calls to and interaction with the database is worth
> > doing. Another, less safe, option is to change the slurmdbd port to avoid
> > communication while it is archiving.
> >
> > In order to force the archive on your test bed, you can use 'sacctmgr
> > archive dump' or set the last_ran_table to now-1:
> >
> > update <yourcluster>_last_ran_table SET
> >   hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
> >   daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
> >   monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');
> >
> > Tell me how it looks.

Sophie, there's a possibility that you have hit a bug and the runaway jobs aren't really runaway ones. I will inform you when I have more info. In the meantime it is still worth continuing with the plan we discussed, so keep going.

Hello,
I tried sacctmgr archive dump on the test database and whatever I do
I always have the same message: Would you like to commit changes?
Example:
sacctmgr archive dump jobs PurgejobAfter=12months directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives
sacctmgr: SUCCESS
Would you like to commit changes? (You have 30 seconds to decide)
It does nothing and the directory remains empty.
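One possible cause of the empty directory is the unanswered commit prompt. A minimal sketch, assuming sacctmgr's -i/--immediate flag (which commits without asking) also applies to archive dump; the path is the archive directory from this ticket:

```shell
#!/bin/sh
# Hedged sketch: run the archive dump non-interactively so the
# 30-second "Would you like to commit changes?" prompt cannot
# expire unanswered. -i is sacctmgr's immediate-commit flag.
ARCHIVE_DIR=/pasteur/services/policy01/slurm/tmp/test_archive/archives
CMD="sacctmgr -i archive dump jobs PurgeJobAfter=12months directory=$ARCHIVE_DIR"
if command -v sacctmgr >/dev/null 2>&1; then
  # On a real slurmdbd host this would perform the dump.
  $CMD
else
  # Outside a Slurm installation, just show the intended command.
  echo "sacctmgr not found; intended command: $CMD"
fi
```

Answering "yes" interactively within the 30 seconds should have the same effect; the flag only removes the prompt.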
When I started slurmdbd, it didn't try to archive/purge even though I had
restored the initial configuration:
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=1months
PurgeSuspendAfter=12months
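For reference, Felip's earlier suggestion was not to purge a whole backlog at once but to step the retention down in stages. A minimal sketch of that schedule (the slurmdbd.conf edit and slurmdbd restart between stages are assumed, not shown):

```shell
#!/bin/sh
# Hedged sketch of the consecutive-purge plan: lower PurgeStepAfter
# gradually so each pass archives/purges a bounded window of records.
schedule=""
for months in 12 9 6 3 1; do
  schedule="$schedule PurgeStepAfter=${months}months"
  # In practice: edit slurmdbd.conf with this value, restart slurmdbd,
  # and wait for the rollup to finish before the next stage.
  echo "stage: set PurgeStepAfter=${months}months, restart slurmdbd, wait for rollup"
done
```

The same stepping idea applies to PurgeJobAfter when the job table is the large one.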
We still have the jobs in the db:
MariaDB [slurm_acct_db]> select count(*) from tars_job_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
| 18761519 |
+----------+
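The same cutoff query can be generated for each of the five tables Felip listed (event, suspend, step, job, resv); a sketch, using the "tars" cluster prefix from this ticket, whose output can be piped to mysql:

```shell
#!/bin/sh
# Hedged sketch: emit one count(*) query per accounting table, in the
# purge order event, suspend, step, job, resv. Pipe to mysql to run.
CUTOFF='2018-02-01 00:00:00'
SQL=""
for t in event suspend step job resv; do
  q="select count(*) from tars_${t}_table where time_start <= UNIX_TIMESTAMP('${CUTOFF}');"
  SQL="$SQL$q
"
  echo "$q"
done
```

Usage would be something like `sh gen_counts.sh | mysql slurm_acct_db` (script name hypothetical).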
The number of runaway jobs has increased compared to the last request and corresponds
to the figure obtained from the production db.
I've also done
> update <yourcluster>_last_ran_table SET
> hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
> daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
> monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');
but even after a restart of slurmdbd, the archive/purge isn't triggered.
Did I miss something?
About the runaway jobs, I forgot to mention that we have the following
message:
sacctmgr: slurmdbd: reopening connection
sacctmgr: slurmdbd: reopening connection
sacctmgr: slurmdbd: reopening connection
sacctmgr: slurmdbd: reopening connection
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
Should we adjust something?
Thanks,
(In reply to Sophie Créno from comment #14)
> I've also done
>
> > update <yourcluster>_last_ran_table SET
> > hourly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
> > daily_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'),
> > monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00');
>
> but even after a restart of slurmdbd, the archive/purge isn't triggered.
> Did I miss something?
>
> About the runaway jobs, I forgot to mention that we have the following
> message:
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: slurmdbd: reopening connection
> sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
> sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
>
> Should we adjust something?
>
> Thanks,

Can you start slurmdbd at debug level (slurmdbd -Dvvv), try to purge jobs, and send me back slurmdbd.log? I assume you are confirming the purge when using sacctmgr (responding yes).

Here is what I have obtained. When I saw that there was nothing to roll up, I changed from 12 months to 10, but I still have the same output from slurmdbd:

slurmdbd: debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: It appears the table conversions have already taken place, hooray!
slurmdbd: Accounting storage MYSQL plugin loaded
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug2: ArchiveDir        = /pasteur/services/policy01/slurm/tmp/test_archive/archives
slurmdbd: debug2: ArchiveScript     = (null)
slurmdbd: debug2: AuthInfo          = (null)
slurmdbd: debug2: AuthType          = auth/munge
slurmdbd: debug2: CommitDelay       = 0
slurmdbd: debug2: DbdAddr           = tars-pacct
slurmdbd: debug2: DbdBackupHost     = (null)
slurmdbd: debug2: DbdHost           = tars-pacct
slurmdbd: debug2: DbdPort           = 6819
slurmdbd: debug2: DebugFlags        = (null)
slurmdbd: debug2: DebugLevel        = 6
slurmdbd: debug2: DefaultQOS        = (null)
slurmdbd: debug2: LogFile           = (null)
slurmdbd: debug2: MessageTimeout    = 300
slurmdbd: debug2: PidFile           = /var/run/slurmdbd.pid
slurmdbd: debug2: PluginDir         = /pasteur/services/policy01/slurm/tmp/test_archive/install/slurm-16.05/lib/slurm
slurmdbd: debug2: PrivateData       = none
slurmdbd: debug2: PurgeEventAfter   = 10 months*
slurmdbd: debug2: PurgeJobAfter     = 10 months*
slurmdbd: debug2: PurgeResvAfter    = 10 months*
slurmdbd: debug2: PurgeStepAfter    = 1 months*
slurmdbd: debug2: PurgeSuspendAfter = 10 months*
slurmdbd: debug2: SlurmUser         = slurm(497)
slurmdbd: debug2: StorageBackupHost = (null)
slurmdbd: debug2: StorageHost       = localhost
slurmdbd: debug2: StorageLoc        = slurm_acct_db
slurmdbd: debug2: StoragePort       = 3306
slurmdbd: debug2: StorageType       = accounting_storage/mysql
slurmdbd: debug2: StorageUser       = slurmdbd
slurmdbd: debug2: TCPTimeout        = 2
slurmdbd: debug2: TrackWCKey        = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug:  post user: couldn't get a uid for user fklein
slurmdbd: slurmdbd version 16.05.9 started
slurmdbd: debug2: running rollup at Wed Mar 07 18:05:57 2018
slurmdbd: debug2: No need to roll cluster tars this hour 1520442000 <= 1520442000
slurmdbd: debug2: No need to roll cluster tars this day 1520377200 <= 1520377200
slurmdbd: debug2: No need to roll cluster tars this month 1519858800 <= 1519858800
slurmdbd: debug2: Got 1 of 1 rolled up
slurmdbd: debug2: Everything rolled up

so the rollup is not triggered. Maybe it's normal and I didn't catch something regarding the date to put in monthly_rollup = UNIX_TIMESTAMP('2018-03-06 00:00:00'). But contrary to before, this time I have a file being written (still ongoing) when I launch

sacctmgr archive dump jobs PurgejobAfter=10months directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives

$ ll
total 724085
-rw------- 1 slurm slurm 652171962 Mar  7 18:09 tars_job_archive_2017-03-01T00:00:00_2017-04-01T00:00:00

It's March of last year, so it is what we wanted to archive and purge.

> I assume you are confirming the purge when using sacctmgr (responding yes).

Yes, sorry, I forgot to say that.

(In reply to Sophie Créno from comment #16)
> slurmdbd: debug2: No need to roll cluster tars this hour 1520442000 <= 1520442000
> slurmdbd: debug2: No need to roll cluster tars this day 1520377200 <= 1520377200
> slurmdbd: debug2: No need to roll cluster tars this month 1519858800 <= 1519858800

This is strange: if you modified the last_ran_table and restarted slurmdbd, we should see your values in the log, and I see old ones. The unix timestamps shown in the logs correspond to (https://www.epochconverter.com/):

hour:  GMT: Wednesday 7 March 2018 @ 17:00:00
day:   GMT: Tuesday 6 March 2018 @ 23:00:00
month: GMT: Wednesday 28 February 2018 @ 23:00:00

So it seems the values were not correctly modified. I tried it and it works for me:

a) stop slurmdbd
b) in the mysql database, run

UPDATE <cluster>_last_ran_table SET
  hourly_rollup = UNIX_TIMESTAMP('2017-01-01 00:00:00'),
  daily_rollup = UNIX_TIMESTAMP('2017-01-01 00:00:00'),
  monthly_rollup = UNIX_TIMESTAMP('2017-01-01 00:00:00');

c) restart slurmdbd, and it'll automatically regenerate the rollup data. The slurmdbd.log should have some messages related to it progressing through.
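As a quick local check, the rollup epochs from those log lines can be decoded with date(1) (GNU date assumed) before restarting slurmdbd, to confirm what last_ran_table actually holds:

```shell
#!/bin/sh
# Hedged sketch: decode the three epochs from the "No need to roll"
# log lines (hour, day, month) into UTC timestamps with GNU date.
# Expected: 2018-03-07T17:00:00Z, 2018-03-06T23:00:00Z, 2018-02-28T23:00:00Z.
for ts in 1520442000 1520377200 1519858800; do
  date -u -d "@$ts" +"$ts -> %Y-%m-%dT%H:%M:%SZ"
done
```

If the decoded values are not the ones written with UNIX_TIMESTAMP(...), the UPDATE did not stick (e.g. it was run against the wrong cluster's last_ran_table or not committed).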
But anyway, if you are getting the data with sacctmgr, it is fine.

> But contrary to before, this time I have a file being written (still
> ongoing) when I launch
>
> sacctmgr archive dump jobs PurgejobAfter=10months
> directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives

Good. Have you seen the job entries being reduced?

MariaDB [slurm_acct_db]> select count(*) from tars_job_table where time_start <= UNIX_TIMESTAMP('2018-02-01 00:00:00');
+----------+
| count(*) |
+----------+
| 18761519 |
+----------+

I raised this internally, and this is a lot of records; reduce it as much as you can before upgrading.

Regarding the runaway jobs, don't worry about them for now. They are not really runaway jobs: they have an end_time set, so they are fine. They probably appear here because of the corruption caused to the database during the crash. I am working on a patch to avoid this false-runaway-jobs situation.

Hello,

> > But contrary to before, this time I have a file being written (still
> > ongoing) when I launch
> >
> > sacctmgr archive dump jobs PurgejobAfter=10months
> > directory=/pasteur/services/policy01/slurm/tmp/test_archive/archives
>
> Good. Have you seen the job entries being reduced?

Yes. We proceeded as you suggested in comment 5 and it worked smoothly:

+----------+
| count(*) |
+----------+
|  3026821 |
+----------+

and we were able to do the conversion and upgrade without problems regarding the DB:

slurmdbd: Conversion done: success!

Thanks a lot for your help.

> Regarding the runaway jobs, don't worry about them for now. They are not
> really runaway jobs: they have an end_time set, so they are fine. They
> probably appear here because of the corruption caused to the database
> during the crash. I am working on a patch to avoid this false-runaway-jobs
> situation.

What should we do regarding the runaway jobs now that we are on 17.11.4? Wait for the patch, or try to clean them?
Thanks,

(In reply to Sophie Créno from comment #22)
> and we were able to do the conversion and upgrade without problems regarding
> the DB:
> slurmdbd: Conversion done: success!
>
> Thanks a lot for your help.

This is awesome, Sophie. Does this mean that you have everything in production and on the new version already? I suppose the runaway list is not increasing, is it? Ignore all of them for now; I will come back when I have the exact steps for solving it.

One other question, just out of curiosity: the purge settings you showed in comment 0 — were they a new config, or already there?

> Does this mean that you have everything in production and on the new version
> already?

Yes. We had to complete the upgrade before this morning.

> I suppose the runaway list is not increasing, is it? Ignore all of them for
> now; I will come back when I have the exact steps for solving it.

Yes, they are increasing, but, as you said, they are not real runaway jobs, since they are not considered running/pending anymore, contrary to real ones.
Example of a job reported as a runaway one:

ID            Name   Partition   Cluster   State     TimeStart            TimeEnd
------------  -----  ----------  --------  --------  -------------------  -------------------
29274643      eQTL   dedicated   tars      PENDING   2018-03-09T14:46:37  2018-03-09T14:46:37

but:

$ sacct -j 29274643
JobID         JobName  User   Partition  QOS   State      Start                Elapsed   End                  ExitCode  NCPUS  NNodes  NodeList
------------  -------  -----  ---------  ----  ---------  -------------------  --------  -------------------  --------  -----  ------  --------
29274643_2    eQTL     user1  dedicated  fast  COMPLETED  2018-03-09T14:51:00  00:30:33  2018-03-09T15:21:33  0:0       1      1       tars-492
29274643_2.+  batch                            COMPLETED  2018-03-09T14:51:00  00:30:33  2018-03-09T15:21:33  0:0       1      1       tars-492
29274643_1    eQTL     user1  dedicated  fast  COMPLETED  2018-03-09T14:27:22  00:30:36  2018-03-09T14:57:58  0:0       1      1       tars-490
29274643_1.+  batch                            COMPLETED  2018-03-09T14:27:22  00:30:36  2018-03-09T14:57:58  0:0       1      1       tars-490

> One other question, just out of curiosity: the purge settings you showed in
> comment 0 — were they a new config, or already there?

They are the ones we have had from the very beginning.

(In reply to Sophie Créno from comment #25)
> > I suppose the runaway list is not increasing, is it? Ignore all of them for
> > now; I will come back when I have the exact steps for solving it.
> Yes, they are increasing, but, as you said, they are not real runaway jobs,
> since they are not considered running/pending anymore, contrary to real ones.

If they are increasing, it could be a sign of small jobs running and database overload. I still have to investigate what causes this symptom. I have a patch almost ready that will avoid this, but some slowness may remain. Can you show me an sdiag and the load of the mysql server?

> > One other question, just out of curiosity: the purge settings you showed in
> > comment 0 — were they a new config, or already there?
> They are the ones we have had from the very beginning.
I am wondering, then, why it failed on this particular day and not before. These millions of entries should have been purged before, but it seems the purge wasn't working until you hit this problem. Does that make sense, or am I missing something?

Created attachment 6342 [details]
sdiag output

> Can you show me an sdiag and the load of the mysql server?

Find the sdiag output in the attachment.

top - 18:33:47 up 1 day, 43 min,  3 users,  load average: 0.55, 0.40, 0.38
Tasks: 830 total,   1 running, 829 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264416644k total, 34098476k used, 230318168k free,   111752k buffers
Swap: 16777212k total,        0k used, 16777212k free, 10996644k cached

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 4492 mysql  20   0  140g  19g  13m S  2.0  7.9 18:46.73 /usr/sbin/mysqld --basedir=/usr --datadir=/opt/data/mariadb/data --plugin-dir=/usr/lib64/mysql/plugin --user=mysql

Do you want something more specific?

> I am wondering, then, why it failed on this particular day and not before.

We did have a crash before (see bug 4295).

> These millions of entries should have been purged before,

I don't think so, given the configuration in comment 0. I'm not surprised to have such a big figure with 1 year of data retention. But I wonder if I'm missing something too since, even after the cited bug, we still generated 1 file per month as an archive. So the configuration remained the same, whereas it shouldn't have, I guess.

(In reply to Sophie Créno from comment #27)
> Do you want something more specific?

It's enough, thanks. I don't see your server under high load.

> > I am wondering, then, why it failed on this particular day and not before.
> We did have a crash before (see bug 4295).
>
> > These millions of entries should have been purged before,
> I don't think so, given the configuration in comment 0. I'm not surprised
> to have such a big figure with 1 year of data retention.

Yes, you're right.
I talked internally, and the suggestion is to purge in smaller steps, maybe month by month.

> But I wonder if I'm missing something too since, even after the cited bug,
> we still generated 1 file per month as an archive. So the configuration
> remained the same, whereas it shouldn't have, I guess.

The purge takes place at the start of each purge interval, so if the unit is Months it will purge every month, and you will get 1 file per month. For example, if the purge time is 2 months, the purge would happen at the beginning of each month.

See: https://bugs.schedmd.com/show_bug.cgi?id=4295#c4

Does it make sense?

One more thing: we recently discovered an issue with MySQL 5.1 that causes the upgrade process to be slow or to never end. This was probably the cause of the 'very long upgrading time' problem that affected many people. Since you've successfully migrated, I assume you also upgraded to a newer MySQL version; if not, it is recommended to do so.

Finally, I committed two patches for the runaway jobs problem that will be available in the next 17.11.5 release. This is planned for this week if nothing changes. The commits are:

6d4518feb2578c21fab675f42b639eb02844c17f - sacctmgr not show already completed jobs as runaway ones
6bc5dc305ffbb92345fb721a08c8a095564f4dc8 - Incorrectly set the job status on mysql

Hello,

> Yes, you're right. I talked internally, and the suggestion is to purge in
> smaller steps, maybe month by month.

Sorry, I'm probably missing something, because we already purge/archive the data older than a year every month. Do you mean that we should keep only the last month's data in the db?

> For example, if the purge time is 2 months, the purge would happen at the
> beginning of each month.
>
> See: https://bugs.schedmd.com/show_bug.cgi?id=4295#c4
>
> Does it make sense?

Yes, I understand perfectly. I just wonder why we didn't change that in the configuration after that bug, since we had already experienced the problem of not being able to purge correctly.
> One more thing: we recently discovered an issue with MySQL 5.1 that causes
> the upgrade process to be slow or to never end. This was probably the cause
> of the 'very long upgrading time' problem that affected many people. Since
> you've successfully migrated, I assume you also upgraded to a newer MySQL
> version; if not, it is recommended to do so.

We run 10.2.13-MariaDB, so I guess this problem has already been fixed, since it is more or less MySQL 5.6/5.7.

> Finally, I committed two patches for the runaway jobs problem that will be
> available in the next 17.11.5 release. This is planned for this week if
> nothing changes.

Ok, we will retrieve them at that time. Thanks for everything.

(In reply to Sophie Créno from comment #31)
> Hello,
>
> > Yes, you're right. I talked internally, and the suggestion is to purge in
> > smaller steps, maybe month by month.
> Sorry, I'm probably missing something, because we already purge/archive
> the data older than a year every month. Do you mean that we should keep
> only the last month's data in the db?

Yes, I mean keep data only for the last month... if you can't purge that often, you can try other estimates, like keeping only 5M jobs: in your case, with roughly 19M jobs / year (comment 10), this would mean keeping about 3 months of records. In the end it is a matter of the quantity of storage you have for the database, the performance of the server, and so on.

Tell me if you need anything more about this issue; otherwise I will proceed to close it. Thanks.

Hi Sophie,

Finally, patches have been committed for:

17.11.5 release
6d4518feb2578c
6bc5dc305ffbb9

18.08 release
037bb3cbf52365

Given that you have upgraded successfully and everything seems to be clear, I am closing this bug, but please feel free to reopen if you are still missing something.

Thank you,
Felip

Hello,

Sorry, I was away from the office.

> Yes, I mean keep data only for the last month... if you can't purge that
> often, you can try other estimates, like keeping only 5M jobs: in your case,
> with roughly 19M jobs / year (comment 10), this would mean keeping about 3
> months of records.

That's what we have chosen: keep 3 months in the db.

> Finally, patches have been committed for:
>
> 17.11.5 release
> 6d4518feb2578c
> 6bc5dc305ffbb9
>
> 18.08 release
> 037bb3cbf52365

Excellent, I'll install them today with 17.11.5.

> Given that you have upgraded successfully and everything seems to be clear,
> I am closing this bug, but please feel free to reopen if you are still
> missing something.

Yes, sure. Thanks again.