Ticket 4736

Summary: slurmdbd error message after upgrade
Product: Slurm Reporter: mengxing cheng <mxcheng>
Component: slurmdbd Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: tim
Version: 17.11.2   
Hardware: Linux   
OS: Linux   
Site: University of Chicago Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description mengxing cheng 2018-02-05 09:13:03 MST
Dear support,

slurmdbd and mysqld reported error messages right after the upgrade to 17.11.2. slurmdbd took more than an hour and a half to upgrade, possibly due to the large number of rows in the job and step tables. Slurm production appears to have been running well since then, but I want to know how serious the error messages are. Thank you!


slurmdbd
[2018-02-02T15:21:01.538] adding column features after federation in table cluster_table
[2018-02-02T15:21:01.538] dropping column fed_weight from table cluster_table
[2018-02-02T15:21:01.760] pre-converting job table for midway
[2018-02-02T15:21:14.941] adding column pack_job_id after id_group in table "midway_job_table"
[2018-02-02T15:21:14.941] adding column pack_job_offset after pack_job_id in table "midway_job_table"
[2018-02-02T15:21:14.941] adding column mcs_label after kill_requid in table "midway_job_table"
[2018-02-02T15:21:14.942] adding column work_dir after wckey in table "midway_job_table"
[2018-02-02T15:21:14.942] adding key old_tuple (id_job, id_assoc, time_submit) to table "midway_job_table"
[2018-02-02T15:21:14.942] adding key pack_job (pack_job_id) to table "midway_job_table"
[2018-02-02T15:24:27.301] adding column unused_wall after tres in table "midway_resv_table"
[2018-02-02T15:24:58.894] converting step table for midway
[2018-02-02T15:25:01.591] converting job table for midway
[2018-02-02T17:10:10.810] converting resv table for midway
[2018-02-02T17:10:12.243] error: _update_unused_wall, total job time 0.000000 is greater than total resv time -20.
[2018-02-02T17:10:12.508] error: _update_unused_wall, total job time 0.000000 is greater than total resv time -105510.
[2018-02-02T17:10:12.588] converting cluster tables for midway
[2018-02-02T17:10:12.589] converting assoc table for midway
[2018-02-02T17:10:12.947] dropping column priority from table federation_table
[2018-02-02T17:10:13.096] converting QOS table
[2018-02-02T17:10:13.096] Conversion done: success!
[2018-02-02T17:10:14.855] slurmdbd version 17.11.2 started
[2018-02-02T17:29:21.026] DBD_JOB_COMPLETE: cluster not registered
[2018-02-02T18:00:33.004] error: We have more allocated time than is possible (526520387648 > 306801252000) for cluster midway(85222570) from 2017-11-29T19:00:00 - 2017-11-29T20:00:00 tres 2
[2018-02-02T18:00:33.005] error: We have more time than is possible (306801252000+3265344000+0)(310066596000) > 306801252000 for cluster midway(85222570) from 2017-11-29T19:00:00 - 2017-11-29T20:00:00 tres 2
[2018-02-02T18:00:37.663] error: We have more allocated time than is possible (484323864750 > 306801252000) for cluster midway(85222570) from 2017-11-29T21:00:00 - 2017-11-29T22:00:00 tres 2
[2018-02-02T18:00:37.664] error: We have more time than is possible (306801252000+3265344000+0)(310066596000) > 306801252000 for cluster midway(85222570) from 2017-11-29T21:00:00 - 2017-11-29T22:00:00 tres 2
[2018-02-02T18:00:39.943] error: We have more allocated time than is possible (548215803518 > 306801252000) for cluster midway(85222570) from 2017-11-29T22:00:00 - 2017-11-29T23:00:00 tres 2
[2018-02-02T18:00:39.943] error: We have more time than is possible (306801252000+3265344000+0)(310066596000) > 306801252000 for cluster midway(85222570) from 2017-11-29T22:00:00 - 2017-11-29T23:00:00 tres 2


mysqld
180202 17:09:50  InnoDB: ERROR: the age of the last checkpoint is 120792482,
InnoDB: which exceeds the log group capacity 120792269.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.
180202 17:10:06  InnoDB: ERROR: the age of the last checkpoint is 120792528,
InnoDB: which exceeds the log group capacity 120792269.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.
Comment 2 Felip Moll 2018-02-06 06:52:58 MST
The error indicates that you are trying to insert too much data into InnoDB too quickly, and the InnoDB log is filling up before the data can be flushed to the main data files. What underlying storage are you using for the database and for the InnoDB log files?

It is also possible that your MySQL innodb_log_file_size is far too small; increasing it could help.

I cannot confirm this, but it is also possible that the transaction did not end correctly and the job table migration completed only partially. If you still have the old database, you could verify that all jobs up to the update are present in the new database.

You can also check for orphaned jobs in MySQL:

'sacctmgr show lostjobs'

Send me the output of this command.
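To compare the old and new databases as suggested above, one rough approach (a sketch, not an official Slurm tool; the dump commands and file names are assumptions, though the table name matches the logs) is to export the job IDs from each database to a file and diff the sets:

```python
# Sketch: compare job IDs dumped from the old and the new database.
# Assumes each file contains one id_job per line, produced e.g. with:
#   mysql -N -e 'select id_job from midway_job_table' old_db > old_jobs.txt
#   mysql -N -e 'select id_job from midway_job_table' new_db > new_jobs.txt

def load_job_ids(path):
    """Read one numeric job ID per line into a set."""
    with open(path) as f:
        return {int(line) for line in f if line.strip()}

def missing_jobs(old_path, new_path):
    """Return job IDs present in the old database but absent from the new one."""
    return sorted(load_job_ids(old_path) - load_job_ids(new_path))
```

An empty result means every job recorded before the upgrade made it into the new database.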

--------
In order to tune the database:
Step 1) Check the following parameters in /etc/my.cnf and increase them depending on your current values:

[mysqld]
innodb_log_buffer_size          = xxM -> up to 3 times your value
innodb_buffer_pool_size         = yyG 
innodb_log_file_size            = zzM -> up to 6 times your value

It is worth reading more about these parameters; I am not a DBA expert and this is only a suggestion.

Step 2) You must stop MySQL cleanly; this is very important:

mysql -uroot -p -e"SET GLOBAL innodb_fast_shutdown = 0;"

Step 3) systemctl stop mysql

Step 4) mv /var/lib/mysql/ib_logfile* /some/tmp_backup_dir/

Step 5)  systemctl start mysql

innodb_fast_shutdown = 0 forces InnoDB to fully purge transactional changes from all InnoDB components, including the transaction logs (ib_logfile0, ib_logfile1).

Step 6) Check that everything is working fine: sacct, jobs, etc.

Step 7) You can remove the backup:

 rm -rf /some/tmp_backup_dir/
----

Let me know how everything goes.


Extra info:
https://dba.stackexchange.com/questions/16510/mysql-innodb-innodb-error-the-age-of-the-last-checkpoint-is-innodb-which-exc
https://www.percona.com/blog/2008/11/21/how-to-calculate-a-good-innodb-log-file-size/
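The Percona article linked above estimates a good innodb_log_file_size from the redo-log write rate: sample the "Log sequence number" from `SHOW ENGINE INNODB STATUS` twice, some seconds apart, and size the combined logs to hold roughly one hour of writes. A minimal sketch of that arithmetic (the sampling itself is left to the reader; the function name is mine, not Percona's):

```python
def suggested_log_file_size(lsn_start, lsn_end, interval_sec, n_log_files=2):
    """Percona rule of thumb: the combined redo logs should hold about one
    hour of writes.  lsn_start/lsn_end are the 'Log sequence number' values
    (in bytes) from SHOW ENGINE INNODB STATUS, taken interval_sec apart.
    Returns a per-file size in bytes, assuming n_log_files redo log files."""
    bytes_per_hour = (lsn_end - lsn_start) * 3600 / interval_sec
    return int(bytes_per_hour / n_log_files)

# Example: ~3.2 MB written in 60 s -> ~192 MB/h -> two 96 MB log files,
# noticeably larger than the 64 MB configured in this ticket.
```

This is only a heuristic; larger redo logs also mean longer crash recovery, so it is a trade-off.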
Comment 3 mengxing cheng 2018-02-06 17:30:25 MST
Felip, thank you for your help! We use mysqld. I don't see any lost jobs.

# sacctmgr show lostjobs
Runaway Jobs: No runaway jobs found

A partially updated database is the problem I'm worried about. slurmdbd reported a lot of errors like the following on Feb 5th, though it is fine now. Could you check whether those errors are related to the upgrade or to some other problem?

[2018-02-05T10:17:49.454] error: mysql_query failed: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx' at line 1
insert into "midway_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (42890296, UNIX_TIMESTAMP(), 42890295, 0, 0, 4294967294, 36730, 172, 425015474, 425015474, 'midway2-0190', 0, 2160, 1517847464, 1517847464, 1517847464, 'angle', 0, 1, 131946, 1, 1, 9223372036854783808, 'pi-dinner', 'broadwl', '361', NULL, 0, '1=1,2=8000,3=18446744073709551614,4=1,5=1', '1=1,2=8000,4=1', '/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=36730, id_user=425015474, id_group=425015474, nodelist='midway2-0190', id_resv=0, timelimit=2160, time_submit=1517847464, time_eligible=1517847464, time_start=1517847464, mod_time=UNIX_TIMESTAMP(), job_name='angle', track_steps=0, id_qos=172, state=greatest(state, 1), priority=131946, cpus_req=1, nodes_alloc=1, mem_req=9223372036854783808, id_array_job=42890295, id_array_task=0, pack_job_id=0, pack_job_offset=4294967294, account='pi-dinner', `partition`='broadwl', node_inx='361', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=8000,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=8000,4=1', work_dir='/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions'
[2018-02-05T10:17:49.455] error: It looks like the storage has gone away trying to reconnect
[2018-02-05T10:17:49.455] error: mysql_query failed: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx' at line 1
insert into "midway_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (42890296, UNIX_TIMESTAMP(), 42890295, 0, 0, 4294967294, 36730, 172, 425015474, 425015474, 'midway2-0190', 0, 2160, 1517847464, 1517847464, 1517847464, 'angle', 0, 1, 131946, 1, 1, 9223372036854783808, 'pi-dinner', 'broadwl', '361', NULL, 0, '1=1,2=8000,3=18446744073709551614,4=1,5=1', '1=1,2=8000,4=1', '/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=36730, id_user=425015474, id_group=425015474, nodelist='midway2-0190', id_resv=0, timelimit=2160, time_submit=1517847464, time_eligible=1517847464, time_start=1517847464, mod_time=UNIX_TIMESTAMP(), job_name='angle', track_steps=0, id_qos=172, state=greatest(state, 1), priority=131946, cpus_req=1, nodes_alloc=1, mem_req=9223372036854783808, id_array_job=42890295, id_array_task=0, pack_job_id=0, pack_job_offset=4294967294, account='pi-dinner', `partition`='broadwl', node_inx='361', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=8000,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=8000,4=1', work_dir='/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions'
[2018-02-05T10:17:49.456] error: mysql_query failed: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx' at line 1
insert into "midway_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (42890297, UNIX_TIMESTAMP(), 42890295, 1, 0, 4294967294, 36730, 172, 425015474, 425015474, 'midway2-0221', 0, 2160, 1517847464, 1517847464, 1517847464, 'angle', 0, 1, 131946, 1, 1, 9223372036854783808, 'pi-dinner', 'broadwl', '380', NULL, 0, '1=1,2=8000,3=18446744073709551614,4=1,5=1', '1=1,2=8000,4=1', '/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=36730, id_user=425015474, id_group=425015474, nodelist='midway2-0221', id_resv=0, timelimit=2160, time_submit=1517847464, time_eligible=1517847464, time_start=1517847464, mod_time=UNIX_TIMESTAMP(), job_name='angle', track_steps=0, id_qos=172, state=greatest(state, 1), priority=131946, cpus_req=1, nodes_alloc=1, mem_req=9223372036854783808, id_array_job=42890295, id_array_task=1, pack_job_id=0, pack_job_offset=4294967294, account='pi-dinner', `partition`='broadwl', node_inx='380', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=8000,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=8000,4=1', work_dir='/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions'
[2018-02-05T10:17:49.456] error: It looks like the storage has gone away trying to reconnect 


I don't see errors in mysqld.log, but there are a great many entries like the following in mysql-slow.log:
# Query_time: 0.000180  Lock_time: 0.000059 Rows_sent: 0  Rows_examined: 0
SET timestamp=1517868102;
select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) && (t1.user='abourassa') order by wckey_name, user;
# User@Host: slurm[slurm] @ midway-mgt.rcc.uchicago.edu [128.135.112.70]
# Query_time: 0.000179  Lock_time: 0.000049 Rows_sent: 0  Rows_examined: 0
SET timestamp=1517868104;
select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) && (t1.user='abrax') order by wckey_name, user;
Comment 4 Felip Moll 2018-02-07 05:17:59 MST
Hi Mengxing,

(In reply to mengxing cheng from comment #3)
> Felip, thank you for help! We use mysqld. I don't see lostjobs. 
> 
> # sacctmgr show lostjobs
> Runaway Jobs: No runaway jobs found
> 

This is good. If you didn't see the database crash, the update was probably fine.


> Partially updated database is a problem I worry about.

You can just do what I said in my first comment and compare the list of jobs from the old and new databases; then you will be 100% sure that all the data was migrated. But I am confident that it was.

> Slurmdbd reported a
> lot following errors on Feb 05th though it is fine now. Could you check if
> those errors are related to the upgrade or some other problem. 
> 

The errors you report come from a bug: your user has a directory name containing a ' character, and we are not escaping it correctly:

"/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions"

I will start digging into it and will provide a fix for this issue. In the meantime, I recommend you look through the filesystem for unusual paths like this; they can cause problems not only with Slurm but also with other programs and utilities.
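For reference, the failure mode can be reproduced in miniature: the work_dir value ends up inside a single-quoted SQL string literal, so an embedded ' terminates the literal early and the rest of the path is parsed as SQL. Doubling the embedded quote keeps the literal intact. A small illustration of the escaping idea (a hypothetical helper, not the actual Slurm fix, which will land in the patch mentioned above):

```python
def sql_quote(value):
    """Wrap value in single quotes for a MySQL string literal, doubling any
    embedded single quotes so they cannot terminate the literal early."""
    return "'" + value.replace("'", "''") + "'"

# Abbreviated version of the directory name that triggered the bug:
work_dir = "A'-loop interactions"
# Unescaped, the ' after A would close the literal and break the INSERT;
# escaped, the whole path is one well-formed literal:
print(sql_quote(work_dir))
```

Parameterized queries avoid the problem entirely, but where literals are built by string concatenation, every user-supplied value has to go through escaping like this.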


> 
> I don't see error in mysqld.log but there are many many following logs in
> mysql-slow.log
> # Query_time: 0.000180  Lock_time: 0.000059 Rows_sent: 0  Rows_examined: 0
> SET timestamp=1517868102;
> select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from
> "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) &&
> (t1.user='abourassa') order by wckey_name, user;
> # User@Host: slurm[slurm] @ midway-mgt.rcc.uchicago.edu [128.135.112.70]
> # Query_time: 0.000179  Lock_time: 0.000049 Rows_sent: 0  Rows_examined: 0
> SET timestamp=1517868104;
> select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from
> "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) &&
> (t1.user='abrax') order by wckey_name, user;

This indicates that your MySQL is slow. Check comment #2 and tell me what your values in my.cnf are; also tell me which underlying filesystem type and hardware the database runs on. You definitely need some MySQL tuning.

Is your database used for anything other than Slurm?
Comment 7 mengxing cheng 2018-02-07 14:24:25 MST
Felip, 

The problem of a directory path containing the special ' character has been fixed; thanks for pointing it out. I am checking the database, but I'm glad to know you are confident the upgrade succeeded.


The filesystem is ext4 on top of DRBD (Distributed Replicated Block Device) HA replication built from mechanical HDDs. I agree that this is not a fast storage backend for a database. The database server also hosts the slurmdbd of another very small cluster of 150 nodes.

Here is my.cnf
# cat /etc/my.cnf 
[mysqld]

#port = 8881
#skip_grant_tables=1
#innodb_force_recovery=1
#innodb_purge_threads=0

datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

log-bin=/var/lib/mysql/mysql-bin.log
relay-log=/var/lib/mysql/mysql-relay-bin.log
server-id=10

query_cache_type=1

key_buffer              = 256M

expire_logs_days        = 3

innodb_file_per_table
innodb_buffer_pool_size=4G
innodb_log_file_size=64M
#innodb_log_file_size=128M
#innodb_lock_wait_timeout=400
innodb_lock_wait_timeout=900
thread_cache_size=4
table_open_cache=512

query_cache_size = 64M
query_cache_limit = 32M
join_buffer_size = 32M

tmp_table_size = 1G
max_heap_table_size = 1G

slow-query-log = 1
slow-query-log-file = /var/log/mysql-slow.log
long_query_time = 1
log-queries-not-using-indexes

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
Comment 8 Felip Moll 2018-02-09 08:54:29 MST
(In reply to mengxing cheng from comment #7)
> Felip, 
> 
> The problem of directory path including special ' character has been fixed.
> Thanks for pointing it out. I am checking database but glad to know that you
> are confident in the upgrade success. 
> 
> 
> The filesystem is ext4 on top of drbd (Distributed Replicated Block Device)
> HA replication constructed from mechanic HDDs. I agree that they are not
> fast backend storage system for database. The database server hosts a
> slurmdbd of another very small cluster of 150 nodes. 
> 
> Here is my.cnf
> # cat /etc/my.cnf 
> [mysqld]
> 
> #port = 8881
> #skip_grant_tables=1
> #innodb_force_recovery=1
> #innodb_purge_threads=0
> 
> datadir=/var/lib/mysql
> socket=/var/lib/mysql/mysql.sock
> user=mysql
> # Disabling symbolic-links is recommended to prevent assorted security risks
> symbolic-links=0
> 
> log-bin=/var/lib/mysql/mysql-bin.log
> relay-log=/var/lib/mysql/mysql-relay-bin.log
> server-id=10
> 
> query_cache_type=1
> 
> key_buffer              = 256M
> 
> expire_logs_days        = 3
> 
> innodb_file_per_table
> innodb_buffer_pool_size=4G
> innodb_log_file_size=64M
> #innodb_log_file_size=128M
> #innodb_lock_wait_timeout=400
> innodb_lock_wait_timeout=900
> thread_cache_size=4
> table_open_cache=512
> 
> query_cache_size = 64M
> query_cache_limit = 32M
> join_buffer_size = 32M
> 
> tmp_table_size = 1G
> max_heap_table_size = 1G
> 
> slow-query-log = 1
> slow-query-log-file = /var/log/mysql-slow.log
> long_query_time = 1
> log-queries-not-using-indexes
> 
> [mysqld_safe]
> log-error=/var/log/mysqld.log
> pid-file=/var/run/mysqld/mysqld.pid


Given this information, I think innodb_log_file_size could be increased. I see a commented-out value of 128MB; did you try that change in the past? DRBD over mechanical disks is not ideal for MySQL, but tuning these values will probably help.


So, I recommend that you:

1. Tune the DB values in my.cnf as noted in the previous comments.
2. Run the job queries on the old and new databases.
3. The patch is ready and pending review and commit; once it is in, I'll let you know the commit # in case you want to apply it.

Regards,
Felip
Comment 9 mengxing cheng 2018-02-09 10:35:35 MST
Felip, 

Thank you very much for your help. I can wait for a new release for the special-character issue since it is not urgent. You can close this ticket if you like.



Mengxing